Re: [PATCH v9] mm/mempolicy: Weighted Interleave Auto-tuning

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 20.05.25 16:12, Joshua Hahn wrote:
On machines with multiple memory nodes, interleaving page allocations
across nodes allows for better utilization of each node's bandwidth.
Previous work by Gregory Price [1] introduced weighted interleave, which
allowed for pages to be allocated across nodes according to user-set ratios.

Ideally, these weights should be proportional to their bandwidth, so
that under bandwidth pressure, each node uses its maximal efficient
bandwidth and prevents latency from increasing exponentially.

Previously, weighted interleave's default weights were just 1s -- which
would be equivalent to the (unweighted) interleave mempolicy, which goes
through the nodes in a round-robin fashion, ignoring bandwidth information.

This patch has two main goals:
First, it makes weighted interleave easier to use for users who wish to
relieve bandwidth pressure when using nodes with varying bandwidth (CXL).
By providing a set of "real" default weights that just work out of the
box, users who might not have the capability (or wish to) perform
experimentation to find the most optimal weights for their system can
still take advantage of bandwidth-informed weighted interleave.

Second, it allows for weighted interleave to dynamically adjust to
hotplugged memory with new bandwidth information. Instead of manually
updating node weights every time new bandwidth information is reported
or taken off, weighted interleave adjusts and provides a new set of
default weights for weighted interleave to use when there is a change
in bandwidth information.

To meet these goals, this patch introduces an auto-configuration mode
for the interleave weights that provides a reasonable set of default
weights, calculated using bandwidth data reported by the system. In auto
mode, weights are dynamically adjusted based on whatever the current
bandwidth information reports (and responds to hotplug events).

This patch still supports users manually writing weights into the nodeN
sysfs interface by entering into manual mode. When a user enters manual
mode, the system stops dynamically updating any of the node weights,
even during hotplug events that shift the optimal weight distribution.

A new sysfs interface "auto" is introduced, which allows users to switch
between the auto (writing 1 or Y) and manual (writing 0 or N) modes. The
system also automatically enters manual mode when a nodeN interface is
manually written to.

There is one functional change that this patch makes to the existing
weighted_interleave ABI: previously, writing 0 directly to a nodeN
interface was said to reset the weight to the system default. Before
this patch, the default for all weights were 1, which meant that writing
0 and 1 were functionally equivalent. With this patch, writing 0 is invalid.

[1] https://lore.kernel.org/linux-mm/20240202170238.90004-1-gregory.price@xxxxxxxxxxxx/

Suggested-by: Yunjeong Mun <yunjeong.mun@xxxxxx>
Suggested-by: Oscar Salvador <osalvador@xxxxxxx>
Suggested-by: Ying Huang <ying.huang@xxxxxxxxxxxxxxxxx>
Suggested-by: Harry Yoo <harry.yoo@xxxxxxxxxx>
Tested-by: Honggyu Kim <honggyu.kim@xxxxxx>
Reviewed-by: Honggyu Kim <honggyu.kim@xxxxxx>
Reviewed-by: Harry Yoo <harry.yoo@xxxxxxxxxx>
Reviewed-by: Huang Ying <ying.huang@xxxxxxxxxxxxxxxxx>
Co-developed-by: Gregory Price <gourry@xxxxxxxxxx>
Signed-off-by: Gregory Price <gourry@xxxxxxxxxx>
Signed-off-by: Joshua Hahn <joshua.hahnjy@xxxxxxxxx>
---


[...]

-static void iw_table_free(void)
+static void wi_state_free(void)
  {
-	u8 *old;
+	struct weighted_interleave_state *old_wi_state;
- mutex_lock(&iw_table_lock);
-	old = rcu_dereference_protected(iw_table,
-					lockdep_is_held(&iw_table_lock));
-	rcu_assign_pointer(iw_table, NULL);
-	mutex_unlock(&iw_table_lock);
+	mutex_lock(&wi_state_lock);
+
+	old_wi_state = rcu_dereference_protected(wi_state,
+			lockdep_is_held(&wi_state_lock));
+	if (!old_wi_state) {
+		mutex_unlock(&wi_state_lock);
+		goto out;
+	}
+ rcu_assign_pointer(wi_state, NULL);
+	mutex_unlock(&wi_state_lock);

Just one nit: if written as:

...
rcu_assign_pointer(wi_state, NULL);
mutex_unlock(&wi_state_lock);

old_wi_state = ...
if (old_wi_state) {
	synchronize_rcu();
	kfree(old_wi_state);
}
kfree(&wi_group->wi_kobj);

You can easily avoid the goto.

  	synchronize_rcu();
-	kfree(old);
+	kfree(old_wi_state);
+out:
+	kfree(&wi_group->wi_kobj);
  }

I'll note that this rather unrelated churn (renaming functions + variables) is a bit abd for review as it adds noise. Having that as part of a cleanup patch might have been better.

Nothing else jumped at me (did not an in-depth review of the logic)

Acked-by: David Hildenbrand <david@xxxxxxxxxx>

--
Cheers,

David / dhildenb





[Index of Archives]     [Linux IBM ACPI]     [Linux Power Management]     [Linux Kernel]     [Linux Laptop]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Video 4 Linux]     [Device Mapper]     [Linux Resources]
  Powered by Linux