On Thu, 12 Jun 2025 13:13:26 -0500 Bijan Tabatabai <bijan311@xxxxxxxxx> wrote:

> From: Bijan Tabatabai <bijantabatab@xxxxxxxxxx>
>
> A recent patch set automatically set the interleave weight for each node
> according to the node's maximum bandwidth [1]. In another thread, the patch
> set's author, Joshua Hahn, wondered if/how these weights should be changed
> if the bandwidth utilization of the system changes [2].

Hi Bijan,

Thank you for this patchset, and thank you for finding interest in my
question!

> This patch set adds the mechanism for dynamically changing how application
> data is interleaved across nodes while leaving the policy of what the
> interleave weights should be to userspace. It does this by adding a new
> DAMOS action: DAMOS_INTERLEAVE. We implement DAMOS_INTERLEAVE with both
> paddr and vaddr operations sets. Using the paddr version is useful for
> managing page placement globally. Using the vaddr version limits tracking
> to one process per kdamond instance, but the va based tracking better
> captures spatial locality.
>
> DAMOS_INTERLEAVE interleaves pages within a region across nodes using the
> interleave weights at /sys/kernel/mm/mempolicy/weighted_interleave/node<N>
> and the page placement algorithm in weighted_interleave_nid via
> policy_nodemask. We chose to reuse the mempolicy weighted interleave
> infrastructure to avoid reimplementing code. However, this has the awkward
> side effect that only pages that are mapped to processes using
> MPOL_WEIGHTED_INTERLEAVE will be migrated according to new interleave
> weights. This might be fine because workloads that want their data to be
> dynamically interleaved will want their newly allocated data to be
> interleaved at the same ratio.

I think this is generally true.
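For readers who have not looked at weighted_interleave_nid(), the placement
rule the quoted text reuses is essentially a weighted round robin over the
nodes. The following is a userspace sketch of that rule, illustrative only
and not the kernel code:

```python
# Illustrative sketch (not kernel code) of the placement rule behind
# weighted_interleave_nid(): pages of an interleaved range are spread
# across nodes in proportion to their interleave weights, e.g. as read
# from /sys/kernel/mm/mempolicy/weighted_interleave/node<N>.

def pick_node(weights, page_idx):
    """Return the target node for the page_idx-th page, given a list of
    per-node interleave weights."""
    total = sum(weights)
    r = page_idx % total  # position within one full weight cycle
    for node, w in enumerate(weights):
        if r < w:
            return node
        r -= w

# With weights [3, 1], three of every four pages land on node 0:
placement = [pick_node([3, 1], i) for i in range(8)]
print(placement)  # -> [0, 0, 0, 1, 0, 0, 0, 1]
```

The real kernel logic additionally filters nodes through policy_nodemask
and handles the cgroup/task context, which this sketch ignores.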
Maybe until a user says that they have a use case where they would like a
non-weighted-interleave policy to allocate pages, but would still like to
place them according to a set weight, we can leave support for other
mempolicies out for now.

> If exposing policy_nodemask is undesirable, we have two alternative methods
> for having DAMON access the interleave weights it should use. We would
> appreciate feedback on which method is preferred.
> 1. Use mpol_misplaced instead
>    pros: mpol_misplaced is already exposed publicly
>    cons: Would require refactoring mpol_misplaced to take a struct vm_area
>    instead of a struct vm_fault, and require refactoring mpol_misplaced and
>    get_vma_policy to take in a struct task_struct rather than just using
>    current. Also requires processes to use MPOL_WEIGHTED_INTERLEAVE.
> 2. Add a new field to struct damos, similar to target_nid for the
>    MIGRATE_HOT/COLD schemes.
>    pros: Keeps changes contained inside DAMON. Would not require processes
>    to use MPOL_WEIGHTED_INTERLEAVE.
>    cons: Duplicates page placement code. Requires discussion on the sysfs
>    interface to use for users to pass in the interleave weights.

Here I agree with SJ's sentiment -- I think mpol_misplaced runs with the
context of working with current / fault contexts, like you pointed out.
Perhaps it is best to keep the scope of the changes as local as
possible :-)

As for duplicating the page placement code, I think that is something we
can refine over iterations of this patchset, and maybe SJ will have some
great ideas on how this can best be done as well.

> This patchset was tested on an AMD machine with a NUMA node with CPUs
> attached to DDR memory and a cpu-less NUMA node attached to CXL memory.
> However, this patch set should generalize to other architectures and
> numbers of NUMA nodes.

I think moving the test results to the cover letter will help reviewers
better understand the intent of the work.
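To help frame the discussion of alternative 2 above, here is a hypothetical
sketch of what a scheme carrying its own per-node weights might look like.
All names here are made up for illustration -- this is not the actual
struct damos layout or the DAMON API:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of alternative 2: a DAMOS scheme carrying its own
# interleave weights next to the existing target_nid. Field names are
# illustrative only; this is not the real struct damos.
@dataclass
class DamosSchemeSketch:
    action: str
    target_nid: int = -1  # existing: single target for MIGRATE_HOT/COLD
    node_weights: dict = field(default_factory=dict)  # hypothetical: nid -> weight

    def weight_for(self, nid):
        # Nodes with no configured weight receive no pages.
        return self.node_weights.get(nid, 0)

scheme = DamosSchemeSketch(action="DAMOS_INTERLEAVE", node_weights={0: 3, 1: 1})
print(scheme.weight_for(0), scheme.weight_for(2))  # -> 3 0
```

The open question the cover letter raises would then be how userspace sets
node_weights, e.g. through a new per-scheme sysfs file.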
Also, I think it will be very helpful to include some potential use cases
in here as well. That is, what workloads would benefit from placing pages
according to a set ratio, rather than using existing migration policies
that adjust placement based on hotness / coldness?

One such use case that I can think of is using this patchset + weighted
interleave auto-tuning, which would help alleviate bandwidth limitations
by ensuring that, past the allocation stage, pages are being accessed in a
way that maximizes the bandwidth usage of the system (at the cost of
latency, which may or may not even be true based on how bandwidth-bound
the workload is).

Thank you again for the amazing patchset! Have a great day :-)
Joshua

Sent using hkml (https://github.com/sjp38/hackermail)