Hi Joshua,

On Fri, Jun 13, 2025 at 10:25 AM Joshua Hahn <joshua.hahnjy@xxxxxxxxx> wrote:
>
> On Thu, 12 Jun 2025 13:13:26 -0500 Bijan Tabatabai <bijan311@xxxxxxxxx> wrote:
>
> > From: Bijan Tabatabai <bijantabatab@xxxxxxxxxx>
> >
> > A recent patch set automatically set the interleave weight for each node
> > according to the node's maximum bandwidth [1]. In another thread, the
> > patch set's author, Joshua Hahn, wondered if/how these weights should be
> > changed if the bandwidth utilization of the system changes [2].
>
> Hi Bijan,
>
> Thank you for this patchset, and thank you for finding interest in my
> question!
>
> > This patch set adds the mechanism for dynamically changing how
> > application data is interleaved across nodes, while leaving the policy
> > of what the interleave weights should be to userspace. It does this by
> > adding a new DAMOS action: DAMOS_INTERLEAVE. We implement
> > DAMOS_INTERLEAVE with both the paddr and vaddr operations sets. Using
> > the paddr version is useful for managing page placement globally. Using
> > the vaddr version limits tracking to one process per kdamond instance,
> > but the va-based tracking better captures spatial locality.
> >
> > DAMOS_INTERLEAVE interleaves pages within a region across nodes using
> > the interleave weights at
> > /sys/kernel/mm/mempolicy/weighted_interleave/node<N> and the page
> > placement algorithm in weighted_interleave_nid via policy_nodemask. We
> > chose to reuse the mempolicy weighted interleave infrastructure to
> > avoid reimplementing code. However, this has the awkward side effect
> > that only pages that are mapped to processes using
> > MPOL_WEIGHTED_INTERLEAVE will be migrated according to new interleave
> > weights. This might be fine because workloads that want their data to
> > be dynamically interleaved will want their newly allocated data to be
> > interleaved at the same ratio.
>
> I think this is generally true. Maybe until a user says that they have a
> use case where they would like to have a non-weighted-interleave policy
> to allocate pages, but would like to place them according to a set
> weight, we can leave support for other mempolicies out for now.
>
> > If exposing policy_nodemask is undesirable, we have two alternative
> > methods for having DAMON access the interleave weights it should use.
> > We would appreciate feedback on which method is preferred.
> > 1. Use mpol_misplaced instead
> >    pros: mpol_misplaced is already exposed publicly
> >    cons: Would require refactoring mpol_misplaced to take a struct
> >    vm_area_struct instead of a struct vm_fault, and require refactoring
> >    mpol_misplaced and get_vma_policy to take in a struct task_struct
> >    rather than just using current. Also requires processes to use
> >    MPOL_WEIGHTED_INTERLEAVE.
> > 2. Add a new field to struct damos, similar to target_nid for the
> >    MIGRATE_HOT/COLD schemes.
> >    pros: Keeps changes contained inside DAMON. Would not require
> >    processes to use MPOL_WEIGHTED_INTERLEAVE.
> >    cons: Duplicates page placement code. Requires discussion on the
> >    sysfs interface to use for users to pass in the interleave weights.
>
> Here I agree with SJ's sentiment -- I think mpol_misplaced runs with the
> context of working with current / fault contexts, like you pointed out.
> Perhaps it is best to keep the scope of the changes as local as
> possible :-)
> As for duplicating page placement code, I think that is something we can
> refine over iterations of this patchset, and maybe SJ will have some
> great ideas on how this can best be done as well.

David Hildenbrand responded to this and proposed adding a new function
that just returns the nid a folio should go on based on its mempolicy. I
think that's probably the best way to go for now.
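To make the idea concrete, here is a rough sketch of the shape I have in
mind for such a helper. The name mpol_folio_target_nid and its exact
signature are mine, not David's, and the sketch assumes policy_nodemask()
is reachable from this code (which is what this series already relies
on):

/*
 * Sketch only: return the node id the folio should be placed on
 * according to the vma's mempolicy, or NUMA_NO_NODE if the policy
 * expresses no weighted-interleave preference.
 */
static int mpol_folio_target_nid(struct folio *folio,
				 struct vm_area_struct *vma,
				 unsigned long addr)
{
	struct mempolicy *pol;
	pgoff_t ilx;
	int nid = NUMA_NO_NODE;

	pol = get_vma_policy(vma, addr, folio_order(folio), &ilx);
	/*
	 * policy_nodemask() resolves weighted-interleave placement
	 * via weighted_interleave_nid().
	 */
	if (pol->mode == MPOL_WEIGHTED_INTERLEAVE)
		policy_nodemask(GFP_KERNEL, pol, ilx, &nid);
	mpol_cond_put(pol);
	return nid;
}

The DAMOS_INTERLEAVE action could then compare the returned nid against
folio_nid(folio) and only queue the folio for migration when the two
differ.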
I think the common case would want the weights used by this and mempolicy
to be the same. However, if there is a use case where different weights
are desired, I don't mind coming back and adding that functionality.

> > This patchset was tested on an AMD machine with a NUMA node with CPUs
> > attached to DDR memory and a cpu-less NUMA node attached to CXL
> > memory. However, this patch set should generalize to other
> > architectures and numbers of NUMA nodes.
>
> I think moving the test results to the cover letter will help reviewers
> better understand the intent of the work. I also think it will be very
> helpful to include some potential use cases in here as well. That is,
> what workloads would benefit from placing pages according to a set
> ratio, rather than using existing migration policies that adjust this
> based on hotness / coldness?

Noted. I will be sure to include that in the next revision.

> One such use case that I can think of is using this patchset + weighted
> interleave auto-tuning, which would help alleviate bandwidth limitations
> by ensuring that past the allocation stage, pages are being accessed in
> a way that maximizes the bandwidth usage of the system (at the cost of
> latency, which may or may not even be true based on how bandwidth-bound
> the workload is).

This was the exact use case I envisioned for this patch. I talk about it
in more detail in my reply to SeongJae.

> Thank you again for the amazing patchset! Have a great day :-)
> Joshua

I appreciate you taking the time to respond,
Bijan