Hi Joshua,

On Fri, Jun 13, 2025 at 10:25 AM Joshua Hahn <joshua.hahnjy@xxxxxxxxx> wrote:
>
> On Thu, 12 Jun 2025 13:13:26 -0500 Bijan Tabatabai <bijan311@xxxxxxxxx> wrote:
>
> > From: Bijan Tabatabai <bijantabatab@xxxxxxxxxx>
> >
> > A recent patch set automatically set the interleave weight for each node
> > according to the node's maximum bandwidth [1]. In another thread, the
> > patch set's author, Joshua Hahn, wondered if/how these weights should be
> > changed if the bandwidth utilization of the system changes [2].
>
> Hi Bijan,
>
> Thank you for this patchset, and thank you for finding interest in my
> question!
>
> > This patch set adds the mechanism for dynamically changing how
> > application data is interleaved across nodes, while leaving the policy
> > of what the interleave weights should be to userspace. It does this by
> > adding a new DAMOS action: DAMOS_INTERLEAVE. We implement
> > DAMOS_INTERLEAVE with both the paddr and vaddr operations sets. Using
> > the paddr version is useful for managing page placement globally. Using
> > the vaddr version limits tracking to one process per kdamond instance,
> > but the va-based tracking better captures spatial locality.
> >
> > DAMOS_INTERLEAVE interleaves pages within a region across nodes using
> > the interleave weights at
> > /sys/kernel/mm/mempolicy/weighted_interleave/node<N> and the page
> > placement algorithm in weighted_interleave_nid via policy_nodemask. We
> > chose to reuse the mempolicy weighted interleave infrastructure to
> > avoid reimplementing code. However, this has the awkward side effect
> > that only pages that are mapped to processes using
> > MPOL_WEIGHTED_INTERLEAVE will be migrated according to new interleave
> > weights. This might be fine because workloads that want their data to
> > be dynamically interleaved will want their newly allocated data to be
> > interleaved at the same ratio.
>
> I think this is generally true. Maybe until a user says that they have a
> use case where they would like to have a non-weighted-interleave policy
> to allocate pages, but would like to place them according to a set
> weight, we can leave support for other mempolicies out for now.
>
> > If exposing policy_nodemask is undesirable, we have two alternative
> > methods for having DAMON access the interleave weights it should use.
> > We would appreciate feedback on which method is preferred.
> > 1. Use mpol_misplaced instead
> >    pros: mpol_misplaced is already exposed publicly
> >    cons: Would require refactoring mpol_misplaced to take a struct
> >    vm_area_struct instead of a struct vm_fault, and require refactoring
> >    mpol_misplaced and get_vma_policy to take in a struct task_struct
> >    rather than just using current. Also requires processes to use
> >    MPOL_WEIGHTED_INTERLEAVE.
> > 2. Add a new field to struct damos, similar to target_nid for the
> >    MIGRATE_HOT/COLD schemes.
> >    pros: Keeps changes contained inside DAMON. Would not require
> >    processes to use MPOL_WEIGHTED_INTERLEAVE.
> >    cons: Duplicates page placement code. Requires discussion on the
> >    sysfs interface to use for users to pass in the interleave weights.
>
> Here I agree with SJ's sentiment -- I think mpol_misplaced runs with the
> context of working with current / fault contexts, like you pointed out.
> Perhaps it is best to keep the scope of the changes as local as
> possible :-)
> As for duplicating page placement code, I think that is something we can
> refine over iterations of this patchset, and maybe SJ will have some
> great ideas on how this can best be done as well.

David Hildenbrand responded to this and proposed adding a new function
that just returns the nid a folio should go on based on its mempolicy. I
think that's probably the best way to go for now.
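To make the idea concrete, here is a rough sketch of the shape I have in
mind for such a helper. The name mpol_folio_target_nid and its exact
signature are mine, not David's, and the sketch assumes policy_nodemask()
is reachable from this code (which is what this series already relies
on):

/*
 * Sketch only: return the node id the folio should be placed on
 * according to the vma's mempolicy, or NUMA_NO_NODE if the policy
 * expresses no weighted-interleave preference.
 */
static int mpol_folio_target_nid(struct folio *folio,
				 struct vm_area_struct *vma,
				 unsigned long addr)
{
	struct mempolicy *pol;
	pgoff_t ilx;
	int nid = NUMA_NO_NODE;

	pol = get_vma_policy(vma, addr, folio_order(folio), &ilx);
	/*
	 * policy_nodemask() resolves weighted-interleave placement
	 * via weighted_interleave_nid().
	 */
	if (pol->mode == MPOL_WEIGHTED_INTERLEAVE)
		policy_nodemask(GFP_KERNEL, pol, ilx, &nid);
	mpol_cond_put(pol);
	return nid;
}

The DAMOS_INTERLEAVE action could then compare the returned nid against
folio_nid(folio) and only queue the folio for migration when the two
differ.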
I think the common case would want the weights used by this and mempolicy
to be the same. However, if there is a use case where different weights
are desired, I don't mind coming back and adding that functionality.

> > This patchset was tested on an AMD machine with a NUMA node with CPUs
> > attached to DDR memory and a cpu-less NUMA node attached to CXL
> > memory. However, this patch set should generalize to other
> > architectures and numbers of NUMA nodes.
>
> I think moving the test results to the cover letter will help reviewers
> better understand the intent of the work. I also think it will be very
> helpful to include some potential use cases in here as well. That is,
> what workloads would benefit from placing pages according to a set
> ratio, rather than using existing migration policies that adjust this
> based on hotness / coldness?

Noted. I will be sure to include that in the next revision.

> One such use case that I can think of is using this patchset + weighted
> interleave auto-tuning, which would help alleviate bandwidth limitations
> by ensuring that past the allocation stage, pages are being accessed in
> a way that maximizes the bandwidth usage of the system (at the cost of
> latency, which may or may not even be true based on how bandwidth-bound
> the workload is).

This was the exact use case I envisioned for this patch. I talk about it
in more detail in my reply to SeongJae.

> Thank you again for the amazing patchset! Have a great day :-)
> Joshua

I appreciate you taking the time to respond,
Bijan