On Tue, May 27, 2025 at 5:27 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 27.05.25 10:40, Yafang Shao wrote:
> > On Tue, May 27, 2025 at 4:30 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >>
> >>>> I don't think we want to add such a mechanism (new mode) where the
> >>>> primary configuration mechanism is through bpf.
> >>>>
> >>>> Maybe bpf could be used as an alternative, but we should look into a
> >>>> reasonable alternative first, like the discussed mctrl()/.../ raised in
> >>>> the process_madvise() series.
> >>>>
> >>>> No "bpf" mode in disguise, please :)
> >>>
> >>> This goal can be readily achieved using a BPF program. In any case, it
> >>> is a feasible solution.
> >>
> >> No BPF-only solution.
> >>
> >>>
> >>>>
> >>>>> We could define
> >>>>> the API as follows:
> >>>>>
> >>>>> struct bpf_thp_ops {
> >>>>>         /**
> >>>>>          * @task_thp_mode: Get the THP mode for a specific task
> >>>>>          *
> >>>>>          * Return:
> >>>>>          * - TASK_THP_ALWAYS:  "always" mode
> >>>>>          * - TASK_THP_MADVISE: "madvise" mode
> >>>>>          * - TASK_THP_NEVER:   "never" mode
> >>>>>          * Future modes can also be added.
> >>>>>          */
> >>>>>         int (*task_thp_mode)(struct task_struct *p);
> >>>>> };
> >>>>>
> >>>>> For observability, we could add a "THP mode" field to
> >>>>> /proc/[pid]/status. For example:
> >>>>>
> >>>>>   $ grep "THP mode" /proc/123/status
> >>>>>   always
> >>>>>   $ grep "THP mode" /proc/456/status
> >>>>>   madvise
> >>>>>   $ grep "THP mode" /proc/789/status
> >>>>>   never
> >>>>>
> >>>>> The THP mode for each task would be determined by the attached BPF
> >>>>> program based on the task's attributes. We would place the BPF hook in
> >>>>> appropriate kernel functions. Note that this setting wouldn't be
> >>>>> inherited during fork/exec - the BPF program would make the decision
> >>>>> dynamically for each task.
> >>>>
> >>>> What would be the mode (default) when the bpf program would not be active?
> >>>>
> >>>>> This approach also enables runtime adjustments to THP modes based on
> >>>>> system-wide conditions, such as memory fragmentation or other
> >>>>> performance overheads. The BPF program could adapt policies
> >>>>> dynamically, optimizing THP behavior in response to changing
> >>>>> workloads.
> >>>>
> >>>> I am not sure that is the proper way to handle these scenarios: I never
> >>>> heard that people would be adjusting the system-wide policy dynamically
> >>>> in that way either.
> >>>>
> >>>> Whatever we do, we have to make sure that what we add won't
> >>>> over-complicate things in the future. Having tooling dynamically adjust
> >>>> the THP policy of processes that coarsely sounds ... very wrong long-term.
> >>>
> >>> This is just an example demonstrating how BPF can be used to adjust
> >>> its flexibility. Notably, all these policies can be implemented
> >>> without modifying the kernel.
> >>
> >> See below on "policy".
> >>
> >>>
> >>>>
> >>>>> As Liam pointed out in another thread, naming is challenging here -
> >>>>> "process" might not be the most accurate term for this context.
> >>>>
> >>>> No, it's not even a per-process thing. It is per MM, and a MM might be
> >>>> used by multiple processes ...
> >>>
> >>> I consistently use 'thread' for the latter case.
> >>
> >> You can use CLONE_VM without CLONE_THREAD ...
> >
> > If I understand correctly, this can only occur for shared THP but not
> > anonymous THP. For instance, if either process allocates an anonymous
> > THP, it would trigger the creation of a new MM. Please correct me if
> > I'm mistaken.
>
> What clone(CLONE_VM) will do is essentially create a new process, that
> shares the MM with the original process. Similar to a thread, just that
> the new process will show up in /proc/ as ... a new process, not as a
> thread under /prod/$pid/tasks of the original process.
>
> Both processes will operate on the shared MM struct as if they were
> ordinary threads. No Copy-on-Write involved.
>
> One example use case I've been involved in is async teardown in QEMU [1].
>
> [1] https://kvm-forum.qemu.org/2022/ibm_async_destroy.pdf

I understand what you mean, but what I'm really confused about is how
this relates to allocating anonymous THP. If either one allocates anon
THP, it will definitely create a new MM, right?
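Just to make sure we are talking about the same scenario, this is roughly
what I have in mind (an untested sketch, error handling omitted): one
process clone()s with CLONE_VM but without CLONE_THREAD, and the new
process then faults in an anonymous mapping that is a THP candidate.

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define STACK_SIZE	(1024 * 1024)
#define MAP_LEN		(4UL << 20)	/* 4 MiB anonymous mapping */

/* Runs in a new process that shares the caller's MM (no CoW). */
static int worker(void *arg)
{
	char *buf = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	madvise(buf, MAP_LEN, MADV_HUGEPAGE);
	memset(buf, 1, MAP_LEN);	/* fault it in; THP candidate */
	return 0;
}

int main(void)
{
	char *stack = malloc(STACK_SIZE);
	/* New pid in /proc, but the same mm_struct as the parent. */
	pid_t pid = clone(worker, stack + STACK_SIZE, CLONE_VM | SIGCHLD, NULL);

	waitpid(pid, NULL, 0);
	return 0;
}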
> >
> >>
> >>> Additionally, this
> >>> can be implemented per-MM without kernel code modifications.
> >>> With a well-designed API, users can even implement custom THP
> >>> policies—all without altering kernel code.
> >>
> >> You can switch between modes, that' all you can do. I wouldn't really
> >> call that "custom policy" as it is extremely limited.
> >>
> >> And that's exactly my point: it's basic switching between modes ... a
> >> reasonable policy in the future will make placement decisions and not
> >> just state "always/never/madvise".
> >
> > Could you please elaborate further on 'make placement decisions'? As
> > previously mentioned, we (including the broader community) really need
> > the user input to determine whether THP allocation is appropriate in a
> > given case.
>
> The glorious future were we make smarter decisions where to actually
> place THPs even in the "always" mode.
>
> E.g., just because we enable "always" for a process does not mean that
> we really want a THP everywhere; quite the opposite.

So 'always' simply means "the system doesn't guarantee THP allocation
will succeed"? If that's the case, we should revisit RFC v1 [0], where
we proposed rejecting THP allocations in certain scenarios for specific
tasks.

[0] https://lwn.net/Articles/1019290/

>
> Treat the "always"/"madvise"/"never" as a rough mode, not a future-proof
> policy that we would want to fine-tune dynamically ... that would be
> very limiting.

--
Regards
Yafang