On 27.05.25 11:43, Yafang Shao wrote:
On Tue, May 27, 2025 at 5:27 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
On 27.05.25 10:40, Yafang Shao wrote:
On Tue, May 27, 2025 at 4:30 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
I don't think we want to add such a mechanism (new mode) where the
primary configuration mechanism is through bpf.
Maybe bpf could be used as an alternative, but we should look into a
reasonable alternative first, like the discussed mctrl()/.../ raised in
the process_madvise() series.
No "bpf" mode in disguise, please :)
This goal can be readily achieved using a BPF program. In any case, it
is a feasible solution.
No BPF-only solution.
We could define
the API as follows:
struct bpf_thp_ops {
	/**
	 * @task_thp_mode: Get the THP mode for a specific task
	 *
	 * Return:
	 * - TASK_THP_ALWAYS: "always" mode
	 * - TASK_THP_MADVISE: "madvise" mode
	 * - TASK_THP_NEVER: "never" mode
	 * Future modes can also be added.
	 */
	int (*task_thp_mode)(struct task_struct *p);
};
For observability, we could add a "THP mode" field to
/proc/[pid]/status. For example:
$ grep "THP mode" /proc/123/status
always
$ grep "THP mode" /proc/456/status
madvise
$ grep "THP mode" /proc/789/status
never
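For illustration, a rough sketch of how such a field could be emitted;
nothing below exists today, and get_task_thp_mode() is a made-up helper
that would query the attached BPF program (or the global default):

/* Hypothetical sketch only; placement in fs/proc/array.c is assumed. */
static const char * const thp_mode_names[] = { "always", "madvise", "never" };

static void task_thp_mode_show(struct seq_file *m, struct task_struct *p)
{
	seq_printf(m, "THP mode:\t%s\n", thp_mode_names[get_task_thp_mode(p)]);
}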
The THP mode for each task would be determined by the attached BPF
program based on the task's attributes. We would place the BPF hook in
appropriate kernel functions. Note that this setting wouldn't be
inherited during fork/exec - the BPF program would make the decision
dynamically for each task.
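To make this concrete, here is a rough sketch of what a policy program
against the proposed interface might look like. Everything below is an
assumption for illustration: the section name, the numeric TASK_THP_*
values, and the per-tgid map-based policy; struct bpf_thp_ops would have
to come from vmlinux.h once the kernel defined it.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* Assumed numeric values for the proposed return codes. */
#define TASK_THP_ALWAYS		0
#define TASK_THP_MADVISE	1
#define TASK_THP_NEVER		2

/* Userspace populates the desired mode per workload, keyed by tgid. */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, u32);	/* tgid */
	__type(value, u32);	/* TASK_THP_* */
} thp_mode_map SEC(".maps");

SEC("struct_ops/task_thp_mode")
int BPF_PROG(task_thp_mode, struct task_struct *p)
{
	u32 tgid = p->tgid;
	u32 *mode = bpf_map_lookup_elem(&thp_mode_map, &tgid);

	/* Fall back to "madvise" when no explicit policy is configured. */
	return mode ? *mode : TASK_THP_MADVISE;
}

SEC(".struct_ops.link")
struct bpf_thp_ops thp_policy = {
	.task_thp_mode = (void *)task_thp_mode,
};

char LICENSE[] SEC("license") = "GPL";

Userspace would then only need to update thp_mode_map for the workloads
it cares about.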
What would be the default mode when the BPF program is not active?
This approach also enables runtime adjustments to THP modes based on
system-wide conditions, such as memory fragmentation or other
performance overheads. The BPF program could adapt policies
dynamically, optimizing THP behavior in response to changing
workloads.
I am not sure that is the proper way to handle these scenarios: I have
never heard of people adjusting the system-wide policy dynamically in
that way either.
Whatever we do, we have to make sure that what we add won't
over-complicate things in the future. Having tooling dynamically adjust
the THP policy of processes that coarsely sounds ... very wrong long-term.
This is just an example demonstrating the flexibility that BPF
provides. Notably, all of these policies can be implemented without
modifying the kernel.
See below on "policy".
As Liam pointed out in another thread, naming is challenging here -
"process" might not be the most accurate term for this context.
No, it's not even a per-process thing. It is per MM, and a MM might be
used by multiple processes ...
I consistently use 'thread' for the latter case.
You can use CLONE_VM without CLONE_THREAD ...
If I understand correctly, this can only occur for shared THP but not
anonymous THP. For instance, if either process allocates an anonymous
THP, it would trigger the creation of a new MM. Please correct me if
I'm mistaken.
What clone(CLONE_VM) will do is essentially create a new process, that
shares the MM with the original process. Similar to a thread, just that
the new process will show up in /proc/ as ... a new process, not as a
thread under /proc/$pid/tasks of the original process.
Both processes will operate on the shared MM struct as if they were
ordinary threads. No Copy-on-Write involved.
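A minimal userspace sketch of that sharing (illustrative only, error
handling trimmed):

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

static int shared_value;	/* lives in the one shared address space */

static int child_fn(void *arg)
{
	shared_value = 42;	/* no fork, no CoW: the parent sees this */
	return 0;
}

int main(void)
{
	const size_t stack_size = 1024 * 1024;
	char *stack = malloc(stack_size);

	/* CLONE_VM without CLONE_THREAD: a new process with its own pid in
	 * /proc, operating on the same mm_struct as the parent. */
	pid_t pid = clone(child_fn, stack + stack_size, CLONE_VM | SIGCHLD, NULL);

	waitpid(pid, NULL, 0);
	printf("shared_value = %d\n", shared_value);	/* prints 42 */
	free(stack);
	return 0;
}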
One example use case I've been involved in is async teardown in QEMU [1].
[1] https://kvm-forum.qemu.org/2022/ibm_async_destroy.pdf
I understand what you mean, but what I'm really confused about is how
this relates to allocating anonymous THP. If either one allocates an
anon THP, it will definitely create a new MM, right?
No. They work on the same address space - same MM. Either can allocate a
new anon THP and the other one would be able to modify it. No fork/CoW.
I only bring it up because it's two "processes" sharing the same MM. And
the THP mode in your proposal would actually be per-MM and not per process.
It's confusing ... :)
Additionally, this
can be implemented per-MM without kernel code modifications.
With a well-designed API, users can even implement custom THP
policies—all without altering kernel code.
You can switch between modes, that's all you can do. I wouldn't really
call that "custom policy" as it is extremely limited.
And that's exactly my point: it's basic switching between modes ... a
reasonable policy in the future will make placement decisions and not
just state "always/never/madvise".
Could you please elaborate further on 'make placement decisions'? As
previously mentioned, we (including the broader community) really need
user input to determine whether THP allocation is appropriate in a
given case.
The glorious future where we make smarter decisions about where to
actually place THPs, even in the "always" mode.
E.g., just because we enable "always" for a process does not mean that
we really want a THP everywhere; quite the opposite.
So 'always' simply means "the system doesn't guarantee THP allocation
will succeed" ?
I mean, with THPs, there are no guarantees, ever :(
If that's the case, we should revisit RFC v1 [0],
where we proposed rejecting THP allocations in certain scenarios for
specific tasks.
Hooking into actual page allocation during page faults (e.g., THP size,
khugepaged collapse decisions) is IMHO a much better application of eBPF
than setting a THP mode per process (or MM ...) using eBPF.
So yes, you could drive the system in "always" mode and decide to not
allocate THPs during page faults / khugepaged for specific processes.
IMHO that also does not contradict the VM_HUGEPAGE / VM_NOHUGEPAGE
default setting proposal: VM_HUGEPAGE could feed into the eBPF program
as yet another parameter to make a decision.
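As a rough sketch of what such a hook could look like (this interface
does not exist; the names and parameters are made up for illustration):

struct bpf_thp_alloc_ops {
	/*
	 * Called on the page-fault path. @vm_flags carries VM_HUGEPAGE /
	 * VM_NOHUGEPAGE, so the per-VMA default feeds into the decision as
	 * one more input. Returns the highest order the kernel should try,
	 * or 0 to fall back to base pages.
	 */
	int (*thp_fault_order)(struct mm_struct *mm, unsigned long vm_flags,
			       int requested_order);

	/* Whether khugepaged may collapse the given range into a THP. */
	bool (*khugepaged_may_collapse)(struct mm_struct *mm,
					unsigned long addr, unsigned long len);
};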
--
Cheers,
David / dhildenb