Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment

On 27.05.25 10:40, Yafang Shao wrote:
On Tue, May 27, 2025 at 4:30 PM David Hildenbrand <david@xxxxxxxxxx> wrote:

I don't think we want to add such a mechanism (new mode) where the
primary configuration mechanism is through bpf.

Maybe bpf could be used as an alternative, but we should look into a
reasonable alternative first, like the discussed mctrl()/.../ raised in
the process_madvise() series.

No "bpf" mode in disguise, please :)

This goal can be readily achieved using a BPF program. In any case, it
is a feasible solution.

No BPF-only solution.



We could define
the API as follows:

struct bpf_thp_ops {
          /**
           * @task_thp_mode: Get the THP mode for a specific task
           *
           * Return:
           * - TASK_THP_ALWAYS: "always" mode
           * - TASK_THP_MADVISE: "madvise" mode
           * - TASK_THP_NEVER: "never" mode
           * Future modes can also be added.
           */
          int (*task_thp_mode)(struct task_struct *p);
};
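
To make this concrete, a struct_ops program against this proposed interface
could look roughly like the sketch below. Everything in it is illustrative:
the interface above is only a proposal, and the numeric mode values, section
names, and example policy are assumptions, not existing kernel or libbpf
definitions.

/* Illustrative sketch only: assumes the proposed bpf_thp_ops existed. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_tracing.h>

/* Placeholder values; the proposal does not define the constants. */
#define TASK_THP_ALWAYS         0
#define TASK_THP_MADVISE        1
#define TASK_THP_NEVER          2

char _license[] SEC("license") = "GPL";

SEC("struct_ops/task_thp_mode")
int BPF_PROG(task_thp_mode, struct task_struct *p)
{
        char comm[16];

        BPF_CORE_READ_STR_INTO(&comm, p, comm);

        /* Example policy: treat tasks whose name starts with "lat-" as
         * latency-sensitive and give them "never"; default to "madvise". */
        if (!bpf_strncmp(comm, 4, "lat-"))
                return TASK_THP_NEVER;

        return TASK_THP_MADVISE;
}

SEC(".struct_ops.link")
struct bpf_thp_ops thp_ops = {
        .task_thp_mode = (void *)task_thp_mode,
};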

For observability, we could add a "THP mode" field to
/proc/[pid]/status. For example:

$ grep "THP mode" /proc/123/status
always
$ grep "THP mode" /proc/456/status
madvise
$ grep "THP mode" /proc/789/status
never

The THP mode for each task would be determined by the attached BPF
program based on the task's attributes. We would place the BPF hook in
appropriate kernel functions. Note that this setting wouldn't be
inherited during fork/exec - the BPF program would make the decision
dynamically for each task.
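
As a rough sketch of where such a lookup could sit (all identifiers here are
hypothetical; nothing like this exists in the kernel today):

/* Hypothetical sketch; bpf_thp_ops_attached() and the constants are
 * made-up names for illustration only. */
static int task_thp_mode(struct task_struct *p)
{
        if (bpf_thp_ops_attached())
                return bpf_thp_ops->task_thp_mode(p);

        /* Placeholder: what to return when no program is attached is
         * exactly the open question raised below. */
        return TASK_THP_GLOBAL_DEFAULT;
}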

What would be the default mode when the bpf program is not active?

This approach also enables runtime adjustments to THP modes based on
system-wide conditions, such as memory fragmentation or other
performance overheads. The BPF program could adapt policies
dynamically, optimizing THP behavior in response to changing
workloads.

I am not sure that is the proper way to handle these scenarios: I have
never heard of people adjusting the system-wide policy dynamically in
that way either.

Whatever we do, we have to make sure that what we add won't
over-complicate things in the future. Having tooling dynamically adjust
the THP policy of processes at such a coarse granularity sounds ... very
wrong long-term.

This is just an example demonstrating how BPF can be used and the
flexibility it provides. Notably, all these policies can be implemented
without modifying the kernel.

See below on "policy".



As Liam pointed out in another thread, naming is challenging here -
"process" might not be the most accurate term for this context.

No, it's not even a per-process thing. It is per MM, and a MM might be
used by multiple processes ...

I consistently use 'thread' for the latter case.

You can use CLONE_VM without CLONE_THREAD ...

If I understand correctly, this can only occur for shared THP but not
anonymous THP. For instance, if either process allocates an anonymous
THP, it would trigger the creation of a new MM. Please correct me if
I'm mistaken.

What clone(CLONE_VM) will do is essentially create a new process that shares the MM with the original process. Similar to a thread, just that the new process will show up in /proc/ as ... a new process, not as a thread under /proc/$pid/tasks of the original process.

Both processes will operate on the shared MM struct as if they were ordinary threads. No Copy-on-Write involved.

One example use case I've been involved in is async teardown in QEMU [1].

[1] https://kvm-forum.qemu.org/2022/ibm_async_destroy.pdf
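
(Not from the thread, just a minimal userspace illustration of the above:
with CLONE_VM but without CLONE_THREAD, the child gets its own PID yet
writes straight into the parent's memory, since both tasks share one MM.)

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static int shared_value;                /* lives in the shared MM */

static int child_fn(void *arg)
{
        shared_value = 42;              /* no CoW: same page as the parent */
        printf("child  pid=%d\n", getpid());
        return 0;
}

int main(void)
{
        char *stack = malloc(1024 * 1024);

        if (!stack)
                return 1;

        /* CLONE_VM without CLONE_THREAD: a new process sharing our MM. */
        pid_t pid = clone(child_fn, stack + 1024 * 1024,
                          CLONE_VM | SIGCHLD, NULL);
        if (pid < 0)
                return 1;

        waitpid(pid, NULL, 0);
        printf("parent pid=%d shared_value=%d\n", getpid(), shared_value);
        free(stack);
        return 0;
}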



Additionally, this
can be implemented per-MM without kernel code modifications.
With a well-designed API, users can even implement custom THP
policies—all without altering kernel code.

You can switch between modes, that's all you can do. I wouldn't really
call that "custom policy" as it is extremely limited.

And that's exactly my point: it's basic switching between modes ... a
reasonable policy in the future will make placement decisions and not
just state "always/never/madvise".

Could you please elaborate further on 'make placement decisions'? As
previously mentioned, we (including the broader community) really need
user input to determine whether THP allocation is appropriate in a
given case.

The glorious future where we make smarter decisions about where to actually place THPs, even in the "always" mode.

E.g., just because we enable "always" for a process does not mean that we really want a THP everywhere; quite the opposite.

Treat the "always"/"madvise"/"never" as a rough mode, not a future-proof policy that we would want to fine-tune dynamically ... that would be very limiting.

--
Cheers,

David / dhildenb




