On Tue, May 27, 2025 at 5:27 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 27.05.25 10:40, Yafang Shao wrote:
> > On Tue, May 27, 2025 at 4:30 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >>
> >>>> I don't think we want to add such a mechanism (new mode) where the
> >>>> primary configuration mechanism is through bpf.
> >>>>
> >>>> Maybe bpf could be used as an alternative, but we should look into a
> >>>> reasonable alternative first, like the discussed mctrl()/.../ raised in
> >>>> the process_madvise() series.
> >>>>
> >>>> No "bpf" mode in disguise, please :)
> >>>
> >>> This goal can be readily achieved using a BPF program. In any case, it
> >>> is a feasible solution.
> >>
> >> No BPF-only solution.
> >>
> >>>
> >>>>
> >>>>> We could define
> >>>>> the API as follows:
> >>>>>
> >>>>> struct bpf_thp_ops {
> >>>>>         /**
> >>>>>          * @task_thp_mode: Get the THP mode for a specific task
> >>>>>          *
> >>>>>          * Return:
> >>>>>          * - TASK_THP_ALWAYS:  "always" mode
> >>>>>          * - TASK_THP_MADVISE: "madvise" mode
> >>>>>          * - TASK_THP_NEVER:   "never" mode
> >>>>>          * Future modes can also be added.
> >>>>>          */
> >>>>>         int (*task_thp_mode)(struct task_struct *p);
> >>>>> };
> >>>>>
> >>>>> For observability, we could add a "THP mode" field to
> >>>>> /proc/[pid]/status. For example:
> >>>>>
> >>>>>   $ grep "THP mode" /proc/123/status
> >>>>>   always
> >>>>>   $ grep "THP mode" /proc/456/status
> >>>>>   madvise
> >>>>>   $ grep "THP mode" /proc/789/status
> >>>>>   never
> >>>>>
> >>>>> The THP mode for each task would be determined by the attached BPF
> >>>>> program based on the task's attributes. We would place the BPF hook in
> >>>>> appropriate kernel functions. Note that this setting wouldn't be
> >>>>> inherited during fork/exec - the BPF program would make the decision
> >>>>> dynamically for each task.
> >>>>
> >>>> What would be the mode (default) when the bpf program would not be active?
> >>>>
> >>>>> This approach also enables runtime adjustments to THP modes based on
> >>>>> system-wide conditions, such as memory fragmentation or other
> >>>>> performance overheads. The BPF program could adapt policies
> >>>>> dynamically, optimizing THP behavior in response to changing
> >>>>> workloads.
> >>>>
> >>>> I am not sure that is the proper way to handle these scenarios: I never
> >>>> heard that people would be adjusting the system-wide policy dynamically
> >>>> in that way either.
> >>>>
> >>>> Whatever we do, we have to make sure that what we add won't
> >>>> over-complicate things in the future. Having tooling dynamically adjust
> >>>> the THP policy of processes that coarsely sounds ... very wrong long-term.
> >>>
> >>> This is just an example demonstrating how BPF can be used to adjust
> >>> its flexibility. Notably, all these policies can be implemented
> >>> without modifying the kernel.
> >>
> >> See below on "policy".
> >>
> >>>
> >>>>
> >>>>> As Liam pointed out in another thread, naming is challenging here -
> >>>>> "process" might not be the most accurate term for this context.
> >>>>
> >>>> No, it's not even a per-process thing. It is per MM, and a MM might be
> >>>> used by multiple processes ...
> >>>
> >>> I consistently use 'thread' for the latter case.
> >>
> >> You can use CLONE_VM without CLONE_THREAD ...
> >
> > If I understand correctly, this can only occur for shared THP but not
> > anonymous THP. For instance, if either process allocates an anonymous
> > THP, it would trigger the creation of a new MM. Please correct me if
> > I'm mistaken.
>
> What clone(CLONE_VM) will do is essentially create a new process, that
> shares the MM with the original process. Similar to a thread, just that
> the new process will show up in /proc/ as ... a new process, not as a
> thread under /prod/$pid/tasks of the original process.
>
> Both processes will operate on the shared MM struct as if they were
> ordinary threads. No Copy-on-Write involved.
>
> One example use case I've been involved in is async teardown in QEMU [1].
>
> [1] https://kvm-forum.qemu.org/2022/ibm_async_destroy.pdf

I understand what you mean, but what I'm really confused about is how
this relates to allocating anonymous THP. If either one allocates anon
THP, it will definitely create a new MM, right?
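Just to make sure we are talking about the same scenario, this is roughly
what I have in mind (an untested sketch, error handling omitted): one
process clone()s with CLONE_VM but without CLONE_THREAD, and the new
process then faults in an anonymous mapping that is a THP candidate.

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define STACK_SIZE	(1024 * 1024)
#define MAP_LEN		(4UL << 20)	/* 4 MiB anonymous mapping */

/* Runs in a new process that shares the caller's MM (no CoW). */
static int worker(void *arg)
{
	char *buf = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	madvise(buf, MAP_LEN, MADV_HUGEPAGE);
	memset(buf, 1, MAP_LEN);	/* fault it in; THP candidate */
	return 0;
}

int main(void)
{
	char *stack = malloc(STACK_SIZE);
	/* New pid in /proc, but the same mm_struct as the parent. */
	pid_t pid = clone(worker, stack + STACK_SIZE, CLONE_VM | SIGCHLD, NULL);

	waitpid(pid, NULL, 0);
	return 0;
}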
> >
> >>
> >>> Additionally, this
> >>> can be implemented per-MM without kernel code modifications.
> >>> With a well-designed API, users can even implement custom THP
> >>> policies—all without altering kernel code.
> >>
> >> You can switch between modes, that' all you can do. I wouldn't really
> >> call that "custom policy" as it is extremely limited.
> >>
> >> And that's exactly my point: it's basic switching between modes ... a
> >> reasonable policy in the future will make placement decisions and not
> >> just state "always/never/madvise".
> >
> > Could you please elaborate further on 'make placement decisions'? As
> > previously mentioned, we (including the broader community) really need
> > the user input to determine whether THP allocation is appropriate in a
> > given case.
>
> The glorious future were we make smarter decisions where to actually
> place THPs even in the "always" mode.
>
> E.g., just because we enable "always" for a process does not mean that
> we really want a THP everywhere; quite the opposite.

So 'always' simply means "the system doesn't guarantee THP allocation
will succeed"? If that's the case, we should revisit RFC v1 [0], where
we proposed rejecting THP allocations in certain scenarios for specific
tasks.

[0] https://lwn.net/Articles/1019290/

>
> Treat the "always"/"madvise"/"never" as a rough mode, not a future-proof
> policy that we would want to fine-tune dynamically ... that would be
> very limiting.

--
Regards
Yafang