On Tue, May 27, 2025 at 4:30 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> >> I don't think we want to add such a mechanism (new mode) where the
> >> primary configuration mechanism is through bpf.
> >>
> >> Maybe bpf could be used as an alternative, but we should look into a
> >> reasonable alternative first, like the discussed mctrl()/.../ raised in
> >> the process_madvise() series.
> >>
> >> No "bpf" mode in disguise, please :)
> >
> > This goal can be readily achieved using a BPF program. In any case, it
> > is a feasible solution.
>
> No BPF-only solution.
>
> >
> >>
> >>> We could define
> >>> the API as follows:
> >>>
> >>> struct bpf_thp_ops {
> >>>         /**
> >>>          * @task_thp_mode: Get the THP mode for a specific task
> >>>          *
> >>>          * Return:
> >>>          * - TASK_THP_ALWAYS:  "always" mode
> >>>          * - TASK_THP_MADVISE: "madvise" mode
> >>>          * - TASK_THP_NEVER:   "never" mode
> >>>          * Future modes can also be added.
> >>>          */
> >>>         int (*task_thp_mode)(struct task_struct *p);
> >>> };
> >>>
> >>> For observability, we could add a "THP mode" field to
> >>> /proc/[pid]/status. For example:
> >>>
> >>> $ grep "THP mode" /proc/123/status
> >>> always
> >>> $ grep "THP mode" /proc/456/status
> >>> madvise
> >>> $ grep "THP mode" /proc/789/status
> >>> never
> >>>
> >>> The THP mode for each task would be determined by the attached BPF
> >>> program based on the task's attributes. We would place the BPF hook in
> >>> appropriate kernel functions. Note that this setting wouldn't be
> >>> inherited during fork/exec - the BPF program would make the decision
> >>> dynamically for each task.
> >>
> >> What would be the mode (default) when the bpf program would not be active?
> >>
> >>> This approach also enables runtime adjustments to THP modes based on
> >>> system-wide conditions, such as memory fragmentation or other
> >>> performance overheads. The BPF program could adapt policies
> >>> dynamically, optimizing THP behavior in response to changing
> >>> workloads.
> >>
> >> I am not sure that is the proper way to handle these scenarios: I never
> >> heard that people would be adjusting the system-wide policy dynamically
> >> in that way either.
> >>
> >> Whatever we do, we have to make sure that what we add won't
> >> over-complicate things in the future. Having tooling dynamically adjust
> >> the THP policy of processes that coarsely sounds ... very wrong long-term.
> >
> > This is just an example demonstrating how BPF can be used to adjust
> > its flexibility. Notably, all these policies can be implemented
> > without modifying the kernel.
>
> See below on "policy".
>
> >
> >>
> >>> As Liam pointed out in another thread, naming is challenging here -
> >>> "process" might not be the most accurate term for this context.
> >>
> >> No, it's not even a per-process thing. It is per MM, and a MM might be
> >> used by multiple processes ...
> >
> > I consistently use 'thread' for the latter case.
>
> You can use CLONE_VM without CLONE_THREAD ...

If I understand correctly, this can only occur for shared THP but not
anonymous THP. For instance, if either process allocates an anonymous
THP, it would trigger the creation of a new MM. Please correct me if
I'm mistaken.

> > Additionally, this
> > can be implemented per-MM without kernel code modifications.
> > With a well-designed API, users can even implement custom THP
> > policies—all without altering kernel code.
>
> You can switch between modes, that's all you can do. I wouldn't really
> call that "custom policy" as it is extremely limited.
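To make the discussion more concrete, below is a rough sketch of what
the BPF side could look like under the proposed interface. To be clear,
bpf_thp_ops, task_thp_mode, and the TASK_THP_* values are only the
proposal from this thread (none of them exist in the mainline kernel
today), and the comm-based policy shown is just an illustrative
assumption:

// SPDX-License-Identifier: GPL-2.0
/* Sketch only: bpf_thp_ops, task_thp_mode and the TASK_THP_* values
 * are the interface proposed in this thread; none of them exist in
 * the mainline kernel today.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* Proposed return values, mirroring the API sketch above. */
#define TASK_THP_ALWAYS		0
#define TASK_THP_MADVISE	1
#define TASK_THP_NEVER		2

char _license[] SEC("license") = "GPL";

/* Illustrative policy: disable THP for tasks whose comm starts with
 * "redis" (standing in for a latency-sensitive workload), and fall
 * back to "madvise" for everything else.
 */
SEC("struct_ops/task_thp_mode")
int BPF_PROG(task_thp_mode, struct task_struct *p)
{
	char comm[16];

	bpf_probe_read_kernel_str(comm, sizeof(comm), p->comm);
	if (!bpf_strncmp(comm, 5, "redis"))
		return TASK_THP_NEVER;
	return TASK_THP_MADVISE;
}

SEC(".struct_ops.link")
struct bpf_thp_ops thp_ops = {
	.task_thp_mode = (void *)task_thp_mode,
};

Attaching would follow the standard struct_ops flow (e.g. libbpf's
bpf_map__attach_struct_ops()); the idea would be that the global sysfs
policy keeps applying whenever no program is attached.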
>
> And that's exactly my point: it's basic switching between modes ... a
> reasonable policy in the future will make placement decisions and not
> just state "always/never/madvise".

Could you please elaborate further on 'make placement decisions'? As
previously mentioned, we (including the broader community) really need
user input to determine whether THP allocation is appropriate in a
given case.

--
Regards
Yafang