On 18/08/2025 06:55, Yafang Shao wrote:
> Background
> ----------
>
> Our production servers consistently configure THP to "never" due to
> historical incidents caused by its behavior. Key issues include:
> - Increased Memory Consumption
>   THP significantly raises overall memory usage, reducing available memory
>   for workloads.
>
> - Latency Spikes
>   Random latency spikes occur due to frequent memory compaction triggered
>   by THP.
>
> - Lack of Fine-Grained Control
>   THP tuning is globally configured, making it unsuitable for containerized
>   environments. When multiple workloads share a host, enabling THP without
>   per-workload control leads to unpredictable behavior.
>
> Due to these issues, administrators avoid switching to madvise or always
> modes unless per-workload THP control is implemented.
>
> To address this, we propose BPF-based THP policy for flexible adjustment.
> Additionally, as David mentioned [0], this mechanism can also serve as a
> policy prototyping tool (test policies via BPF before upstreaming them).

Hi Yafang,

A few points:

The link [0] is referenced a couple of times in the cover letter, but the
link itself doesn't seem to be anywhere in the cover letter.

I am probably missing something here, but the current version won't
accomplish the use case you describe at the start of the cover letter and
are aiming for, right? i.e. THP global policy "never", but get hugepages
on an madvise or always basis. I think a new THP mode was introduced in
some earlier revision, where you could switch to it from "never" and then
use bpf programs with it, but it's not in this revision? It might be
useful to add your specific use case as a selftest.

Do we have some numbers on the overhead of calling the bpf program in the
page fault path, as it's a critical path?

I remember there was a discussion on this in the earlier revisions, and I
have mentioned this in patch 1 as well, but I think making this feature
experimental with warnings might not be a great idea. It could lead to 2
paths:
- people don't deploy this in their fleet because it's marked as
  experimental and they don't want their machines to break once they
  upgrade the kernel and this is changed. We will have a difficult time
  improving upon this, as it is just going to be used for prototyping and
  won't be driven by production data.
- people are careless and deploy it on their production machines, and
  you get reports that this has broken after kernel upgrades (despite
  being marked as experimental :)).

This is just my opinion (which can be wrong :)), but I think we should try
and have this merged as a stable interface that won't change. There might
be bugs reported down the line, but I am hoping we can get the interface
of get_suggested_order right in the first implementation that gets merged?

Thanks!
Usama

>
> Proposed Solution
> -----------------
>
> As suggested by David [0], we introduce a new BPF interface:
>
> /**
>  * @get_suggested_order: Get the suggested THP orders for allocation
>  * @mm: mm_struct associated with the THP allocation
>  * @vma__nullable: vm_area_struct associated with the THP allocation (may be NULL)
>  *                 When NULL, the decision should be based on @mm (i.e., when
>  *                 triggered from an mm-scope hook rather than a VMA-specific
>  *                 context).
>  *                 Must belong to @mm (guaranteed by the caller).
>  * @vma_flags: use these vm_flags instead of @vma->vm_flags (0 if @vma is NULL)
>  * @tva_flags: TVA flags for current @vma (-1 if @vma is NULL)
>  * @orders: Bitmask of requested THP orders for this allocation
>  *          - PMD-mapped allocation if PMD_ORDER is set
>  *          - mTHP allocation otherwise
>  *
>  * Return: Bitmask of suggested THP orders for allocation. The highest
>  *         suggested order will not exceed the highest requested order
>  *         in @orders.
>  */
> int (*get_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
>                            u64 vma_flags, enum tva_type tva_flags, int orders) __rcu;
>
> This interface:
> - Supports both use cases (per-workload tuning + policy prototyping).
> - Can be extended with BPF helpers (e.g., for memory pressure awareness).
>
> This is an experimental feature. To use it, you must enable
> CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION.
>
> Warning:
> - The interface may change
> - Behavior may differ in future kernel versions
> - We might remove it in the future
>
> A simple test case is included in Patch #4.
>
> Future work:
> - Extend it to File THP
>
> Changes:
> RFC v4->v5:
> - Add support for vma (David)
> - Add mTHP support in khugepaged (Zi)
> - Use bitmask of all allowed orders instead (Zi)
> - Retrieve the page size and PMD order rather than hardcoding them (Zi)
>
> RFC v3->v4: https://lwn.net/Articles/1031829/
> - Use a new interface get_suggested_order() (David)
> - Mark it as experimental (David, Lorenzo)
> - Code improvement in THP (Usama)
> - Code improvement in BPF struct ops (Amery)
>
> RFC v2->v3: https://lwn.net/Articles/1024545/
> - Finer-grained tuning based on madvise or always mode (David, Lorenzo)
> - Use BPF to write more advanced policies logic (David, Lorenzo)
>
> RFC v1->v2: https://lwn.net/Articles/1021783/
> The main changes are as follows,
> - Use struct_ops instead of fmod_ret (Alexei)
> - Introduce a new THP mode (Johannes)
> - Introduce new helpers for BPF hook (Zi)
> - Refine the commit log
>
> RFC v1: https://lwn.net/Articles/1019290/
>
> Yafang Shao (5):
>   mm: thp: add support for BPF based THP order selection
>   mm: thp: add a new kfunc bpf_mm_get_mem_cgroup()
>   mm: thp: add a new kfunc bpf_mm_get_task()
>   bpf: mark vma->vm_mm as trusted
>   selftest/bpf: add selftest for BPF based THP order seletection
>
>  include/linux/huge_mm.h                   |  15 +
>  include/linux/khugepaged.h                |  12 +-
>  kernel/bpf/verifier.c                     |   5 +
>  mm/Kconfig                                |  12 +
>  mm/Makefile                               |   1 +
>  mm/bpf_thp.c                              | 269 ++++++++++++++++++
>  mm/huge_memory.c                          |  10 +
>  mm/khugepaged.c                           |  26 +-
>  mm/memory.c                               |  18 +-
>  tools/testing/selftests/bpf/config        |   3 +
>  .../selftests/bpf/prog_tests/thp_adjust.c | 224 +++++++++++++++
>  .../selftests/bpf/progs/test_thp_adjust.c |  76 +++++
>  .../bpf/progs/test_thp_adjust_failure.c   |  25 ++
>  13 files changed, 689 insertions(+), 7 deletions(-)
>  create mode 100644 mm/bpf_thp.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_failure.c
>
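
For reference, the return-value contract quoted above ("the highest
suggested order will not exceed the highest requested order in @orders")
can be modelled in plain userspace C. This is only a sketch of the bitmask
arithmetic, not code from the series: highest_order() and
clamp_suggested_orders() are hypothetical names, the masks are unsigned
here although the hook uses int, and whether the kernel side enforces the
rule this way (or at all) is an assumption.

```c
#include <assert.h>

/*
 * Hypothetical helper: index of the highest set bit in an order bitmask,
 * or -1 when the mask is empty. Bit N set means "order N is allowed".
 */
static int highest_order(unsigned int orders)
{
	int order = -1;

	while (orders) {
		orders >>= 1;
		order++;
	}
	return order;
}

/*
 * Hypothetical helper: drop every suggested order that is higher than the
 * highest requested order. Lower orders that were not requested are kept,
 * because the quoted contract only bounds the highest order.
 */
static unsigned int clamp_suggested_orders(unsigned int requested,
					   unsigned int suggested)
{
	unsigned int allowed, result;
	int top = highest_order(requested);

	if (top < 0)
		return 0;	/* nothing requested, nothing to suggest */

	/* All bits from 0 up to and including 'top'; well defined for top == 31. */
	allowed = (2u << top) - 1u;
	result = suggested & allowed;

	/* Postcondition taken from the quoted kernel-doc comment. */
	assert(result == 0 || highest_order(result) <= top);
	return result;
}
```

With requested orders {4, 2} and a policy that suggests {9, 4, 3}, the
model keeps {4, 3}: order 9 is dropped because it exceeds the highest
requested order, while order 3 survives because the contract as quoted
does not restrict suggestions to the requested set.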