On 29 Jul 2025, at 5:18, Yafang Shao wrote:

> Background
> ----------
>
> Our production servers consistently configure THP to "never" due to
> historical incidents caused by its behavior. Key issues include:
> - Increased Memory Consumption
>   THP significantly raises overall memory usage, reducing available memory
>   for workloads.
>
> - Latency Spikes
>   Random latency spikes occur due to frequent memory compaction triggered
>   by THP.
>
> - Lack of Fine-Grained Control
>   THP tuning is globally configured, making it unsuitable for containerized
>   environments. When multiple workloads share a host, enabling THP without
>   per-workload control leads to unpredictable behavior.
>
> Due to these issues, administrators avoid switching to madvise or always
> modes—unless per-workload THP control is implemented.
>
> To address this, we propose BPF-based THP policy for flexible adjustment.
> Additionally, as David mentioned [0], this mechanism can also serve as a

The link to [0] is missing. :)

> policy prototyping tool (test policies via BPF before upstreaming them).
>
> Proposed Solution
> -----------------
>
> As suggested by David [0], we introduce a new BPF interface:
>
> /**
>  * @get_suggested_order: Get the suggested highest THP order for allocation
>  * @mm: mm_struct associated with the THP allocation
>  * @tva_flags: TVA flags for current context
>  *             %TVA_IN_PF: Set when in page fault context
>  *             Other flags: Reserved for future use
>  * @order: The highest order being considered for this THP allocation.
>  *         %PUD_ORDER for PUD-mapped allocations

There is no PUD THP yet and the highest THP order is PMD_ORDER. It is better
to remove the line above to avoid confusion.

>  *         %PMD_ORDER for PMD-mapped allocations
>  *         %PMD_ORDER - 1 for mTHP allocations
>  *
>  * Rerurn: Suggested highest THP order to use for allocation. The returned
>  *         order will never exceed the input @order value.
>  */
> int (*get_suggested_order)(struct mm_struct *mm, unsigned long tva_flags, int order);
>
> This interface:
> - Supports both use cases (per-workload tuning + policy prototyping).
> - Can be extended with BPF helpers (e.g., for memory pressure awareness).

IIRC, your initial RFC works at VMA level, but this patch targets mm level.
Is mm sufficient for your use case? Are you planning to extend the BPF
interface to VMA in the future? Just curious.

>
> This is an experimental feature. To use it, you must enable
> CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION.
>
> Warning:
> - The interface may change
> - Behavior may differ in future kernel versions
> - We might remove it in the future
>
> A simple test case is included in Patch #4.
>
> Changes:
> RFC v3->v4:
> - Use a new interface get_suggested_order() (David)
> - Mark it as experimental (David, Lorenzo)
> - Code improvement in THP (Usama)
> - Code improvement in BPF struct ops (Amery)
>
> RFC v2->v3: https://lwn.net/Articles/1024545/
> - Finer-graind tuning based on madvise or always mode (David, Lorenzo)
> - Use BPF to write more advanced policies logic (David, Lorenzo)
>
> RFC v1->v2: https://lwn.net/Articles/1021783/
> The main changes are as follows,
> - Use struct_ops instead of fmod_ret (Alexei)
> - Introduce a new THP mode (Johannes)
> - Introduce new helpers for BPF hook (Zi)
> - Refine the commit log
>
> RFC v1: https://lwn.net/Articles/1019290/
>
> Yafang Shao (4):
>   mm: thp: add support for BPF based THP order selection
>   mm: thp: add a new kfunc bpf_mm_get_mem_cgroup()
>   mm: thp: add a new kfunc bpf_mm_get_task()
>   selftest/bpf: add selftest for BPF based THP order seletection
>
>  include/linux/huge_mm.h                   |  13 +
>  include/linux/khugepaged.h                |  12 +-
>  mm/Kconfig                                |  12 +
>  mm/Makefile                               |   1 +
>  mm/bpf_thp.c                              | 255 ++++++++++++++++++
>  mm/huge_memory.c                          |   9 +
>  mm/khugepaged.c                           |  18 +-
>  mm/memory.c                               |  14 +-
>  tools/testing/selftests/bpf/config        |   2 +
>  .../selftests/bpf/prog_tests/thp_adjust.c | 183 +++++++++++++
>  .../selftests/bpf/progs/test_thp_adjust.c |  69 +++++
>  .../bpf/progs/test_thp_adjust_failure.c   |  24 ++
>  12 files changed, 605 insertions(+), 7 deletions(-)
>  create mode 100644 mm/bpf_thp.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_failure.c
>
> --
> 2.43.5

Best Regards,
Yan, Zi
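
For reference, the contract stated in the quoted kernel-doc (the returned
order never exceeds the input @order) can be modeled in plain userspace C.
This is only a sketch, not code from the series: struct workload_policy and
suggest_order() are hypothetical names, PMD_ORDER is assumed to be 9 (x86-64
with 4 KiB base pages), and the TVA_IN_PF bit value is a placeholder rather
than the kernel's definition.

```c
#include <assert.h>

/* Assumed values, for illustration only; the real ones live in the kernel. */
#define PMD_ORDER 9              /* x86-64 with 4 KiB base pages */
#define TVA_IN_PF (1UL << 1)     /* placeholder bit: page fault context */

/* Hypothetical per-workload caps a policy might look up for an mm. */
struct workload_policy {
	int max_order_in_pf;     /* cap applied in page fault context */
	int max_order_other;     /* cap applied in every other context */
};

/*
 * Userspace model of the get_suggested_order() contract: pick a cap based
 * on the calling context, then clamp so the result is never above @order.
 */
static int suggest_order(const struct workload_policy *p,
			 unsigned long tva_flags, int order)
{
	int cap;

	/* The caller is expected to pass an order in [0, PMD_ORDER]. */
	assert(order >= 0 && order <= PMD_ORDER);

	cap = (tva_flags & TVA_IN_PF) ? p->max_order_in_pf
				      : p->max_order_other;
	if (cap < 0)
		cap = 0;

	return order < cap ? order : cap;
}
```

With caps of 4 in page fault context and PMD_ORDER elsewhere, a request for
PMD_ORDER is reduced to 4 when TVA_IN_PF is set and passed through unchanged
otherwise; a request for order 2 stays at 2 in either case.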