Background
----------

Our production servers consistently configure THP to "never" due to
historical incidents caused by its behavior. Key issues include:

- Increased Memory Consumption
  THP significantly raises overall memory usage, reducing available memory
  for workloads.

- Latency Spikes
  Random latency spikes occur due to frequent memory compaction triggered
  by THP.

- Lack of Fine-Grained Control
  THP tuning is globally configured, making it unsuitable for containerized
  environments. When multiple workloads share a host, enabling THP without
  per-workload control leads to unpredictable behavior.

Due to these issues, administrators avoid switching to madvise or always
modes unless per-workload THP control is implemented.

To address this, we propose a BPF-based THP policy for flexible adjustment.
Additionally, as David mentioned [0], this mechanism can also serve as a
policy prototyping tool (test policies via BPF before upstreaming them).

Proposed Solution
-----------------

As suggested by David [0], we introduce a new BPF interface:

  /**
   * @get_suggested_order: Get the suggested highest THP order for allocation
   * @mm: mm_struct associated with the THP allocation
   * @tva_flags: TVA flags for current context
   *             %TVA_IN_PF: Set when in page fault context
   *             Other flags: Reserved for future use
   * @order: The highest order being considered for this THP allocation.
   *         %PUD_ORDER for PUD-mapped allocations
   *         %PMD_ORDER for PMD-mapped allocations
   *         %PMD_ORDER - 1 for mTHP allocations
   *
   * Return: Suggested highest THP order to use for allocation. The returned
   * order will never exceed the input @order value.
   */
  int (*get_suggested_order)(struct mm_struct *mm, unsigned long tva_flags,
                             int order);

This interface:
- Supports both use cases (per-workload tuning + policy prototyping).
- Can be extended with BPF helpers (e.g., for memory pressure awareness).

This is an experimental feature. To use it, you must enable
CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION.
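For illustration only, the get_suggested_order() contract described above
(a policy suggests an order, and the result never exceeds the input order)
can be modelled in plain userspace C. This is a sketch, not kernel or BPF
code and not the implementation in this series. Every sketch_* and SKETCH_*
name is hypothetical, the bit chosen for the page-fault flag is an
assumption, and treating a missing policy or a negative return as "no
change" and "order 0" respectively is also an assumption.

```c
/* Userspace model of the callback contract; not kernel or BPF code. */
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration; the kernel defines the real ones. */
#define SKETCH_TVA_IN_PF (1UL << 0) /* hypothetical bit position */
#define SKETCH_PMD_ORDER 9          /* x86-64 with 4 KiB base pages */

struct mm_struct; /* opaque stand-in for the kernel type */

/* Same shape as the proposed get_suggested_order() callback. */
typedef int (*sketch_order_fn)(struct mm_struct *mm,
                               unsigned long tva_flags, int order);

/* Hypothetical policy: keep the full order only in page-fault context. */
static int sketch_pf_only_policy(struct mm_struct *mm,
                                 unsigned long tva_flags, int order)
{
    (void)mm;
    return (tva_flags & SKETCH_TVA_IN_PF) ? order : 0;
}

/* Hypothetical misbehaving policy that asks for more than was offered. */
static int sketch_greedy_policy(struct mm_struct *mm,
                                unsigned long tva_flags, int order)
{
    (void)mm;
    (void)tva_flags;
    return order + 1;
}

/* Caller-side wrapper enforcing "never exceed the input order". */
static int sketch_suggested_order(sketch_order_fn policy,
                                  struct mm_struct *mm,
                                  unsigned long tva_flags, int order)
{
    int suggested;

    assert(order >= 0);
    if (!policy)
        return order; /* assumption: no policy attached, keep the order */

    suggested = policy(mm, tva_flags, order);
    if (suggested < 0)
        return 0; /* assumption: treat negative values as order 0 */
    return suggested > order ? order : suggested;
}
```

Under these assumptions, the page-fault-only policy keeps SKETCH_PMD_ORDER
when SKETCH_TVA_IN_PF is set and drops to order 0 otherwise, while the
greedy policy is clamped back to the order it was offered.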
Warning:
- The interface may change
- Behavior may differ in future kernel versions
- We might remove it in the future

A simple test case is included in Patch #4.

Changes:

RFC v3->v4:
- Use a new interface get_suggested_order() (David)
- Mark it as experimental (David, Lorenzo)
- Code improvement in THP (Usama)
- Code improvement in BPF struct ops (Amery)

RFC v2->v3: https://lwn.net/Articles/1024545/
- Finer-grained tuning based on madvise or always mode (David, Lorenzo)
- Use BPF to write more advanced policy logic (David, Lorenzo)

RFC v1->v2: https://lwn.net/Articles/1021783/
The main changes are as follows:
- Use struct_ops instead of fmod_ret (Alexei)
- Introduce a new THP mode (Johannes)
- Introduce new helpers for BPF hook (Zi)
- Refine the commit log

RFC v1: https://lwn.net/Articles/1019290/

Yafang Shao (4):
  mm: thp: add support for BPF based THP order selection
  mm: thp: add a new kfunc bpf_mm_get_mem_cgroup()
  mm: thp: add a new kfunc bpf_mm_get_task()
  selftest/bpf: add selftest for BPF based THP order seletection

 include/linux/huge_mm.h                       |  13 +
 include/linux/khugepaged.h                    |  12 +-
 mm/Kconfig                                    |  12 +
 mm/Makefile                                   |   1 +
 mm/bpf_thp.c                                  | 255 ++++++++++++++++++
 mm/huge_memory.c                              |   9 +
 mm/khugepaged.c                               |  18 +-
 mm/memory.c                                   |  14 +-
 tools/testing/selftests/bpf/config            |   2 +
 .../selftests/bpf/prog_tests/thp_adjust.c     | 183 +++++++++++++
 .../selftests/bpf/progs/test_thp_adjust.c     |  69 +++++
 .../bpf/progs/test_thp_adjust_failure.c       |  24 ++
 12 files changed, 605 insertions(+), 7 deletions(-)
 create mode 100644 mm/bpf_thp.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_failure.c

--
2.43.5