On Tue, Aug 26, 2025 at 03:19:38PM +0800, Yafang Shao wrote:
> Background
> ==========
>
> Our production servers consistently configure THP to "never" due to
> historical incidents caused by its behavior. Key issues include:
> - Increased Memory Consumption
>   THP significantly raises overall memory usage, reducing available memory
>   for workloads.
>
> - Latency Spikes
>   Random latency spikes occur due to frequent memory compaction triggered
>   by THP.
>
> - Lack of Fine-Grained Control
>   THP tuning is globally configured, making it unsuitable for containerized
>   environments. When multiple workloads share a host, enabling THP without
>   per-workload control leads to unpredictable behavior.
>
> Due to these issues, administrators avoid switching to madvise or always
> modes—unless per-workload THP control is implemented.
>
> To address this, we propose BPF-based THP policy for flexible adjustment.
> Additionally, as David mentioned [0], this mechanism can also serve as a
> policy prototyping tool (test policies via BPF before upstreaming them).

I think it's important to highlight here that we are exploring an
_experimental_ implementation.

>
> Proposed Solution
> =================
>
> As suggested by David [0], we introduce a new BPF interface:

I do agree, to be clear, with this broad approach - that is, providing the
minimum information upon which a reasonable decision can be made, and
keeping things as simple as we can.

As per the THP cabal (I think? :) the general consensus was in line with
this.

>
> /**
>  * @get_suggested_order: Get the suggested THP orders for allocation
>  * @mm: mm_struct associated with the THP allocation
>  * @vma__nullable: vm_area_struct associated with the THP allocation (may be NULL)
>  *                 When NULL, the decision should be based on @mm (i.e., when
>  *                 triggered from an mm-scope hook rather than a VMA-specific
>  *                 context).

I'm a little wary of handing a VMA to BPF - under what locking would it be
provided?

>  *                 Must belong to @mm (guaranteed by the caller).
>  * @vma_flags: use these vm_flags instead of @vma->vm_flags (0 if @vma is NULL)

Hmm this one is also a bit odd - why would these flags differ?

Note that I will be changing the VMA flags to a bitmap relatively soon,
which may be larger than the system word size. So 'handing around all the
flags' is something we probably want to avoid.

For the f_op->mmap_prepare stuff I provided an abstraction for this.

>  * @tva_flags: TVA flags for current @vma (-1 if @vma is NULL)
>  * @orders: Bitmask of requested THP orders for this allocation
>  *          - PMD-mapped allocation if PMD_ORDER is set
>  *          - mTHP allocation otherwise
>  *
>  * Rerurn: Bitmask of suggested THP orders for allocation. The highest

Obv. a cover letter thing, but typo here :P rerurn -> return.

>  *         suggested order will not exceed the highest requested order
>  *         in @orders.

In what sense are they 'suggested'? Is this a product of sysfs settings or?
I think this needs to be clearer.

>  */
> int (*get_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
>                            u64 vma_flags, enum tva_type tva_flags, int orders) __rcu;

Also here, in what sense is this suggested? :)

>
> This interface:
> - Supports both use cases (per-workload tuning + policy prototyping).
> - Can be extended with BPF helpers (e.g., for memory pressure awareness).

Hm, how would extensions like this work?

>
> This is an experimental feature. To use it, you must enable
> CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION.

Yes! Thanks. I am glad we are putting this behind a config flag.
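
Just to check my understanding of the interface, would something along the
lines of the below be roughly how you envisage a policy program wiring in?
Entirely untested sketch - the struct ops name 'bpf_thp_ops', the section
names and the hardcoded PMD order are my guesses rather than taken from the
series (which apparently retrieves the PMD order at runtime), so please read
it as illustrative only:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

/* Assumption: PMD order 9, i.e. x86-64 with 4K base pages. */
#define ASSUMED_PMD_ORDER 9

SEC("struct_ops/get_suggested_order")
int BPF_PROG(suggested_order, struct mm_struct *mm,
	     struct vm_area_struct *vma__nullable,
	     u64 vma_flags, enum tva_type tva_flags, int orders)
{
	/* Toy policy: only ever suggest the PMD order, and only when the
	 * caller actually requested it; suppress all mTHP orders. */
	return orders & (1 << ASSUMED_PMD_ORDER);
}

SEC(".struct_ops.link")
struct bpf_thp_ops thp_policy = {
	.get_suggested_order = (void *)suggested_order,
};

If that's roughly the shape of things, then per my question above it would
be good for the documentation to spell out exactly how the returned mask
interacts with the sysfs THP settings.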
>
> Warning:
> - The interface may change
> - Behavior may differ in future kernel versions
> - We might remove it in the future
>
>
> Selftests
> =========
>
> BPF selftests
> -------------
>
> Patch #5: Implements a basic BPF THP policy that restricts THP allocation
>           via khugepaged to tasks within a specified memory cgroup.
> Patch #6: Contains test cases validating the khugepaged fork behavior.
> Patch #7: Provides tests for dynamic BPF program updates and replacement.
> Patch #8: Includes negative tests for invalid BPF helper usage, verifying
>           proper verification by the BPF verifier.
>
> Currently, several dependency patches reside in mm-new but haven't been
> merged into bpf-next:
>   mm: add bitmap mm->flags field
>   mm/huge_memory: convert "tva_flags" to "enum tva_type"
>   mm: convert core mm to mm_flags_*() accessors
>
> To enable BPF CI testing, these dependencies were manually applied to
> bpf-next [1]. All selftests in this series pass successfully. The observed
> CI failures are unrelated to these changes.

Cool, glad at least my mm changes were ok :)

>
> Performance Evaluation
> ----------------------
>
> As suggested by Usama [2], performance impact was measured given the page
> fault handler modifications. The standard `perf bench mem memset` benchmark
> was employed to assess page fault performance.
>
> Testing was conducted on an AMD EPYC 7W83 64-Core Processor (single NUMA
> node). Due to variance between individual test runs, a script executed
> 10000 iterations to calculate meaningful averages and standard deviations.
>
> The results across three configurations show negligible performance impact:
> - Baseline (without this patch series)
> - With patch series but no BPF program attached
> - With patch series and BPF program attached
>
> The results are as follows,
>
>   Number of runs: 10,000
>   Average throughput: 40-41 GB/sec
>   Standard deviation: 7-8 GB/sec

You're not giving data comparing the 3? Could you do so? Thanks.

>
> Production verification
> -----------------------
>
> We have successfully deployed a variant of this approach across numerous
> Kubernetes production servers. The implementation enables THP for specific
> workloads (such as applications utilizing ZGC [3]) while disabling it for
> others. This selective deployment has operated flawlessly, with no
> regression reports to date.
>
> For ZGC-based applications, our verification demonstrates that shmem THP
> delivers significant improvements:
> - Reduced CPU utilization
> - Lower average latencies

Obviously it's _really key_ to point out that this feature is intended to be
_absolutely_ ephemeral - we may or may not implement something like this -
it's really about both exploring how such an interface might look and also
helping to determine how an 'automagic' future might look.

>
> Future work
> ===========
>
> Based on our validation with production workloads, we observed mixed
> results with XFS large folios (also known as File THP):
>
> - Performance Benefits
>   Some workloads demonstrated significant improvements with XFS large
>   folios enabled
> - Performance Regression
>   Some workloads experienced degradation when using XFS large folios
>
> These results demonstrate that File THP, similar to anonymous THP, requires
> a more granular approach instead of a uniform implementation.
>
> We will extend the BPF-based order selection mechanism to support File THP
> allocation policies.
>
> Link: https://lwn.net/ml/all/9bc57721-5287-416c-aa30-46932d605f63@xxxxxxxxxx/ [0]
> Link: https://github.com/kernel-patches/bpf/pull/9561 [1]
> Link: https://lwn.net/ml/all/a24d632d-4b11-4c88-9ed0-26fa12a0fce4@xxxxxxxxx/ [2]
> Link: https://wiki.openjdk.org/display/zgc/Main#Main-EnablingTransparentHugePagesOnLinux [3]
>
> Changes:
> =======
>
> RFC v5->v6:
> - Code improvement around the RCU usage (Usama)
> - Add selftests for khugepaged fork (Usama)
> - Add performance data for page fault (Usama)
> - Remove the RFC tag
>

Sorry I haven't been involved in the RFC reviews - I always intended to, but
workload etc.

Will be looking through this series, as I'm very interested in exploring
this approach.

Cheers, Lorenzo

> RFC v4->v5: https://lwn.net/Articles/1034265/
> - Add support for vma (David)
> - Add mTHP support in khugepaged (Zi)
> - Use bitmask of all allowed orders instead (Zi)
> - Retrieve the page size and PMD order rather than hardcoding them (Zi)
>
> RFC v3->v4: https://lwn.net/Articles/1031829/
> - Use a new interface get_suggested_order() (David)
> - Mark it as experimental (David, Lorenzo)
> - Code improvement in THP (Usama)
> - Code improvement in BPF struct ops (Amery)
>
> RFC v2->v3: https://lwn.net/Articles/1024545/
> - Finer-grained tuning based on madvise or always mode (David, Lorenzo)
> - Use BPF to write more advanced policies logic (David, Lorenzo)
>
> RFC v1->v2: https://lwn.net/Articles/1021783/
> The main changes are as follows,
> - Use struct_ops instead of fmod_ret (Alexei)
> - Introduce a new THP mode (Johannes)
> - Introduce new helpers for BPF hook (Zi)
> - Refine the commit log
>
> RFC v1: https://lwn.net/Articles/1019290/
>
> Yafang Shao (10):
>   mm: thp: add support for BPF based THP order selection
>   mm: thp: add a new kfunc bpf_mm_get_mem_cgroup()
>   mm: thp: add a new kfunc bpf_mm_get_task()
>   bpf: mark vma->vm_mm as trusted
>   selftests/bpf: add a simple BPF based THP policy
>   selftests/bpf: add test case for khugepaged fork
>   selftests/bpf: add test case to update thp policy
>   selftests/bpf: add test cases for invalid thp_adjust usage
>   Documentation: add BPF-based THP adjustment documentation
>   MAINTAINERS: add entry for BPF-based THP adjustment
>
>  Documentation/admin-guide/mm/transhuge.rst | 47 +++
>  MAINTAINERS | 10 +
>  include/linux/huge_mm.h | 15 +
>  include/linux/khugepaged.h | 12 +-
>  kernel/bpf/verifier.c | 5 +
>  mm/Kconfig | 12 +
>  mm/Makefile | 1 +
>  mm/bpf_thp.c | 269 ++++++++++++++
>  mm/huge_memory.c | 10 +
>  mm/khugepaged.c | 26 +-
>  mm/memory.c | 18 +-
>  tools/testing/selftests/bpf/config | 3 +
>  .../selftests/bpf/prog_tests/thp_adjust.c | 343 ++++++++++++++++++
>  .../selftests/bpf/progs/test_thp_adjust.c | 115 ++++++
>  .../bpf/progs/test_thp_adjust_trusted_vma.c | 27 ++
>  .../progs/test_thp_adjust_unreleased_memcg.c | 24 ++
>  .../progs/test_thp_adjust_unreleased_task.c | 25 ++
>  17 files changed, 955 insertions(+), 7 deletions(-)
>  create mode 100644 mm/bpf_thp.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_trusted_vma.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_unreleased_memcg.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_unreleased_task.c
>
> --
> 2.47.3
>