Background
==========

Our production servers consistently configure THP to "never" due to
historical incidents caused by its behavior. Key issues include:

- Increased Memory Consumption
  THP significantly raises overall memory usage, reducing the memory
  available to workloads.

- Latency Spikes
  Random latency spikes occur due to frequent memory compaction triggered
  by THP.

- Lack of Fine-Grained Control
  THP tuning is globally configured, making it unsuitable for containerized
  environments. When multiple workloads share a host, enabling THP without
  per-workload control leads to unpredictable behavior.

Because of these issues, administrators will not switch to the madvise or
always modes unless per-workload THP control is available.

To address this, we propose a BPF-based THP policy that allows flexible
adjustment. Additionally, as David mentioned [0], this mechanism can also
serve as a policy prototyping tool (test policies via BPF before
upstreaming them).

Proposed Solution
=================

As suggested by David [0], we introduce a new BPF interface:

/**
 * @get_suggested_order: Get the suggested THP orders for allocation
 * @mm: mm_struct associated with the THP allocation
 * @vma__nullable: vm_area_struct associated with the THP allocation
 *                 (may be NULL).
 *                 When NULL, the decision should be based on @mm (i.e.,
 *                 when triggered from an mm-scope hook rather than a
 *                 VMA-specific context).
 *                 Must belong to @mm (guaranteed by the caller).
 * @vma_flags: use these vm_flags instead of @vma->vm_flags (0 if @vma is NULL)
 * @tva_flags: TVA flags for current @vma (-1 if @vma is NULL)
 * @orders: Bitmask of requested THP orders for this allocation
 *          - PMD-mapped allocation if PMD_ORDER is set
 *          - mTHP allocation otherwise
 *
 * Return: Bitmask of suggested THP orders for allocation. The highest
 *         suggested order will not exceed the highest requested order
 *         in @orders.
 */
int (*get_suggested_order)(struct mm_struct *mm,
                           struct vm_area_struct *vma__nullable,
                           u64 vma_flags, enum tva_type tva_flags,
                           int orders) __rcu;

This interface:
- Supports both use cases (per-workload tuning and policy prototyping).
- Can be extended with BPF helpers (e.g., for memory pressure awareness).

This is an experimental feature. To use it, you must enable
CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION.

Warning:
- The interface may change.
- Behavior may differ in future kernel versions.
- We might remove it in the future.

Selftests
=========

BPF selftests
-------------

Patch #5: Implements a basic BPF THP policy that restricts THP allocation
          via khugepaged to tasks within a specified memory cgroup.
Patch #6: Contains test cases validating the khugepaged fork behavior.
Patch #7: Provides tests for dynamic BPF program updates and replacement.
Patch #8: Includes negative tests for invalid BPF helper usage, verifying
          proper rejection by the BPF verifier.
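To make the interface concrete, below is a minimal sketch of a struct_ops
policy in the spirit of Patch #5 (restrict THP to tasks in one memory
cgroup). The struct_ops type name (bpf_thp_ops), the kfunc names and
signatures, and the cgroup-id matching are illustrative assumptions based
on this cover letter, not verbatim code from the series:

/* Illustrative sketch only; names and signatures are assumptions. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

char _license[] SEC("license") = "GPL";

/* Hypothetical kfuncs modeled on patches #2 and #3 of this series. */
struct mem_cgroup *bpf_mm_get_mem_cgroup(struct mm_struct *mm) __ksym;
void bpf_put_mem_cgroup(struct mem_cgroup *memcg) __ksym;

/* cgroup id (kernfs node id) of the workload allowed to use THP;
 * filled in by the loader before attach.
 */
const volatile u64 target_cgrp_id;

SEC("struct_ops/get_suggested_order")
int BPF_PROG(suggest_order, struct mm_struct *mm,
	     struct vm_area_struct *vma__nullable, u64 vma_flags,
	     enum tva_type tva_flags, int orders)
{
	struct mem_cgroup *memcg;
	int suggested = 0;

	memcg = bpf_mm_get_mem_cgroup(mm);
	if (!memcg)
		return 0;

	/* Suggest the requested orders only for the target memcg;
	 * every other workload gets no THP.
	 */
	if (BPF_CORE_READ(memcg, css.cgroup, kn, id) == target_cgrp_id)
		suggested = orders;

	/* The memcg reference must be released; the "unreleased_memcg"
	 * negative test in Patch #8 exercises the verifier check for this.
	 */
	bpf_put_mem_cgroup(memcg);
	return suggested;
}

SEC(".struct_ops.link")
struct bpf_thp_ops thp_policy = {
	.get_suggested_order = (void *)suggest_order,
};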
Currently, several dependency patches reside in mm-new but have not been
merged into bpf-next:

  mm: add bitmap mm->flags field
  mm/huge_memory: convert "tva_flags" to "enum tva_type"
  mm: convert core mm to mm_flags_*() accessors

To enable BPF CI testing, these dependencies were manually applied to
bpf-next [1]. All selftests in this series pass successfully; the observed
CI failures are unrelated to these changes.

Performance Evaluation
----------------------

As suggested by Usama [2], performance impact was measured given the page
fault handler modifications. The standard `perf bench mem memset` benchmark
was employed to assess page fault performance.

Testing was conducted on an AMD EPYC 7W83 64-Core Processor (single NUMA
node). Due to variance between individual test runs, a script executed
10,000 iterations to calculate meaningful averages and standard deviations.

Three configurations were measured:

- Baseline (without this patch series)
- With the patch series applied but no BPF program attached
- With the patch series applied and a BPF program attached

The results show negligible performance impact across all three
configurations:

  Number of runs:     10,000
  Average throughput: 40-41 GB/sec
  Standard deviation: 7-8 GB/sec
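The measurement script itself is not included in this series; the
following is a minimal C sketch of an equivalent averaging harness. It
assumes perf is in PATH and that `perf bench mem memset` reports
throughput on a line containing "GB/sec" (build with -lm):

/* Minimal sketch of the averaging harness (not the actual script used). */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define RUNS 10000

/* Run one iteration and return the reported throughput in GB/sec. */
static double run_once(void)
{
	FILE *p = popen("perf bench mem memset 2>&1", "r");
	char line[256];
	double gbps = -1.0;

	if (!p)
		return -1.0;
	while (fgets(line, sizeof(line), p)) {
		/* Assumes output like "      41.810340 GB/sec". */
		if (strstr(line, "GB/sec"))
			sscanf(line, "%lf", &gbps);
	}
	pclose(p);
	return gbps;
}

int main(void)
{
	double sum = 0.0, sumsq = 0.0;
	int n = 0;

	for (int i = 0; i < RUNS; i++) {
		double v = run_once();

		if (v < 0)
			continue;	/* skip failed runs */
		sum += v;
		sumsq += v * v;
		n++;
	}
	if (!n)
		return 1;

	double mean = sum / n;
	double stddev = sqrt(sumsq / n - mean * mean);

	printf("runs=%d mean=%.2f GB/sec stddev=%.2f GB/sec\n",
	       n, mean, stddev);
	return 0;
}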
Production verification
-----------------------

We have successfully deployed a variant of this approach across numerous
Kubernetes production servers. The implementation enables THP for specific
workloads (such as applications utilizing ZGC [3]) while disabling it for
others. This selective deployment has operated flawlessly, with no
regression reports to date.

For ZGC-based applications, our verification demonstrates that shmem THP
delivers significant improvements:

- Reduced CPU utilization
- Lower average latencies

Future work
===========

Based on our validation with production workloads, we observed mixed
results with XFS large folios (also known as File THP):

- Performance Benefits
  Some workloads demonstrated significant improvements with XFS large
  folios enabled.

- Performance Regression
  Some workloads experienced degradation when using XFS large folios.

These results demonstrate that File THP, like anonymous THP, requires a
granular approach rather than a uniform implementation. We will extend the
BPF-based order selection mechanism to support File THP allocation
policies.

Link: https://lwn.net/ml/all/9bc57721-5287-416c-aa30-46932d605f63@xxxxxxxxxx/ [0]
Link: https://github.com/kernel-patches/bpf/pull/9561 [1]
Link: https://lwn.net/ml/all/a24d632d-4b11-4c88-9ed0-26fa12a0fce4@xxxxxxxxx/ [2]
Link: https://wiki.openjdk.org/display/zgc/Main#Main-EnablingTransparentHugePagesOnLinux [3]

Changes:
========
RFC v5->v6:
- Code improvement around the RCU usage (Usama)
- Add selftests for khugepaged fork (Usama)
- Add performance data for page fault (Usama)
- Remove the RFC tag

RFC v4->v5: https://lwn.net/Articles/1034265/
- Add support for vma (David)
- Add mTHP support in khugepaged (Zi)
- Use bitmask of all allowed orders instead (Zi)
- Retrieve the page size and PMD order rather than hardcoding them (Zi)

RFC v3->v4: https://lwn.net/Articles/1031829/
- Use a new interface get_suggested_order() (David)
- Mark it as experimental (David, Lorenzo)
- Code improvement in THP (Usama)
- Code improvement in BPF struct ops (Amery)

RFC v2->v3: https://lwn.net/Articles/1024545/
- Finer-grained tuning based on madvise or always mode (David, Lorenzo)
- Use BPF to write more advanced policy logic (David, Lorenzo)

RFC v1->v2: https://lwn.net/Articles/1021783/
- Use struct_ops instead of fmod_ret (Alexei)
- Introduce a new THP mode (Johannes)
- Introduce new helpers for BPF hook (Zi)
- Refine the commit log

RFC v1: https://lwn.net/Articles/1019290/

Yafang Shao (10):
  mm: thp: add support for BPF based THP order selection
  mm: thp: add a new kfunc bpf_mm_get_mem_cgroup()
  mm: thp: add a new kfunc bpf_mm_get_task()
  bpf: mark vma->vm_mm as trusted
  selftests/bpf: add a simple BPF based THP policy
  selftests/bpf: add test case for khugepaged fork
  selftests/bpf: add test case to update thp policy
  selftests/bpf: add test cases for invalid thp_adjust usage
  Documentation: add BPF-based THP adjustment documentation
  MAINTAINERS: add entry for BPF-based THP adjustment

 Documentation/admin-guide/mm/transhuge.rst    |  47 +++
 MAINTAINERS                                   |  10 +
 include/linux/huge_mm.h                       |  15 +
 include/linux/khugepaged.h                    |  12 +-
 kernel/bpf/verifier.c                         |   5 +
 mm/Kconfig                                    |  12 +
 mm/Makefile                                   |   1 +
 mm/bpf_thp.c                                  | 269 ++++++++++++++
 mm/huge_memory.c                              |  10 +
 mm/khugepaged.c                               |  26 +-
 mm/memory.c                                   |  18 +-
 tools/testing/selftests/bpf/config            |   3 +
 .../selftests/bpf/prog_tests/thp_adjust.c     | 343 ++++++++++++++++++
 .../selftests/bpf/progs/test_thp_adjust.c     | 115 ++++++
 .../bpf/progs/test_thp_adjust_trusted_vma.c   |  27 ++
 .../progs/test_thp_adjust_unreleased_memcg.c  |  24 ++
 .../progs/test_thp_adjust_unreleased_task.c   |  25 ++
 17 files changed, 955 insertions(+), 7 deletions(-)
 create mode 100644 mm/bpf_thp.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_trusted_vma.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_unreleased_memcg.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_unreleased_task.c

--
2.47.3