Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment

On Tue, May 20, 2025 at 12:06 AM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
>
> Background
> ----------
>
> At my current employer, PDD, we have consistently configured THP to "never"
> on our production servers due to past incidents caused by its behavior:
>
> - Increased memory consumption
>   THP significantly raises overall memory usage.
>
> - Latency spikes
>   Random latency spikes occur due to more frequent memory compaction
>   activity triggered by THP.
>
> These issues have made sysadmins hesitant to switch to "madvise" or
> "always" modes.
>
> New Motivation
> --------------
>
> We have now identified that certain AI workloads achieve substantial
> performance gains with THP enabled. However, we’ve also verified that some
> workloads see little to no benefit—or are even negatively impacted—by THP.
>
> In our Kubernetes environment, we deploy mixed workloads on a single server
> to maximize resource utilization. Our goal is to selectively enable THP for
> services that benefit from it while keeping it disabled for others. This
> approach allows us to incrementally enable THP for additional services and
> assess how to make it more viable in production.
>
> Proposed Solution
> -----------------
>
> For this use case, Johannes suggested introducing a dedicated mode [0]. In
> this new mode, we could implement BPF-based THP adjustment for fine-grained
> control over tasks or cgroups. If no BPF program is attached, THP remains
> in "never" mode. This solution elegantly meets our needs while avoiding the
> complexity of managing BPF alongside other THP modes.
>
> A selftest example demonstrates how to enable THP for the current task
> while keeping it disabled for others.
>
> Alternative Proposals
> ---------------------
>
> - Gutierrez’s cgroup-based approach [1]
>   - Proposed adding a new cgroup file to control THP policy.
>   - However, as Johannes noted, cgroups are designed for hierarchical
>     resource allocation, not arbitrary policy settings [2].
>
> - Usama’s per-task THP proposal based on prctl() [3]:
>   - Enabling THP per task via prctl().
>   - As David pointed out, neither madvise() nor prctl() works in "never"
>     mode [4], making this solution insufficient for our needs.

Hi Yafang Shao,

I believe you would have to invert your logic: set THP to "madvise" or
"always" and disable THP for the processes you don't want using it. I
have yet to look over Usama's solution in detail, but based on his
cover letter I believe this is possible.
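
For illustration only, here is a minimal sketch of what that inversion
could look like from userspace. It assumes the system-wide mode is
already "madvise" or "always", and it uses the existing
PR_SET_THP_DISABLE prctl rather than anything from Usama's series. The
helper names (thp_disable_self, thp_is_disabled, exec_without_thp) are
made up for this example.

```c
/*
 * Sketch: opt a process out of THP while the system-wide mode stays
 * "madvise" or "always". Relies on the existing PR_SET_THP_DISABLE
 * prctl; the function names below are hypothetical.
 */
#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/prctl.h>

/* Fallbacks for older userspace headers (values from linux/prctl.h). */
#ifndef PR_SET_THP_DISABLE
#define PR_SET_THP_DISABLE 41
#endif
#ifndef PR_GET_THP_DISABLE
#define PR_GET_THP_DISABLE 42
#endif

/* Set the "THP disable" flag on the calling process. Returns 0 on success. */
int thp_disable_self(void)
{
	return prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);
}

/* Query the flag: non-zero if set, 0 if clear, -1 on error. */
int thp_is_disabled(void)
{
	return prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0);
}

/*
 * Launcher helper: disable THP, then exec the target program.
 * Only returns on failure.
 */
int exec_without_thp(char *const argv[])
{
	if (thp_disable_self() != 0) {
		perror("prctl(PR_SET_THP_DISABLE)");
		return -1;
	}
	execvp(argv[0], argv);
	perror("execvp");
	return -1;
}
```

A wrapper binary could call exec_without_thp(&argv[1]) from its main()
to start a service that should not get THPs. The flag is inherited
across fork() and preserved across execve(), so the service does not
need to be modified.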

I have also proposed an alternative solution here:
https://lore.kernel.org/lkml/20250515033857.132535-1-npache@xxxxxxxxxx/

It differs in that it does not give you granular per-process or
per-cgroup control, or BPF programmability, but it "may" suit your
needs by taming THP memory waste and removing the latency spikes
caused by page-fault-time THP compaction and allocation.

Cheers,
-- Nico

>
> Conclusion
> ----------
>
> Introducing a new "bpf" mode for BPF-based per-task THP adjustments is the
> most effective solution for our requirements. This approach represents a
> small but meaningful step toward making THP truly usable—and manageable—in
> production environments.
>
> This is currently a PoC implementation. Feedback of any kind is welcome.
>
> Link: https://lore.kernel.org/linux-mm/20250509164654.GA608090@xxxxxxxxxxx/ [0]
> Link: https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@xxxxxxxxxxxxxxxxxxx/ [1]
> Link: https://lore.kernel.org/linux-mm/20250430175954.GD2020@xxxxxxxxxxx/ [2]
> Link: https://lore.kernel.org/linux-mm/20250519223307.3601786-1-usamaarif642@xxxxxxxxx/ [3]
> Link: https://lore.kernel.org/linux-mm/41e60fa0-2943-4b3f-ba92-9f02838c881b@xxxxxxxxxx/ [4]
>
> RFC v1->v2:
> The main changes are as follows,
> - Use struct_ops instead of fmod_ret (Alexei)
> - Introduce a new THP mode (Johannes)
> - Introduce new helpers for BPF hook (Zi)
> - Refine the commit log
>
> RFC v1: https://lwn.net/Articles/1019290/
>
> Yafang Shao (5):
>   mm: thp: Add a new mode "bpf"
>   mm: thp: Add hook for BPF based THP adjustment
>   mm: thp: add struct ops for BPF based THP adjustment
>   bpf: Add get_current_comm to bpf_base_func_proto
>   selftests/bpf: Add selftest for THP adjustment
>
>  include/linux/huge_mm.h                       |  15 +-
>  kernel/bpf/cgroup.c                           |   2 -
>  kernel/bpf/helpers.c                          |   2 +
>  mm/Makefile                                   |   3 +
>  mm/bpf_thp.c                                  | 120 ++++++++++++
>  mm/huge_memory.c                              |  65 ++++++-
>  mm/khugepaged.c                               |   3 +
>  tools/testing/selftests/bpf/config            |   1 +
>  .../selftests/bpf/prog_tests/thp_adjust.c     | 175 ++++++++++++++++++
>  .../selftests/bpf/progs/test_thp_adjust.c     |  39 ++++
>  10 files changed, 414 insertions(+), 11 deletions(-)
>  create mode 100644 mm/bpf_thp.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
>
> --
> 2.43.5
>





