Re: [RFC PATCH v5 mm-new 1/5] mm: thp: add support for BPF based THP order selection

On Tue, Aug 19, 2025 at 7:11 PM Gutierrez Asier
<gutierrez.asier@xxxxxxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On 8/18/2025 4:17 PM, Usama Arif wrote:
> >
> >
> > On 18/08/2025 06:55, Yafang Shao wrote:
> >> This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
> >> THP tuning. It includes a hook get_suggested_order() [0], allowing BPF
> >> programs to influence THP order selection based on factors such as:
> >> - Workload identity
> >>   For example, workloads running in specific containers or cgroups.
> >> - Allocation context
> >>   Whether the allocation occurs during a page fault, khugepaged, or other
> >>   paths.
> >> - System memory pressure
> >>   (May require new BPF helpers to accurately assess memory pressure.)
> >>
> >> Key Details:
> >> - Only one BPF program can be attached at a time, but it can be updated
> >>   dynamically to adjust the policy.
> >> - Supports automatic mTHP order selection and per-workload THP policies.
> >> - Only functional when THP is set to madvise or always.
> >>
> >> It requires CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION to enable. [1]
> >> This feature is unstable and may evolve in future kernel versions.
> >>
> >> Link: https://lwn.net/ml/all/9bc57721-5287-416c-aa30-46932d605f63@xxxxxxxxxx/ [0]
> >> Link: https://lwn.net/ml/all/dda67ea5-2943-497c-a8e5-d81f0733047d@lucifer.local/ [1]
> >>
> >> Suggested-by: David Hildenbrand <david@xxxxxxxxxx>
> >> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>
> >> Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx>
> >> ---
> >>  include/linux/huge_mm.h    |  15 +++
> >>  include/linux/khugepaged.h |  12 ++-
> >>  mm/Kconfig                 |  12 +++
> >>  mm/Makefile                |   1 +
> >>  mm/bpf_thp.c               | 186 +++++++++++++++++++++++++++++++++++++
> >>  mm/huge_memory.c           |  10 ++
> >>  mm/khugepaged.c            |  26 +++++-
> >>  mm/memory.c                |  18 +++-
> >>  8 files changed, 273 insertions(+), 7 deletions(-)
> >>  create mode 100644 mm/bpf_thp.c
> >>
> >> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >> index 1ac0d06fb3c1..f0c91d7bd267 100644
> >> --- a/include/linux/huge_mm.h
> >> +++ b/include/linux/huge_mm.h
> >> @@ -6,6 +6,8 @@
> >>
> >>  #include <linux/fs.h> /* only for vma_is_dax() */
> >>  #include <linux/kobject.h>
> >> +#include <linux/pgtable.h>
> >> +#include <linux/mm.h>
> >>
> >>  vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
> >>  int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >> @@ -56,6 +58,7 @@ enum transparent_hugepage_flag {
> >>      TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
> >>      TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
> >>      TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
> >> +    TRANSPARENT_HUGEPAGE_BPF_ATTACHED,      /* BPF prog is attached */
> >>  };
> >>
> >>  struct kobject;
> >> @@ -195,6 +198,18 @@ static inline bool hugepage_global_always(void)
> >>                      (1<<TRANSPARENT_HUGEPAGE_FLAG);
> >>  }
> >>
> >> +#ifdef CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION
> >> +int get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> >> +                    u64 vma_flags, enum tva_type tva_flags, int orders);
> >> +#else
> >> +static inline int
> >> +get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> >> +                u64 vma_flags, enum tva_type tva_flags, int orders)
> >> +{
> >> +    return orders;
> >> +}
> >> +#endif
> >> +
> >>  static inline int highest_order(unsigned long orders)
> >>  {
> >>      return fls_long(orders) - 1;
> >> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> >> index eb1946a70cff..d81c1228a21f 100644
> >> --- a/include/linux/khugepaged.h
> >> +++ b/include/linux/khugepaged.h
> >> @@ -4,6 +4,8 @@
> >>
> >>  #include <linux/mm.h>
> >>
> >> +#include <linux/huge_mm.h>
> >> +
> >>  extern unsigned int khugepaged_max_ptes_none __read_mostly;
> >>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>  extern struct attribute_group khugepaged_attr_group;
> >> @@ -22,7 +24,15 @@ extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> >>
> >>  static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> >>  {
> >> -    if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm))
> >> +    /*
> >> +     * THP allocation policy can be dynamically modified via BPF. Even if a
> >> +     * task was allowed to allocate THPs, BPF can decide whether its forked
> >> +     * child can allocate THPs.
> >> +     *
> >> +     * The MMF_VM_HUGEPAGE flag will be cleared by khugepaged.
> >> +     */
> >> +    if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm) &&
> >> +            get_suggested_order(mm, NULL, 0, -1, BIT(PMD_ORDER)))
> >
> > Hi Yafang,
> >
> > From the cover letter, one of the potential use cases you are trying to solve for is when the global policy
> > is "never", but the workload wants THPs (either always or on an madvise basis). But over here,
> > MMF_VM_HUGEPAGE will never be set, so in that case mm_flags_test(MMF_VM_HUGEPAGE, oldmm) will
> > always evaluate to false and the get_suggested_order() call doesn't matter?
> >
> >
> >
> >>              __khugepaged_enter(mm);
> >>  }
> >>
> >> diff --git a/mm/Kconfig b/mm/Kconfig
> >> index 4108bcd96784..d10089e3f181 100644
> >> --- a/mm/Kconfig
> >> +++ b/mm/Kconfig
> >> @@ -924,6 +924,18 @@ config NO_PAGE_MAPCOUNT
> >>
> >>        EXPERIMENTAL because the impact of some changes is still unclear.
> >>
> >> +config EXPERIMENTAL_BPF_ORDER_SELECTION
> >> +    bool "BPF-based THP order selection (EXPERIMENTAL)"
> >> +    depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL
> >> +
> >> +    help
> >> +      Enable dynamic THP order selection using BPF programs. This
> >> +      experimental feature allows custom BPF logic to determine optimal
> >> +      transparent hugepage allocation sizes at runtime.
> >> +
> >> +      Warning: This feature is unstable and may change in future kernel
> >> +      versions.
> >> +
> >
> >
> > I know there was a discussion on this earlier, but my opinion is that marking all of this
> > as experimental with warnings is not great. No one will be able to deploy this in production
> > if it's going to be removed, and I believe that's where the real usage is.
> >
> If the goal is to deploy it in Kubernetes, I believe eBPF is the wrong way to do it. Right
> now eBPF is used mainly for networking (CNI).

As I recall, I've already shared the Kubernetes deployment procedure
with you. [0]
If you’re using k8s, you should definitely check this out.
JFYI, we have already deployed this in our Kubernetes production environment.

[0] https://lore.kernel.org/linux-mm/CALOAHbDJPP499ZDitUYqThAJ_BmpeWN_NVR-wm=8XBe3X7Wxkw@xxxxxxxxxxxxxx/
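To make it more concrete, below is a minimal sketch of the kind of struct_ops
program involved. It assumes the hook signature proposed in this RFC
(get_suggested_order() in bpf_thp_ops) together with a TVA_PAGEFAULT value in
enum tva_type and a hard-coded PMD_ORDER; those details are illustrative only
and may change as the series evolves.

/* SPDX-License-Identifier: GPL-2.0 */
/*
 * Sketch of a bpf_thp_ops policy, assuming the hook signature from this
 * RFC and CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION=y.  The TVA_PAGEFAULT
 * value and the PMD_ORDER definition are assumptions for illustration.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

#define PMD_ORDER 9	/* assumes 4K base pages, i.e. 2M PMD-sized THP */

SEC("struct_ops/get_suggested_order")
int BPF_PROG(suggested_order, struct mm_struct *mm,
	     struct vm_area_struct *vma__nullable, u64 vma_flags,
	     enum tva_type tva_flags, int orders)
{
	/*
	 * Example policy: allow PMD-sized THP only on the page-fault
	 * path and fall back to order 0 everywhere else.  In this RFC
	 * the fork path passes -1 as tva_flags, so it is rejected too.
	 */
	if (tva_flags == TVA_PAGEFAULT)
		return orders & (1 << PMD_ORDER);
	return 0;
}

SEC(".struct_ops.link")
struct bpf_thp_ops thp_policy = {
	.get_suggested_order = (void *)suggested_order,
};

On the user-space side the program is loaded like any other struct_ops map,
e.g. via a libbpf skeleton and bpf_map__attach_struct_ops(), and can be
swapped at runtime to change the policy.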

>
> Kubernetes currently has something called Dynamic Resource Allocation (DRA), which is already
> in alpha. Its main use is to share GPUs and TPUs among many pods. Still, we should
> take into account how likely user space is to use eBPF for controlling resources and
> how it would integrate with the mechanisms currently available to user space for resource
> control.

This is unrelated to the current feature.

>
> There is another scenario, where you have a number of pods and a limit on the huge pages you
> want to share among them, similar to hugetlbfs. Could this be achieved with your
> eBPF implementation?

This feature focuses on policy adjustment rather than resource control.
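To illustrate what "policy adjustment" means here, a per-workload policy can be
expressed in the program itself rather than as a page budget. The sketch below
reuses the scaffolding from the earlier example; the cgroup id is a
placeholder, and it assumes bpf_get_current_cgroup_id() is usable from this
struct_ops context.

#define TARGET_CGID 1234ULL	/* placeholder: cgroup id of the target pod */

SEC("struct_ops/get_suggested_order")
int BPF_PROG(per_workload_order, struct mm_struct *mm,
	     struct vm_area_struct *vma__nullable, u64 vma_flags,
	     enum tva_type tva_flags, int orders)
{
	/*
	 * Allow the caller-provided orders only for one workload and
	 * disable THP for everything else.  Note that on the khugepaged
	 * path "current" is khugepaged itself, so a real policy would
	 * need to derive the workload from the mm instead.
	 */
	if (bpf_get_current_cgroup_id() == TARGET_CGID)
		return orders;
	return 0;
}

Hard limits on how many huge pages a set of pods may consume would still need
a separate accounting mechanism; this hook only decides which orders a given
allocation is allowed to try.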

-- 
Regards
Yafang




