Re: [PATCH v7 mm-new 02/10] mm: thp: add support for BPF based THP order selection

Lance Yang <lance.yang@xxxxxxxxx> · Wed, 10 Sep 2025 20:54:18 +0800

On Wed, Sep 10, 2025 at 8:42 PM Lance Yang <lance.yang@xxxxxxxxx> wrote:
>
> Hey Yafang,
>
> On Wed, Sep 10, 2025 at 10:53 AM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> >
> > This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
> > THP tuning. It includes a hook bpf_hook_thp_get_order(), allowing BPF
> > programs to influence THP order selection based on factors such as:
> > - Workload identity
> >   For example, workloads running in specific containers or cgroups.
> > - Allocation context
> >   Whether the allocation occurs during a page fault, khugepaged, swap or
> >   other paths.
> > - VMA's memory advice settings
> >   MADV_HUGEPAGE or MADV_NOHUGEPAGE
> > - Memory pressure
> >   PSI system data or associated cgroup PSI metrics
> >
> > The kernel API of this new BPF hook is as follows,
> >
> > /**
> >  * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
> >  * @vma: vm_area_struct associated with the THP allocation
> >  * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set
> >  *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
> >  *            neither is set.
> >  * @tva_type: TVA type for current @vma
> >  * @orders: Bitmask of requested THP orders for this allocation
> >  *          - PMD-mapped allocation if PMD_ORDER is set
> >  *          - mTHP allocation otherwise
> >  *
> >  * Return: The suggested THP order from the BPF program for allocation. It will
> >  *         not exceed the highest requested order in @orders. Return -1 to
> >  *         indicate that the original requested @orders should remain unchanged.
> >  */
> > typedef int thp_order_fn_t(struct vm_area_struct *vma,
> >                            enum bpf_thp_vma_type vma_type,
> >                            enum tva_type tva_type,
> >                            unsigned long orders);
> >
> > Only a single BPF program can be attached at any given time, though it can
> > be dynamically updated to adjust the policy. The implementation supports
> > anonymous THP, shmem THP, and mTHP, with future extensions planned for
> > file-backed THP.
> >
> > This functionality is only active when system-wide THP is configured to
> > madvise or always mode. It remains disabled in never mode. Additionally,
> > if THP is explicitly disabled for a specific task via prctl(), this BPF
> > functionality will also be unavailable for that task.
> >
> > This feature requires CONFIG_BPF_GET_THP_ORDER (marked EXPERIMENTAL) to be
> > enabled. Note that this capability is currently unstable and may undergo
> > significant changes—including potential removal—in future kernel versions.
> >
> > Suggested-by: David Hildenbrand <david@xxxxxxxxxx>
> > Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>
> > Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx>
> > ---
> [...]
> > diff --git a/mm/huge_memory_bpf.c b/mm/huge_memory_bpf.c
> > new file mode 100644
> > index 000000000000..525ee22ab598
> > --- /dev/null
> > +++ b/mm/huge_memory_bpf.c
> > @@ -0,0 +1,243 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * BPF-based THP policy management
> > + *
> > + * Author: Yafang Shao <laoar.shao@xxxxxxxxx>
> > + */
> > +
> > +#include <linux/bpf.h>
> > +#include <linux/btf.h>
> > +#include <linux/huge_mm.h>
> > +#include <linux/khugepaged.h>
> > +
> > +enum bpf_thp_vma_type {
> > +       BPF_THP_VM_NONE = 0,
> > +       BPF_THP_VM_HUGEPAGE,    /* VM_HUGEPAGE */
> > +       BPF_THP_VM_NOHUGEPAGE,  /* VM_NOHUGEPAGE */
> > +};
> > +
> > +/**
> > + * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
> > + * @vma: vm_area_struct associated with the THP allocation
> > + * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set
> > + *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
> > + *            neither is set.
> > + * @tva_type: TVA type for current @vma
> > + * @orders: Bitmask of requested THP orders for this allocation
> > + *          - PMD-mapped allocation if PMD_ORDER is set
> > + *          - mTHP allocation otherwise
> > + *
> > + * Return: The suggested THP order from the BPF program for allocation. It will
> > + *         not exceed the highest requested order in @orders. Return -1 to
> > + *         indicate that the original requested @orders should remain unchanged.
>
> A minor documentation nit: the comment says "Return -1 to indicate that the
> original requested @orders should remain unchanged". It might be slightly
> clearer to say "Return a negative value to fall back to the original
> behavior". This would cover all error codes as well ;)
>
> > + */
> > +typedef int thp_order_fn_t(struct vm_area_struct *vma,
> > +                          enum bpf_thp_vma_type vma_type,
> > +                          enum tva_type tva_type,
> > +                          unsigned long orders);
>
> Sorry if I'm missing some context here since I haven't tracked the whole
> series closely.
>
> Regarding the return value for thp_order_fn_t: right now it returns a
> single int order. I was thinking, what if we let it return an unsigned
> long bitmask of orders instead? This seems like it would be more flexible
> down the road, especially if we get more mTHP sizes to choose from. It
> would also make the API more consistent, as bpf_hook_thp_get_orders()
> itself returns an unsigned long ;)

I just realized a flaw in my previous suggestion :(

Changing the return type of thp_order_fn_t to unsigned long for consistency
and flexibility. However, I completely overlooked that this would prevent
the BPF program from returning negative error codes ...

Thanks,
Lance

>
> Also, for future extensions, it might be a good idea to add a reserved
> flags argument to the thp_order_fn_t signature.
>
> For example thp_order_fn_t(..., unsigned long flags).
>
> This would give us aforward-compatible way to add new semantics later
> without breaking the ABI and needing a v2. We could just require it to be
> 0 for now.
>
> Thanks for the great work!
> Lance