Re: [RFC PATCH v5 mm-new 1/5] mm: thp: add support for BPF based THP order selection

Yafang Shao <laoar.shao@xxxxxxxxx> · Tue, 19 Aug 2025 11:08:11 +0800

> Hi Yafang,
>
> From the coverletter, one of the potential usecases you are trying to solve for is if global policy
> is "never", but the workload want THPs (either always or on madvise basis). But over here,
> MMF_VM_HUGEPAGE will never be set so in that case mm_flags_test(MMF_VM_HUGEPAGE, oldmm) will
> always evaluate to false and the get_sugested_order call doesnt matter?

See the replyment in another thread.

>
>
>
> >               __khugepaged_enter(mm);
> >  }
> >
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 4108bcd96784..d10089e3f181 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -924,6 +924,18 @@ config NO_PAGE_MAPCOUNT
> >
> >         EXPERIMENTAL because the impact of some changes is still unclear.
> >
> > +config EXPERIMENTAL_BPF_ORDER_SELECTION
> > +     bool "BPF-based THP order selection (EXPERIMENTAL)"
> > +     depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL
> > +
> > +     help
> > +       Enable dynamic THP order selection using BPF programs. This
> > +       experimental feature allows custom BPF logic to determine optimal
> > +       transparent hugepage allocation sizes at runtime.
> > +
> > +       Warning: This feature is unstable and may change in future kernel
> > +       versions.
> > +
>
>
> I know there was a discussion on this earlier, but my opinion is that putting all of this
> as experiment with warnings is not great. No one will be able to deploy this in production
> if its going to be removed, and I believe thats where the real usage is.

See the replyment in another thread.

>
> >  endif # TRANSPARENT_HUGEPAGE
> >
> >  # simple helper to make the code a bit easier to read
> > diff --git a/mm/Makefile b/mm/Makefile
> > index ef54aa615d9d..cb55d1509be1 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -99,6 +99,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
> >  obj-$(CONFIG_NUMA) += memory-tiers.o
> >  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
> >  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> > +obj-$(CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION) += bpf_thp.o
> >  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> >  obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
> >  obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
> > diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
> > new file mode 100644
> > index 000000000000..2b03539452d1
> > --- /dev/null
> > +++ b/mm/bpf_thp.c
> > @@ -0,0 +1,186 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +#include <linux/bpf.h>
> > +#include <linux/btf.h>
> > +#include <linux/huge_mm.h>
> > +#include <linux/khugepaged.h>
> > +
> > +struct bpf_thp_ops {
> > +     /**
> > +      * @get_suggested_order: Get the suggested THP orders for allocation
> > +      * @mm: mm_struct associated with the THP allocation
> > +      * @vma__nullable: vm_area_struct associated with the THP allocation (may be NULL)
> > +      *                 When NULL, the decision should be based on @mm (i.e., when
> > +      *                 triggered from an mm-scope hook rather than a VMA-specific
> > +      *                 context).
> > +      *                 Must belong to @mm (guaranteed by the caller).
> > +      * @vma_flags: use these vm_flags instead of @vma->vm_flags (0 if @vma is NULL)
> > +      * @tva_flags: TVA flags for current @vma (-1 if @vma is NULL)
> > +      * @orders: Bitmask of requested THP orders for this allocation
> > +      *          - PMD-mapped allocation if PMD_ORDER is set
> > +      *          - mTHP allocation otherwise
> > +      *
> > +      * Rerurn: Bitmask of suggested THP orders for allocation. The highest
> > +      *         suggested order will not exceed the highest requested order
> > +      *         in @orders.
>
> If we want to make this generic enough so that it doesnt change, should we allow suggested order to
> exceed highest requested order?

The maximum requested order is determined by the callsite. For example:
- PMD-mapped THP uses PMD_ORDER
- mTHP uses (PMD_ORDER - 1)

We must respect this upper bound to avoid undefined behavior.

>
> > +      */
> > +     int (*get_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> > +                                u64 vma_flags, enum tva_type tva_flags, int orders) __rcu;
> > +};
> > +
> > +static struct bpf_thp_ops bpf_thp;
> > +static DEFINE_SPINLOCK(thp_ops_lock);
> > +
> > +int get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> > +                     u64 vma_flags, enum tva_type tva_flags, int orders)
> > +{
> > +     int (*bpf_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> > +                                u64 vma_flags, enum tva_type tva_flags, int orders);
> > +     int suggested_orders = orders;
> > +
> > +     /* No BPF program is attached */
> > +     if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> > +                   &transparent_hugepage_flags))
> > +             return suggested_orders;
> > +
> > +     rcu_read_lock();
> > +     bpf_suggested_order = rcu_dereference(bpf_thp.get_suggested_order);
> > +     if (!bpf_suggested_order)
> > +             goto out;
>
>
> My rcu API knowledge is not the best, but maybe we could do:
>
> if (!rcu_access_pointer(bpf_thp.get_suggested_order))
>         return suggested_orders;
>

There might be a race here.  The current rcu_access_pointer() check
occurs outside the RCU read-side critical section, meaning the
protected pointer could be freed between the check and use.
Therefore, we must perform the NULL check within the RCU read critical
section when dereferencing the pointer:

> rcu_read_lock();
> bpf_suggested_order = rcu_dereference(bpf_thp.get_suggested_order);

 if (!bpf_suggested_order)
     goto out;

>
> I believe this might be better as you dont acquire the rcu_read_lock and avoid
> the lockdep checks when you do rcu_access_pointer, but I might be wrong
> and hope you or someone on the mailing list corrects me if thats the case :)
>
> > +
> > +     suggested_orders = bpf_suggested_order(mm, vma__nullable, vma_flags, tva_flags, orders);
> > +     if (highest_order(suggested_orders) > highest_order(orders))
> > +             suggested_orders = orders;
> > +
>
> Maybe we should allow suggested_order to be greater than order if we want this to be truly generic?
> Not a suggestion to change, just to have a discussion.

As replied above, the upper bound is a limitation.

>
> > +out:
> > +     rcu_read_unlock();
> > +     return suggested_orders;
> > +}
> > +
> > +static bool bpf_thp_ops_is_valid_access(int off, int size,
> > +                                     enum bpf_access_type type,
> > +                                     const struct bpf_prog *prog,
> > +                                     struct bpf_insn_access_aux *info)
> > +{
> > +     return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
> > +}
> > +
> > +static const struct bpf_func_proto *
> > +bpf_thp_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> > +{
> > +     return bpf_base_func_proto(func_id, prog);
> > +}
> > +
> > +static const struct bpf_verifier_ops thp_bpf_verifier_ops = {
> > +     .get_func_proto = bpf_thp_get_func_proto,
> > +     .is_valid_access = bpf_thp_ops_is_valid_access,
> > +};
> > +
> > +static int bpf_thp_init(struct btf *btf)
> > +{
> > +     return 0;
> > +}
> > +
> > +static int bpf_thp_init_member(const struct btf_type *t,
> > +                            const struct btf_member *member,
> > +                            void *kdata, const void *udata)
> > +{
> > +     return 0;
> > +}
> > +
> > +static int bpf_thp_reg(void *kdata, struct bpf_link *link)
> > +{
> > +     struct bpf_thp_ops *ops = kdata;
> > +
> > +     spin_lock(&thp_ops_lock);
> > +     if (test_and_set_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> > +             &transparent_hugepage_flags)) {
> > +             spin_unlock(&thp_ops_lock);
> > +             return -EBUSY;
> > +     }
> > +     WARN_ON_ONCE(bpf_thp.get_suggested_order);
>
> Should it be WARN_ON_ONCE(rcu_access_pointer(bpf_thp.get_suggested_order)) ?

Nice catch.  I'll make that change

>
> > +     WRITE_ONCE(bpf_thp.get_suggested_order, ops->get_suggested_order);
>
> Should it be rcu_assign_pointer instead of WRITE_ONCE?

will change it.

>
> > +     spin_unlock(&thp_ops_lock);
> > +     return 0;
> > +}
> > +
> > +static void bpf_thp_unreg(void *kdata, struct bpf_link *link)
> > +{
> > +     spin_lock(&thp_ops_lock);
> > +     clear_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags);
> > +     WARN_ON_ONCE(!bpf_thp.get_suggested_order);
>
> Maybe need to use use rcu_access_pointer here in the warning?

Agreed, will change it.

>
> > +     rcu_replace_pointer(bpf_thp.get_suggested_order, NULL, lockdep_is_held(&thp_ops_lock));
> > +     spin_unlock(&thp_ops_lock);
> > +
> > +     synchronize_rcu();
> > +}
> > +
> > +static int bpf_thp_update(void *kdata, void *old_kdata, struct bpf_link *link)
> > +{
> > +     struct bpf_thp_ops *ops = kdata;
> > +     struct bpf_thp_ops *old = old_kdata;
> > +     int ret = 0;
> > +
> > +     if (!ops || !old)
> > +             return -EINVAL;
> > +
> > +     spin_lock(&thp_ops_lock);
> > +     /* The prog has aleady been removed. */
> > +     if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags)) {
> > +             ret = -ENOENT;
> > +             goto out;
> > +     }
> > +     WARN_ON_ONCE(!bpf_thp.get_suggested_order);
>
> As above about using rcu_access_pointer.

will change it.

Thanks for your review!

-- 
Regards
Yafang