Re: [PATCH v6 mm-new 01/10] mm: thp: add support for BPF based THP order selection

Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> · Thu, 28 Aug 2025 11:50:16 +0100

On Thu, Aug 28, 2025 at 01:54:39PM +0800, Yafang Shao wrote:
> On Wed, Aug 27, 2025 at 11:03 PM Lorenzo Stoakes
> <lorenzo.stoakes@xxxxxxxxxx> wrote:
> >
> > On Tue, Aug 26, 2025 at 03:19:39PM +0800, Yafang Shao wrote:
> > > This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
> > > THP tuning. It includes a hook get_suggested_order() [0], allowing BPF
> > > programs to influence THP order selection based on factors such as:
> > > - Workload identity
> > >   For example, workloads running in specific containers or cgroups.
> > > - Allocation context
> > >   Whether the allocation occurs during a page fault, khugepaged, or other
> > >   paths.
> > > - System memory pressure
> > >   (May require new BPF helpers to accurately assess memory pressure.)
> > >
> > > Key Details:
> > > - Only one BPF program can be attached at a time, but it can be updated
> > >   dynamically to adjust the policy.
> > > - Supports automatic mTHP order selection and per-workload THP policies.
> > > - Only functional when THP is set to madise or always.
> > >
> > > It requires CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION to enable. [1]
> > > This feature is unstable and may evolve in future kernel versions.
> > >
> > > Link: https://lwn.net/ml/all/9bc57721-5287-416c-aa30-46932d605f63@xxxxxxxxxx/ [0]
> > > Link: https://lwn.net/ml/all/dda67ea5-2943-497c-a8e5-d81f0733047d@lucifer.local/ [1]
> > >
> > > Suggested-by: David Hildenbrand <david@xxxxxxxxxx>
> > > Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>
> > > Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx>
> > > ---
> > >  include/linux/huge_mm.h    |  15 +++
> > >  include/linux/khugepaged.h |  12 ++-
> > >  mm/Kconfig                 |  12 +++
> > >  mm/Makefile                |   1 +
> > >  mm/bpf_thp.c               | 186 +++++++++++++++++++++++++++++++++++++
> >
> > Please add new files to MAINTAINERS as you add them.
>
> will do it.
>
> >
> > >  mm/huge_memory.c           |  10 ++
> > >  mm/khugepaged.c            |  26 +++++-
> > >  mm/memory.c                |  18 +++-
> > >  8 files changed, 273 insertions(+), 7 deletions(-)
> > >  create mode 100644 mm/bpf_thp.c
> > >
> > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > > index 1ac0d06fb3c1..f0c91d7bd267 100644
> > > --- a/include/linux/huge_mm.h
> > > +++ b/include/linux/huge_mm.h
> > > @@ -6,6 +6,8 @@
> > >
> > >  #include <linux/fs.h> /* only for vma_is_dax() */
> > >  #include <linux/kobject.h>
> > > +#include <linux/pgtable.h>
> > > +#include <linux/mm.h>
> >
> > Hm this is a bit weird as mm.h includes huge_mm... I guess it will be handled by
> > header defines but still.
>
> Some refactoring is needed for these two header files, but we can
> handle it separately later.
>
> >
> > >
> > >  vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
> > >  int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > > @@ -56,6 +58,7 @@ enum transparent_hugepage_flag {
> > >       TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
> > >       TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
> > >       TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
> > > +     TRANSPARENT_HUGEPAGE_BPF_ATTACHED,      /* BPF prog is attached */
> > >  };
> > >
> > >  struct kobject;
> > > @@ -195,6 +198,18 @@ static inline bool hugepage_global_always(void)
> > >                       (1<<TRANSPARENT_HUGEPAGE_FLAG);
> > >  }
> > >
> > > +#ifdef CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION
> > > +int get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> > > +                     u64 vma_flags, enum tva_type tva_flags, int orders);
> >
> > Not a massive fan of this naming to be honest. I think it should explicitly
> > reference bpf, e.g. bpf_hook_thp_get_order() or something.
>
> will change it to bpf_hook_thp_get_orders().

Thanks!

>
> >
> > Right now this is super unclear as to what it's for.
> >
> > Also wrt vma_flags - this type is wrong :) it's vm_flags_t and going to change
> > to a bitmap of unlimiiteeed size soon. So probs best not to pass around as value
> > type either.
>
> As replied in another thread. I will change it.

Thanks. Will check the other thread.

>
> >
> > But unclear us to purpose as mentioned elsewhere.
> >
> > And also get_suggested_order() should be get_suggested_orderS() no? As you
> > seem later in the code to be referencing a bitfield?
>
> Right, it should be bpf_hook_thp_get_orderS().

Thanks!

>
> >
> > Also will mm ever != vma->vm_mm?
>
> No it can't. It can be guaranteed by the caller.

In this case we don't need to pass mm separately then right?

>
> >
> > Are we hacking this for the sake of overloading what this does?
>
> The @vma is actually unneeded. I will remove it.

Ah OK.

I am still a little concerned about passing around a value reference to the VMA
flags though, esp as this type can + will change in future (not sure what that
means for BPF).

We may go to e.g. a 128 bit bitmap there etc.

>
> >
> > Also if we're returning a bitmask of orders which you seem to be (not sure I
> > like that tbh - I feel like we shoudl simply provide one order but open for
> > disucssion) - shouldn't it return an unsigned long?
>
> We are indifferent to whether a single order or a bitmask is returned,
> as we only use order-0 and order-9. We have no use cases for
> middle-order pages, though this feature might be useful for other
> architectures or for some special use cases.

Well surely we want to potentially specify a mTHP under certain circumstances
no?

In any case I feel it's worth making any bitfield a system word size.

>
> >
> > > +#else
> > > +static inline int
> > > +get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> > > +                 u64 vma_flags, enum tva_type tva_flags, int orders)
> > > +{
> > > +     return orders;
> > > +}
> > > +#endif
> > > +
> > >  static inline int highest_order(unsigned long orders)
> > >  {
> > >       return fls_long(orders) - 1;
> > > diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> > > index eb1946a70cff..d81c1228a21f 100644
> > > --- a/include/linux/khugepaged.h
> > > +++ b/include/linux/khugepaged.h
> > > @@ -4,6 +4,8 @@
> > >
> > >  #include <linux/mm.h>
> > >
> > > +#include <linux/huge_mm.h>
> > > +
> >
> > Hm this is iffy too, There's probably a reason we didn't include this before,
> > the headers can be so so fragile. Let's be cautious...
>
> I will check.

Thanks!

>
> >
> > >  extern unsigned int khugepaged_max_ptes_none __read_mostly;
> > >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > >  extern struct attribute_group khugepaged_attr_group;
> > > @@ -22,7 +24,15 @@ extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> > >
> > >  static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> > >  {
> > > -     if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm))
> > > +     /*
> > > +      * THP allocation policy can be dynamically modified via BPF. Even if a
> > > +      * task was allowed to allocate THPs, BPF can decide whether its forked
> > > +      * child can allocate THPs.
> > > +      *
> > > +      * The MMF_VM_HUGEPAGE flag will be cleared by khugepaged.
> > > +      */
> > > +     if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm) &&
> > > +             get_suggested_order(mm, NULL, 0, -1, BIT(PMD_ORDER)))
> >
> > Hmmm so there seems to be some kind of additional functionality you're providing
> > here kinda quietly, which is to allow the exact same interface to determine
> > whether we kick off khugepaged or not.
> >
> > Don't love that, I think we should be hugely specific about that.
> >
> > This bpf interface should literally be 'ok we're deciding what order we
> > want'. It feels like a bit of a gross overloading?
>
> This makes sense. I have no objection to reverting to returning a single order.

OK but key point here is - we're now determining if a forked child can _not_
allocate THPs using this function.

To me this should be a separate function rather than some _weird_ usage of this
same function.

And generally at this point I think we should just drop this bit of code
honestly.

>
> >
> > >               __khugepaged_enter(mm);
> > >  }
> > >
> > > diff --git a/mm/Kconfig b/mm/Kconfig
> > > index 4108bcd96784..d10089e3f181 100644
> > > --- a/mm/Kconfig
> > > +++ b/mm/Kconfig
> > > @@ -924,6 +924,18 @@ config NO_PAGE_MAPCOUNT
> > >
> > >         EXPERIMENTAL because the impact of some changes is still unclear.
> > >
> > > +config EXPERIMENTAL_BPF_ORDER_SELECTION
> > > +     bool "BPF-based THP order selection (EXPERIMENTAL)"
> > > +     depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL
> > > +
> > > +     help
> > > +       Enable dynamic THP order selection using BPF programs. This
> > > +       experimental feature allows custom BPF logic to determine optimal
> > > +       transparent hugepage allocation sizes at runtime.
> > > +
> > > +       Warning: This feature is unstable and may change in future kernel
> > > +       versions.
> >
> > Thanks! This is important to document. Absolute nitty nit: can you capitalise
> > 'WARNING'? Thanks!
>
> will do it.

Thanks!

>
> >
> > > +
> > >  endif # TRANSPARENT_HUGEPAGE
> > >
> > >  # simple helper to make the code a bit easier to read
> > > diff --git a/mm/Makefile b/mm/Makefile
> > > index ef54aa615d9d..cb55d1509be1 100644
> > > --- a/mm/Makefile
> > > +++ b/mm/Makefile
> > > @@ -99,6 +99,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
> > >  obj-$(CONFIG_NUMA) += memory-tiers.o
> > >  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
> > >  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> > > +obj-$(CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION) += bpf_thp.o
> > >  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> > >  obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
> > >  obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
> > > diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
> > > new file mode 100644
> > > index 000000000000..fbff3b1bb988
> > > --- /dev/null
> > > +++ b/mm/bpf_thp.c
> >
> > As mentioned before, please update MAINTAINERS for new files. I went to great +
> > painful lengths to get everything listed there so let's keep it that way please
> > :P
>
> will do it.

Thanks!

>
> >
> > > @@ -0,0 +1,186 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +
> > > +#include <linux/bpf.h>
> > > +#include <linux/btf.h>
> > > +#include <linux/huge_mm.h>
> > > +#include <linux/khugepaged.h>
> > > +
> > > +struct bpf_thp_ops {
> > > +     /**
> > > +      * @get_suggested_order: Get the suggested THP orders for allocation
> > > +      * @mm: mm_struct associated with the THP allocation
> > > +      * @vma__nullable: vm_area_struct associated with the THP allocation (may be NULL)
> > > +      *                 When NULL, the decision should be based on @mm (i.e., when
> > > +      *                 triggered from an mm-scope hook rather than a VMA-specific
> > > +      *                 context).
> > > +      *                 Must belong to @mm (guaranteed by the caller).
> > > +      * @vma_flags: use these vm_flags instead of @vma->vm_flags (0 if @vma is NULL)
> > > +      * @tva_flags: TVA flags for current @vma (-1 if @vma is NULL)
> > > +      * @orders: Bitmask of requested THP orders for this allocation
> > > +      *          - PMD-mapped allocation if PMD_ORDER is set
> > > +      *          - mTHP allocation otherwise
> > > +      *
> > > +      * Rerurn: Bitmask of suggested THP orders for allocation. The highest
> > > +      *         suggested order will not exceed the highest requested order
> > > +      *         in @orders.
> > > +      */
> > > +     int (*get_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> > > +                                u64 vma_flags, enum tva_type tva_flags, int orders) __rcu;
> >
> > I feel like we should be declaring this function pointer type somewhere else as
> > we're now duplicating this in two places.
>
> agreed, I have already done it to fix the spare warning.

Thanks!

>
> >
> > > +};
> > > +
> > > +static struct bpf_thp_ops bpf_thp;
> > > +static DEFINE_SPINLOCK(thp_ops_lock);
> > > +
> > > +int get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> > > +                     u64 vma_flags, enum tva_type tva_flags, int orders)
> >
> > surely tva_flag? As this is an enum value?
>
> will change it to tva_type instead.

Thanks!

>
> >
> > > +{
> > > +     int (*bpf_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> > > +                                u64 vma_flags, enum tva_type tva_flags, int orders);
> >
> > This type for vma flags is totally incorrect. vm_flags_t. And that's going to
> > change soon to an opaque type.
> >
> > Also right now it's actually an unsigned long.
> >
> > I really really do not like that we're providing extra, unexplained VMA flags
> > for some reason. I may be missing something :) so happy to hear why this is
> > necessary.
> >
> > However in future we really shouldn't be passing something like this.
>
> will change it as replied in another thread.

Thanks!

>
> >
> > Also - now a third duplication of the same function pointer :) can we do better
> > than this? At least typedef it.
> >
> > > +     int suggested_orders = orders;
> > > +
> > > +     /* No BPF program is attached */
> > > +     if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> > > +                   &transparent_hugepage_flags))
> > > +             return suggested_orders;
> >
> > This is atomic ofc, but are we concerned about races, or I guess you expect only
> > the first attached bpf program to work with it I suppose.
>
> It is against the race to unreg or update.

OK cool, it does make sense overall.

>
> >
> > > +
> > > +     rcu_read_lock();
> >
> > Is this sufficient? Anything stopping the mm or VMA going away here?
>
> This RCU lock is not for protecting the mm or VMA structures
> themselves, but for protecting the update of the function pointer.
> Arbitrary access to pointers within the mm_struct or vm_area_struct is
> prohibited, as they are guarded by the BPF verifier.
>
> >
> > > +     bpf_suggested_order = rcu_dereference(bpf_thp.get_suggested_order);
> > > +     if (!bpf_suggested_order)
> > > +             goto out;
> > > +
> > > +     suggested_orders = bpf_suggested_order(mm, vma__nullable, vma_flags, tva_flags, orders);
> >
> > OK so now it's suggested order_S but we're invoking suggested order :) whaaatt?
> > :)
>
> will change it.

Thanks!

>
> >
> > > +     if (highest_order(suggested_orders) > highest_order(orders))
> > > +             suggested_orders = orders;
> >
> > Hmmm so the semantics are - whichever is the highest order wins?
>
> The maximum requested order is determined by the callsite. For example:
> - PMD-mapped THP uses PMD_ORDER
> - mTHP uses (PMD_ORDER - 1)
>
> We must respect this upper bound to avoid undefined behavior. So the
> highest suggested order can't exceed the highest requested order.

OK, please document this in a comment here.

>
> >
> > I thought the idea was we'd hand control over to bpf if provided in effect?
> >
> > Definitely worth going over these semantics in the cover letter (and do forgive
> > me if you have and I've missed! :)
>
> It has already in the cover letter:
>
>  * Return: Bitmask of suggested THP orders for allocation. The highest
>  *         suggested order will not exceed the highest requested order
>  *         in @orders.

OK cool thanks, a comment here would be useful also.

>
>
> >
> > > +
> > > +out:
> > > +     rcu_read_unlock();
> > > +     return suggested_orders;
> > > +}
> > > +
> > > +static bool bpf_thp_ops_is_valid_access(int off, int size,
> > > +                                     enum bpf_access_type type,
> > > +                                     const struct bpf_prog *prog,
> > > +                                     struct bpf_insn_access_aux *info)
> > > +{
> > > +     return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
> > > +}
> > > +
> > > +static const struct bpf_func_proto *
> > > +bpf_thp_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> > > +{
> > > +     return bpf_base_func_proto(func_id, prog);
> > > +}
> > > +
> > > +static const struct bpf_verifier_ops thp_bpf_verifier_ops = {
> > > +     .get_func_proto = bpf_thp_get_func_proto,
> > > +     .is_valid_access = bpf_thp_ops_is_valid_access,
> > > +};
> > > +
> > > +static int bpf_thp_init(struct btf *btf)
> > > +{
> > > +     return 0;
> > > +}
> > > +
> > > +static int bpf_thp_init_member(const struct btf_type *t,
> > > +                            const struct btf_member *member,
> > > +                            void *kdata, const void *udata)
> > > +{
> > > +     return 0;
> > > +}
> > > +
> > > +static int bpf_thp_reg(void *kdata, struct bpf_link *link)
> > > +{
> > > +     struct bpf_thp_ops *ops = kdata;
> > > +
> > > +     spin_lock(&thp_ops_lock);
> > > +     if (test_and_set_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> > > +                          &transparent_hugepage_flags)) {
> > > +             spin_unlock(&thp_ops_lock);
> > > +             return -EBUSY;
> > > +     }
> > > +     WARN_ON_ONCE(rcu_access_pointer(bpf_thp.get_suggested_order));
> > > +     rcu_assign_pointer(bpf_thp.get_suggested_order, ops->get_suggested_order);
> > > +     spin_unlock(&thp_ops_lock);
> > > +     return 0;
> > > +}
> > > +
> > > +static void bpf_thp_unreg(void *kdata, struct bpf_link *link)
> > > +{
> > > +     spin_lock(&thp_ops_lock);
> > > +     clear_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags);
> > > +     WARN_ON_ONCE(!rcu_access_pointer(bpf_thp.get_suggested_order));
> > > +     rcu_replace_pointer(bpf_thp.get_suggested_order, NULL, lockdep_is_held(&thp_ops_lock));
> > > +     spin_unlock(&thp_ops_lock);
> > > +
> > > +     synchronize_rcu();
> > > +}
> >
> > I am a total beginner with BPF implementations so don't feel like I can say much
> > intelligent about the above. But presumably fairly standard fare BPF-wise?
>
> This implementation is necessary to support BPF program updates.

Ack.

>
> >
> > Will perhaps try to dig deeper on another iteration :) as intersting to me.
> >
> > > +
> > > +static int bpf_thp_update(void *kdata, void *old_kdata, struct bpf_link *link)
> > > +{
> > > +     struct bpf_thp_ops *ops = kdata;
> > > +     struct bpf_thp_ops *old = old_kdata;
> > > +     int ret = 0;
> > > +
> > > +     if (!ops || !old)
> > > +             return -EINVAL;
> > > +
> > > +     spin_lock(&thp_ops_lock);
> > > +     /* The prog has aleady been removed. */
> > > +     if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags)) {
> > > +             ret = -ENOENT;
> > > +             goto out;
> > > +     }
> >
> > OK so we gate things on this flag and it's global, got it.
> >
> > I see this is a hook, and I guess RCU-all-the-things is what BPF does which
> > makes tonnes of sense.
> >
> > > +     WARN_ON_ONCE(!rcu_access_pointer(bpf_thp.get_suggested_order));
> > > +     rcu_replace_pointer(bpf_thp.get_suggested_order, ops->get_suggested_order,
> > > +                         lockdep_is_held(&thp_ops_lock));
> > > +
> > > +out:
> > > +     spin_unlock(&thp_ops_lock);
> > > +     if (!ret)
> > > +             synchronize_rcu();
> > > +     return ret;
> > > +}
> > > +
> > > +static int bpf_thp_validate(void *kdata)
> > > +{
> > > +     struct bpf_thp_ops *ops = kdata;
> > > +
> > > +     if (!ops->get_suggested_order) {
> > > +             pr_err("bpf_thp: required ops isn't implemented\n");
> > > +             return -EINVAL;
> > > +     }
> > > +     return 0;
> > > +}
> > > +
> > > +static int suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> > > +                        u64 vma_flags, enum tva_type vm_flags, int orders)
> > > +{
> > > +     return orders;
> > > +}
> > > +
> > > +static struct bpf_thp_ops __bpf_thp_ops = {
> > > +     .get_suggested_order = suggested_order,
> > > +};
> >
> > Can you explain to me what this stub stuff is for? This is more 'BPF impl 101'
> > stuff sorry :)
>
> It is a CFI stub. cfi_stubs in BPF struct_ops are secure intermediary
> functions that prevent the kernel from making direct, unsafe jumps to
> BPF code. A new attached BPF program will run via this stub.

Ack.

>
> >
> > > +
> > > +static struct bpf_struct_ops bpf_bpf_thp_ops = {
> > > +     .verifier_ops = &thp_bpf_verifier_ops,
> > > +     .init = bpf_thp_init,
> > > +     .init_member = bpf_thp_init_member,
> > > +     .reg = bpf_thp_reg,
> > > +     .unreg = bpf_thp_unreg,
> > > +     .update = bpf_thp_update,
> > > +     .validate = bpf_thp_validate,
> > > +     .cfi_stubs = &__bpf_thp_ops,
> > > +     .owner = THIS_MODULE,
> > > +     .name = "bpf_thp_ops",
> > > +};
> > > +
> > > +static int __init bpf_thp_ops_init(void)
> > > +{
> > > +     int err = register_bpf_struct_ops(&bpf_bpf_thp_ops, bpf_thp_ops);
> > > +
> > > +     if (err)
> > > +             pr_err("bpf_thp: Failed to register struct_ops (%d)\n", err);
> > > +     return err;
> > > +}
> > > +late_initcall(bpf_thp_ops_init);
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index d89992b65acc..bd8f8f34ab3c 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -1349,6 +1349,16 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> > >               return ret;
> > >       khugepaged_enter_vma(vma, vma->vm_flags);
> > >
> > > +     /*
> > > +      * This check must occur after khugepaged_enter_vma() because:
> > > +      * 1. We may permit THP allocation via khugepaged
> > > +      * 2. While simultaneously disallowing THP allocation
> > > +      *    during page fault handling
> > > +      */
> > > +     if (get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_PAGEFAULT, BIT(PMD_ORDER)) !=
> > > +                             BIT(PMD_ORDER))
> >
> > Hmmm so you return a bitmask of orders, but then you only allow this fault if
> > the only order provided is PMD order? That seems strange. Can you explain?
>
> This is in the do_huge_pmd_anonymous_page() that can only accept a PMD
> order, otherwise it might result in unexpected behavior.

OK please document this in the comment.

>
> >
> > > +             return VM_FAULT_FALLBACK;
> >
> > It'd be good to have a helper function for this like:
> >
> >         if (!bpf_hook_allow_pmd_order(vma, tva_flag))
> >                 return VM_FAULT_FALLBACK;
> >
> > And implemented like maybe:
> >
> > static bool bpf_hook_allow_pmd_order(struct vm_area_struct *vma, enum tva_type tva_flag)
> > {
> >         int orders = get_suggested_order(vma->vm_mm, vma, vma->vm_flags, tva_flag,
> >                         BIT(PMD_ORDER));
> >
> >         return orders & BIT(PMD_ORDER);
> > }
> >
> > It's good the tva flag gives context though.
>
> Thanks for the suggestion.
> will change it.

Thanks!

>
> >
> > > +
> > >       if (!(vmf->flags & FAULT_FLAG_WRITE) &&
> > >                       !mm_forbids_zeropage(vma->vm_mm) &&
> > >                       transparent_hugepage_use_zero_page()) {
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>
> > > index d3d4f116e14b..935583626db6 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -474,7 +474,9 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
> > >  {
> > >       if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
> > >           hugepage_pmd_enabled()) {
> > > -             if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
> > > +             if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER) &&
> > > +                 get_suggested_order(vma->vm_mm, vma, vm_flags, TVA_KHUGEPAGED,
> > > +                                     BIT(PMD_ORDER)))
> >
> > I don't know why we aren't working the bpf hook into thp_vma_allowable_order()?
>
> Actually it can be added into thp_vma_allowable_order().  I will change it.

Thanks!

>
> >
> > Also a helper would work here.
> >
> > >                       __khugepaged_enter(vma->vm_mm);
> > >       }
> > >  }
> > > @@ -934,6 +936,8 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > >               return SCAN_ADDRESS_RANGE;
> > >       if (!thp_vma_allowable_order(vma, vma->vm_flags, type, PMD_ORDER))
> > >               return SCAN_VMA_CHECK;
> > > +     if (!get_suggested_order(vma->vm_mm, vma, vma->vm_flags, type, BIT(PMD_ORDER)))
> > > +             return SCAN_VMA_CHECK;
> >
> >
> >
> > >       /*
> > >        * Anon VMA expected, the address may be unmapped then
> > >        * remapped to file after khugepaged reaquired the mmap_lock.
> > > @@ -1465,6 +1469,11 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
> > >               /* khugepaged_mm_lock actually not necessary for the below */
> > >               mm_slot_free(mm_slot_cache, mm_slot);
> > >               mmdrop(mm);
> > > +     } else if (!get_suggested_order(mm, NULL, 0, -1, BIT(PMD_ORDER))) {
> > > +             hash_del(&slot->hash);
> > > +             list_del(&slot->mm_node);
> > > +             mm_flags_clear(MMF_VM_HUGEPAGE, mm);
> > > +             mm_slot_free(mm_slot_cache, mm_slot);
> > >       }
> > >  }
> > >
> > > @@ -1538,6 +1547,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> > >       if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
> > >               return SCAN_VMA_CHECK;
> > >
> > > +     if (!get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_FORCED_COLLAPSE,
> > > +                              BIT(PMD_ORDER)))
> >
> > Again, can we please not duplicate thp_vma_allowable_order() logic?
> >
> > The THP code is horrible enough, but now we have to remember to also do the bpf
> > check?
>
> makes sense.
>
> >
> > > +             return SCAN_VMA_CHECK;
> > >       /* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
> > >       if (userfaultfd_wp(vma))
> > >               return SCAN_PTE_UFFD_WP;
> > > @@ -2416,6 +2428,10 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> > >        * the next mm on the list.
> > >        */
> > >       vma = NULL;
> > > +
> > > +     /* If this mm is not suitable for the scan list, we should remove it. */
> > > +     if (!get_suggested_order(mm, NULL, 0, -1, BIT(PMD_ORDER)))
> > > +             goto breakouterloop_mmap_lock;
> >
> > OK again I'm really not loving this NULL, 0, -1 stuff. What is this supposed to
> > mean? The idea here is we have a hook for 'trying to determine THP order' and
> > now it's overloaded it seems in multiple ways?
> >
> > I may be missing context here.
> >
> > I'm also a bit perplexed by the comment as to what is intended here.
>
> Using a BPF-based approach for THP adjustment allows us to dynamically
> enable or disable THP for running applications without causing any
> disruption. This capability is particularly valuable in production
> environments. The logic here is designed to achieve exactly that.
>
>
> >
> > >       if (unlikely(!mmap_read_trylock(mm)))
> > >               goto breakouterloop_mmap_lock;
> > >
> > > @@ -2432,7 +2448,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> > >                       progress++;
> > >                       break;
> > >               }
> > > -             if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
> > > +             if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER) ||
> > > +                 !get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_KHUGEPAGED,
> > > +                                      BIT(PMD_ORDER))) {
> >
> > Same various comments from above.
>
> will change it.
>
> >
> > >  skip:
> > >                       progress++;
> > >                       continue;
> > > @@ -2769,6 +2787,10 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> > >       if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
> > >               return -EINVAL;
> > >
> > > +     if (!get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_FORCED_COLLAPSE,
> > > +                              BIT(PMD_ORDER)))
> > > +             return -EINVAL;
> > > +
> >
> > Same various comments from above.
>
> will change it.
>
> >
> > >       cc = kmalloc(sizeof(*cc), GFP_KERNEL);
> > >       if (!cc)
> > >               return -ENOMEM;
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index d9de6c056179..0178857aa058 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -4486,6 +4486,7 @@ static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset,
> > >  static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> > >  {
> > >       struct vm_area_struct *vma = vmf->vma;
> > > +     int order, suggested_orders;
> > >       unsigned long orders;
> > >       struct folio *folio;
> > >       unsigned long addr;
> > > @@ -4493,7 +4494,6 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> > >       spinlock_t *ptl;
> > >       pte_t *pte;
> > >       gfp_t gfp;
> > > -     int order;
> > >
> > >       /*
> > >        * If uffd is active for the vma we need per-page fault fidelity to
> > > @@ -4510,13 +4510,18 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> > >       if (!zswap_never_enabled())
> > >               goto fallback;
> > >
> > > +     suggested_orders = get_suggested_order(vma->vm_mm, vma, vma->vm_flags,
> > > +                                            TVA_PAGEFAULT,
> > > +                                            BIT(PMD_ORDER) - 1);
> > > +     if (!suggested_orders)
> > > +             goto fallback;
> >

(Thanks for all above! :)

> > Wait, but below we have a bunch of fallbacks, now BPF overrides everything?
>
> When allocating high-order pages is not feasible, such as during
> periods of high memory pressure, the system should immediately fall
> back to using 4 KB pages.

OK makes sense.

>
> >
> > I know I'm repaeting myself :P but can we just please put this into
> > thp_vma_allowable_orders(), it's massively gross to just duplicate this check
> > _everywhere_ with subtle differences.
>
> will change it.

Thanks

>
> >
> > >       entry = pte_to_swp_entry(vmf->orig_pte);
> > >       /*
> > >        * Get a list of all the (large) orders below PMD_ORDER that are enabled
> > >        * and suitable for swapping THP.
> > >        */
> > >       orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT,
> > > -                                       BIT(PMD_ORDER) - 1);
> > > +                                       suggested_orders);
> > >       orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> > >       orders = thp_swap_suitable_orders(swp_offset(entry),
> > >                                         vmf->address, orders);
> > > @@ -5044,12 +5049,12 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> > >  {
> > >       struct vm_area_struct *vma = vmf->vma;
> > >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > +     int order, suggested_orders;
> > >       unsigned long orders;
> > >       struct folio *folio;
> > >       unsigned long addr;
> > >       pte_t *pte;
> > >       gfp_t gfp;
> > > -     int order;
> > >
> > >       /*
> > >        * If uffd is active for the vma we need per-page fault fidelity to
> > > @@ -5058,13 +5063,18 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> > >       if (unlikely(userfaultfd_armed(vma)))
> > >               goto fallback;
> > >
> > > +     suggested_orders = get_suggested_order(vma->vm_mm, vma, vma->vm_flags,
> > > +                                            TVA_PAGEFAULT,
> > > +                                            BIT(PMD_ORDER) - 1);
> > > +     if (!suggested_orders)
> > > +             goto fallback;
> >
> > Same comment as above.
>
> will change it.

Thanks!

>
>
> Thanks a lot for your comments.

No problem, thanks for the series!

I am generally excited about exploring this, so once we figure out details be
good to see where this can go!

>
>
> --
> Regards
>
> Yafang

Cheers, Lorenzo