On Thu, Aug 28, 2025 at 01:54:39PM +0800, Yafang Shao wrote: > On Wed, Aug 27, 2025 at 11:03 PM Lorenzo Stoakes > <lorenzo.stoakes@xxxxxxxxxx> wrote: > > > > On Tue, Aug 26, 2025 at 03:19:39PM +0800, Yafang Shao wrote: > > > This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic > > > THP tuning. It includes a hook get_suggested_order() [0], allowing BPF > > > programs to influence THP order selection based on factors such as: > > > - Workload identity > > > For example, workloads running in specific containers or cgroups. > > > - Allocation context > > > Whether the allocation occurs during a page fault, khugepaged, or other > > > paths. > > > - System memory pressure > > > (May require new BPF helpers to accurately assess memory pressure.) > > > > > > Key Details: > > > - Only one BPF program can be attached at a time, but it can be updated > > > dynamically to adjust the policy. > > > - Supports automatic mTHP order selection and per-workload THP policies. > > > - Only functional when THP is set to madvise or always. > > > > > > It requires CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION to enable. [1] > > > This feature is unstable and may evolve in future kernel versions. > > > > > > Link: https://lwn.net/ml/all/9bc57721-5287-416c-aa30-46932d605f63@xxxxxxxxxx/ [0] > > > Link: https://lwn.net/ml/all/dda67ea5-2943-497c-a8e5-d81f0733047d@lucifer.local/ [1] > > > > > > Suggested-by: David Hildenbrand <david@xxxxxxxxxx> > > > Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> > > > Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx> > > > --- > > > include/linux/huge_mm.h | 15 +++ > > > include/linux/khugepaged.h | 12 ++- > > > mm/Kconfig | 12 +++ > > > mm/Makefile | 1 + > > > mm/bpf_thp.c | 186 +++++++++++++++++++++++++++++++++++++ > > > > Please add new files to MAINTAINERS as you add them. > > will do it. 
> > > > > > mm/huge_memory.c | 10 ++ > > > mm/khugepaged.c | 26 +++++- > > > mm/memory.c | 18 +++- > > > 8 files changed, 273 insertions(+), 7 deletions(-) > > > create mode 100644 mm/bpf_thp.c > > > > > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > > > index 1ac0d06fb3c1..f0c91d7bd267 100644 > > > --- a/include/linux/huge_mm.h > > > +++ b/include/linux/huge_mm.h > > > @@ -6,6 +6,8 @@ > > > > > > #include <linux/fs.h> /* only for vma_is_dax() */ > > > #include <linux/kobject.h> > > > +#include <linux/pgtable.h> > > > +#include <linux/mm.h> > > > > Hm this is a bit weird as mm.h includes huge_mm... I guess it will be handled by > > header defines but still. > > Some refactoring is needed for these two header files, but we can > handle it separately later. > > > > > > > > > vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf); > > > int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, > > > @@ -56,6 +58,7 @@ enum transparent_hugepage_flag { > > > TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, > > > TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG, > > > TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG, > > > + TRANSPARENT_HUGEPAGE_BPF_ATTACHED, /* BPF prog is attached */ > > > }; > > > > > > struct kobject; > > > @@ -195,6 +198,18 @@ static inline bool hugepage_global_always(void) > > > (1<<TRANSPARENT_HUGEPAGE_FLAG); > > > } > > > > > > +#ifdef CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION > > > +int get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable, > > > + u64 vma_flags, enum tva_type tva_flags, int orders); > > > > Not a massive fan of this naming to be honest. I think it should explicitly > > reference bpf, e.g. bpf_hook_thp_get_order() or something. > > will change it to bpf_hook_thp_get_orders(). Thanks! > > > > > Right now this is super unclear as to what it's for. > > > > Also wrt vma_flags - this type is wrong :) it's vm_flags_t and going to change > > to a bitmap of unlimiiteeed size soon. 
So probs best not to pass around as value > > type either. > > As replied in another thread. I will change it. Thanks. Will check the other thread. > > > > > But unclear as to purpose as mentioned elsewhere. > > > > And also get_suggested_order() should be get_suggested_orderS() no? As you > > seem later in the code to be referencing a bitfield? > > Right, it should be bpf_hook_thp_get_orderS(). Thanks! > > > > > Also will mm ever != vma->vm_mm? > > No it can't. It can be guaranteed by the caller. In this case we don't need to pass mm separately then right? > > > > > Are we hacking this for the sake of overloading what this does? > > The @vma is actually unneeded. I will remove it. Ah OK. I am still a little concerned about passing around a value reference to the VMA flags though, esp as this type can + will change in future (not sure what that means for BPF). We may go to e.g. a 128 bit bitmap there etc. > > > > > Also if we're returning a bitmask of orders which you seem to be (not sure I > > like that tbh - I feel like we should simply provide one order but open for > > discussion) - shouldn't it return an unsigned long? > > We are indifferent to whether a single order or a bitmask is returned, > as we only use order-0 and order-9. We have no use cases for > middle-order pages, though this feature might be useful for other > architectures or for some special use cases. Well surely we want to potentially specify a mTHP under certain circumstances no? In any case I feel it's worth making any bitfield a system word size. 
> > > > > > +#else > > > +static inline int > > > +get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable, > > > + u64 vma_flags, enum tva_type tva_flags, int orders) > > > +{ > > > + return orders; > > > +} > > > +#endif > > > + > > > static inline int highest_order(unsigned long orders) > > > { > > > return fls_long(orders) - 1; > > > diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h > > > index eb1946a70cff..d81c1228a21f 100644 > > > --- a/include/linux/khugepaged.h > > > +++ b/include/linux/khugepaged.h > > > @@ -4,6 +4,8 @@ > > > > > > #include <linux/mm.h> > > > > > > +#include <linux/huge_mm.h> > > > + > > > > Hm this is iffy too. There's probably a reason we didn't include this before, > > the headers can be so so fragile. Let's be cautious... > > I will check. Thanks! > > > > > > extern unsigned int khugepaged_max_ptes_none __read_mostly; > > > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > > > extern struct attribute_group khugepaged_attr_group; > > > @@ -22,7 +24,15 @@ extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, > > > > > > static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm) > > > { > > > - if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm)) > > > + /* > > > + * THP allocation policy can be dynamically modified via BPF. Even if a > > > + * task was allowed to allocate THPs, BPF can decide whether its forked > > > + * child can allocate THPs. > > > + * > > > + * The MMF_VM_HUGEPAGE flag will be cleared by khugepaged. > > > + */ > > > + if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm) && > > > + get_suggested_order(mm, NULL, 0, -1, BIT(PMD_ORDER))) > > > > Hmmm so there seems to be some kind of additional functionality you're providing > > here kinda quietly, which is to allow the exact same interface to determine > > whether we kick off khugepaged or not. > > > > Don't love that, I think we should be hugely specific about that. 
> > > > This bpf interface should literally be 'ok we're deciding what order we > > want'. It feels like a bit of a gross overloading? > > This makes sense. I have no objection to reverting to returning a single order. OK but key point here is - we're now determining if a forked child can _not_ allocate THPs using this function. To me this should be a separate function rather than some _weird_ usage of this same function. And generally at this point I think we should just drop this bit of code honestly. > > > > > > __khugepaged_enter(mm); > > > } > > > > > > diff --git a/mm/Kconfig b/mm/Kconfig > > > index 4108bcd96784..d10089e3f181 100644 > > > --- a/mm/Kconfig > > > +++ b/mm/Kconfig > > > @@ -924,6 +924,18 @@ config NO_PAGE_MAPCOUNT > > > > > > EXPERIMENTAL because the impact of some changes is still unclear. > > > > > > +config EXPERIMENTAL_BPF_ORDER_SELECTION > > > + bool "BPF-based THP order selection (EXPERIMENTAL)" > > > + depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL > > > + > > > + help > > > + Enable dynamic THP order selection using BPF programs. This > > > + experimental feature allows custom BPF logic to determine optimal > > > + transparent hugepage allocation sizes at runtime. > > > + > > > + Warning: This feature is unstable and may change in future kernel > > > + versions. > > > > Thanks! This is important to document. Absolute nitty nit: can you capitalise > > 'WARNING'? Thanks! > > will do it. Thanks! 
> > > > > > + > > > endif # TRANSPARENT_HUGEPAGE > > > > > > # simple helper to make the code a bit easier to read > > > diff --git a/mm/Makefile b/mm/Makefile > > > index ef54aa615d9d..cb55d1509be1 100644 > > > --- a/mm/Makefile > > > +++ b/mm/Makefile > > > @@ -99,6 +99,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o > > > obj-$(CONFIG_NUMA) += memory-tiers.o > > > obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o > > > obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o > > > +obj-$(CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION) += bpf_thp.o > > > obj-$(CONFIG_PAGE_COUNTER) += page_counter.o > > > obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o > > > obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o > > > diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c > > > new file mode 100644 > > > index 000000000000..fbff3b1bb988 > > > --- /dev/null > > > +++ b/mm/bpf_thp.c > > > > As mentioned before, please update MAINTAINERS for new files. I went to great + > > painful lengths to get everything listed there so let's keep it that way please > > :P > > will do it. Thanks! > > > > > > @@ -0,0 +1,186 @@ > > > +// SPDX-License-Identifier: GPL-2.0 > > > + > > > +#include <linux/bpf.h> > > > +#include <linux/btf.h> > > > +#include <linux/huge_mm.h> > > > +#include <linux/khugepaged.h> > > > + > > > +struct bpf_thp_ops { > > > + /** > > > + * @get_suggested_order: Get the suggested THP orders for allocation > > > + * @mm: mm_struct associated with the THP allocation > > > + * @vma__nullable: vm_area_struct associated with the THP allocation (may be NULL) > > > + * When NULL, the decision should be based on @mm (i.e., when > > > + * triggered from an mm-scope hook rather than a VMA-specific > > > + * context). > > > + * Must belong to @mm (guaranteed by the caller). 
> > > + * @vma_flags: use these vm_flags instead of @vma->vm_flags (0 if @vma is NULL) > > > + * @tva_flags: TVA flags for current @vma (-1 if @vma is NULL) > > > + * @orders: Bitmask of requested THP orders for this allocation > > > + * - PMD-mapped allocation if PMD_ORDER is set > > > + * - mTHP allocation otherwise > > > + * > > > + * Return: Bitmask of suggested THP orders for allocation. The highest > > > + * suggested order will not exceed the highest requested order > > > + * in @orders. > > > + */ > > > + int (*get_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable, > > > + u64 vma_flags, enum tva_type tva_flags, int orders) __rcu; > > > > I feel like we should be declaring this function pointer type somewhere else as > > we're now duplicating this in two places. > > agreed, I have already done it to fix the sparse warning. Thanks! > > > > > > +}; > > > + > > > +static struct bpf_thp_ops bpf_thp; > > > +static DEFINE_SPINLOCK(thp_ops_lock); > > > + > > > +int get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable, > > > + u64 vma_flags, enum tva_type tva_flags, int orders) > > > > surely tva_flag? As this is an enum value? > > will change it to tva_type instead. Thanks! > > > > > > +{ > > > + int (*bpf_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable, > > > + u64 vma_flags, enum tva_type tva_flags, int orders); > > > > This type for vma flags is totally incorrect. vm_flags_t. And that's going to > > change soon to an opaque type. > > > > Also right now it's actually an unsigned long. > > > > I really really do not like that we're providing extra, unexplained VMA flags > > for some reason. I may be missing something :) so happy to hear why this is > > necessary. > > > > However in future we really shouldn't be passing something like this. > > will change it as replied in another thread. Thanks! 
> > > > > Also - now a third duplication of the same function pointer :) can we do better > > than this? At least typedef it. > > > > > + int suggested_orders = orders; > > > + > > > + /* No BPF program is attached */ > > > + if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, > > > + &transparent_hugepage_flags)) > > > + return suggested_orders; > > > > This is atomic ofc, but are we concerned about races, or I guess you expect only > > the first attached bpf program to work with it I suppose. > > It is against the race to unreg or update. OK cool, it does make sense overall. > > > > > > + > > > + rcu_read_lock(); > > > > Is this sufficient? Anything stopping the mm or VMA going away here? > > This RCU lock is not for protecting the mm or VMA structures > themselves, but for protecting the update of the function pointer. > Arbitrary access to pointers within the mm_struct or vm_area_struct is > prohibited, as they are guarded by the BPF verifier. > > > > > > + bpf_suggested_order = rcu_dereference(bpf_thp.get_suggested_order); > > > + if (!bpf_suggested_order) > > > + goto out; > > > + > > > + suggested_orders = bpf_suggested_order(mm, vma__nullable, vma_flags, tva_flags, orders); > > > > OK so now it's suggested order_S but we're invoking suggested order :) whaaatt? > > :) > > will change it. Thanks! > > > > > > + if (highest_order(suggested_orders) > highest_order(orders)) > > > + suggested_orders = orders; > > > > Hmmm so the semantics are - whichever is the highest order wins? > > The maximum requested order is determined by the callsite. For example: > - PMD-mapped THP uses PMD_ORDER > - mTHP uses (PMD_ORDER - 1) > > We must respect this upper bound to avoid undefined behavior. So the > highest suggested order can't exceed the highest requested order. OK, please document this in a comment here. > > > > > I thought the idea was we'd hand control over to bpf if provided in effect? 
> > > > Definitely worth going over these semantics in the cover letter (and do forgive > > me if you have and I've missed! :) > > It is already in the cover letter: > > * Return: Bitmask of suggested THP orders for allocation. The highest > * suggested order will not exceed the highest requested order > * in @orders. OK cool thanks, a comment here would be useful also. > > > > > > > + > > > +out: > > > + rcu_read_unlock(); > > > + return suggested_orders; > > > +} > > > + > > > +static bool bpf_thp_ops_is_valid_access(int off, int size, > > > + enum bpf_access_type type, > > > + const struct bpf_prog *prog, > > > + struct bpf_insn_access_aux *info) > > > +{ > > > + return bpf_tracing_btf_ctx_access(off, size, type, prog, info); > > > +} > > > + > > > +static const struct bpf_func_proto * > > > +bpf_thp_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) > > > +{ > > > + return bpf_base_func_proto(func_id, prog); > > > +} > > > + > > > +static const struct bpf_verifier_ops thp_bpf_verifier_ops = { > > > + .get_func_proto = bpf_thp_get_func_proto, > > > + .is_valid_access = bpf_thp_ops_is_valid_access, > > > +}; > > > + > > > +static int bpf_thp_init(struct btf *btf) > > > +{ > > > + return 0; > > > +} > > > + > > > +static int bpf_thp_init_member(const struct btf_type *t, > > > + const struct btf_member *member, > > > + void *kdata, const void *udata) > > > +{ > > > + return 0; > > > +} > > > + > > > +static int bpf_thp_reg(void *kdata, struct bpf_link *link) > > > +{ > > > + struct bpf_thp_ops *ops = kdata; > > > + > > > + spin_lock(&thp_ops_lock); > > > + if (test_and_set_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, > > > + &transparent_hugepage_flags)) { > > > + spin_unlock(&thp_ops_lock); > > > + return -EBUSY; > > > + } > > > + WARN_ON_ONCE(rcu_access_pointer(bpf_thp.get_suggested_order)); > > > + rcu_assign_pointer(bpf_thp.get_suggested_order, ops->get_suggested_order); > > > + spin_unlock(&thp_ops_lock); > > > + return 0; > > > +} > > > + > > > 
+static void bpf_thp_unreg(void *kdata, struct bpf_link *link) > > > +{ > > > + spin_lock(&thp_ops_lock); > > > + clear_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags); > > > + WARN_ON_ONCE(!rcu_access_pointer(bpf_thp.get_suggested_order)); > > > + rcu_replace_pointer(bpf_thp.get_suggested_order, NULL, lockdep_is_held(&thp_ops_lock)); > > > + spin_unlock(&thp_ops_lock); > > > + > > > + synchronize_rcu(); > > > +} > > > > I am a total beginner with BPF implementations so don't feel like I can say much > > intelligent about the above. But presumably fairly standard fare BPF-wise? > > This implementation is necessary to support BPF program updates. Ack. > > > > > Will perhaps try to dig deeper on another iteration :) as interesting to me. > > > > > + > > > +static int bpf_thp_update(void *kdata, void *old_kdata, struct bpf_link *link) > > > +{ > > > + struct bpf_thp_ops *ops = kdata; > > > + struct bpf_thp_ops *old = old_kdata; > > > + int ret = 0; > > > + > > > + if (!ops || !old) > > > + return -EINVAL; > > > + > > > + spin_lock(&thp_ops_lock); > > > + /* The prog has already been removed. */ > > > + if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags)) { > > > + ret = -ENOENT; > > > + goto out; > > > + } > > > > OK so we gate things on this flag and it's global, got it. > > > > I see this is a hook, and I guess RCU-all-the-things is what BPF does which > > makes tonnes of sense. 
> > > > > + WARN_ON_ONCE(!rcu_access_pointer(bpf_thp.get_suggested_order)); > > > + rcu_replace_pointer(bpf_thp.get_suggested_order, ops->get_suggested_order, > > > + lockdep_is_held(&thp_ops_lock)); > > > + > > > +out: > > > + spin_unlock(&thp_ops_lock); > > > + if (!ret) > > > + synchronize_rcu(); > > > + return ret; > > > +} > > > + > > > +static int bpf_thp_validate(void *kdata) > > > +{ > > > + struct bpf_thp_ops *ops = kdata; > > > + > > > + if (!ops->get_suggested_order) { > > > + pr_err("bpf_thp: required ops isn't implemented\n"); > > > + return -EINVAL; > > > + } > > > + return 0; > > > +} > > > + > > > +static int suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable, > > > + u64 vma_flags, enum tva_type vm_flags, int orders) > > > +{ > > > + return orders; > > > +} > > > + > > > +static struct bpf_thp_ops __bpf_thp_ops = { > > > + .get_suggested_order = suggested_order, > > > +}; > > > > Can you explain to me what this stub stuff is for? This is more 'BPF impl 101' > > stuff sorry :) > > It is a CFI stub. cfi_stubs in BPF struct_ops are secure intermediary > functions that prevent the kernel from making direct, unsafe jumps to > BPF code. A new attached BPF program will run via this stub. Ack. 
> > > > > > + > > > +static struct bpf_struct_ops bpf_bpf_thp_ops = { > > > + .verifier_ops = &thp_bpf_verifier_ops, > > > + .init = bpf_thp_init, > > > + .init_member = bpf_thp_init_member, > > > + .reg = bpf_thp_reg, > > > + .unreg = bpf_thp_unreg, > > > + .update = bpf_thp_update, > > > + .validate = bpf_thp_validate, > > > + .cfi_stubs = &__bpf_thp_ops, > > > + .owner = THIS_MODULE, > > > + .name = "bpf_thp_ops", > > > +}; > > > + > > > +static int __init bpf_thp_ops_init(void) > > > +{ > > > + int err = register_bpf_struct_ops(&bpf_bpf_thp_ops, bpf_thp_ops); > > > + > > > + if (err) > > > + pr_err("bpf_thp: Failed to register struct_ops (%d)\n", err); > > > + return err; > > > +} > > > +late_initcall(bpf_thp_ops_init); > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > > index d89992b65acc..bd8f8f34ab3c 100644 > > > --- a/mm/huge_memory.c > > > +++ b/mm/huge_memory.c > > > @@ -1349,6 +1349,16 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf) > > > return ret; > > > khugepaged_enter_vma(vma, vma->vm_flags); > > > > > > + /* > > > + * This check must occur after khugepaged_enter_vma() because: > > > + * 1. We may permit THP allocation via khugepaged > > > + * 2. While simultaneously disallowing THP allocation > > > + * during page fault handling > > > + */ > > > + if (get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_PAGEFAULT, BIT(PMD_ORDER)) != > > > + BIT(PMD_ORDER)) > > > > Hmmm so you return a bitmask of orders, but then you only allow this fault if > > the only order provided is PMD order? That seems strange. Can you explain? > > This is do_huge_pmd_anonymous_page(), which can only accept a PMD > order; otherwise it might result in unexpected behavior. OK please document this in the comment. 
> > > > > > + return VM_FAULT_FALLBACK; > > > > It'd be good to have a helper function for this like: > > > > if (!bpf_hook_allow_pmd_order(vma, tva_flag)) > > return VM_FAULT_FALLBACK; > > > > And implemented like maybe: > > > > static bool bpf_hook_allow_pmd_order(struct vm_area_struct *vma, enum tva_type tva_flag) > > { > > int orders = get_suggested_order(vma->vm_mm, vma, vma->vm_flags, tva_flag, > > BIT(PMD_ORDER)); > > > > return orders & BIT(PMD_ORDER); > > } > > > > It's good the tva flag gives context though. > > Thanks for the suggestion. > will change it. Thanks! > > > > > > + > > > if (!(vmf->flags & FAULT_FLAG_WRITE) && > > > !mm_forbids_zeropage(vma->vm_mm) && > > > transparent_hugepage_use_zero_page()) { > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > > > > index d3d4f116e14b..935583626db6 100644 > > > --- a/mm/khugepaged.c > > > +++ b/mm/khugepaged.c > > > @@ -474,7 +474,9 @@ void khugepaged_enter_vma(struct vm_area_struct *vma, > > > { > > > if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) && > > > hugepage_pmd_enabled()) { > > > - if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) > > > + if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER) && > > > + get_suggested_order(vma->vm_mm, vma, vm_flags, TVA_KHUGEPAGED, > > > + BIT(PMD_ORDER))) > > > > I don't know why we aren't working the bpf hook into thp_vma_allowable_order()? > > Actually it can be added into thp_vma_allowable_order(). I will change it. Thanks! > > > > > Also a helper would work here. 
> > > > > __khugepaged_enter(vma->vm_mm); > > > } > > > } > > > @@ -934,6 +936,8 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address, > > > return SCAN_ADDRESS_RANGE; > > > if (!thp_vma_allowable_order(vma, vma->vm_flags, type, PMD_ORDER)) > > > return SCAN_VMA_CHECK; > > > + if (!get_suggested_order(vma->vm_mm, vma, vma->vm_flags, type, BIT(PMD_ORDER))) > > > + return SCAN_VMA_CHECK; > > > > > > > > > /* > > > * Anon VMA expected, the address may be unmapped then > > > * remapped to file after khugepaged reaquired the mmap_lock. > > > @@ -1465,6 +1469,11 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot) > > > /* khugepaged_mm_lock actually not necessary for the below */ > > > mm_slot_free(mm_slot_cache, mm_slot); > > > mmdrop(mm); > > > + } else if (!get_suggested_order(mm, NULL, 0, -1, BIT(PMD_ORDER))) { > > > + hash_del(&slot->hash); > > > + list_del(&slot->mm_node); > > > + mm_flags_clear(MMF_VM_HUGEPAGE, mm); > > > + mm_slot_free(mm_slot_cache, mm_slot); > > > } > > > } > > > > > > @@ -1538,6 +1547,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, > > > if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER)) > > > return SCAN_VMA_CHECK; > > > > > > + if (!get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_FORCED_COLLAPSE, > > > + BIT(PMD_ORDER))) > > > > Again, can we please not duplicate thp_vma_allowable_order() logic? > > > > The THP code is horrible enough, but now we have to remember to also do the bpf > > check? > > makes sense. > > > > > > + return SCAN_VMA_CHECK; > > > /* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */ > > > if (userfaultfd_wp(vma)) > > > return SCAN_PTE_UFFD_WP; > > > @@ -2416,6 +2428,10 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result, > > > * the next mm on the list. > > > */ > > > vma = NULL; > > > + > > > + /* If this mm is not suitable for the scan list, we should remove it. 
*/ > > > + if (!get_suggested_order(mm, NULL, 0, -1, BIT(PMD_ORDER))) > > > + goto breakouterloop_mmap_lock; > > > > OK again I'm really not loving this NULL, 0, -1 stuff. What is this supposed to > > mean? The idea here is we have a hook for 'trying to determine THP order' and > > now it's overloaded it seems in multiple ways? > > > > I may be missing context here. > > > > I'm also a bit perplexed by the comment as to what is intended here. > > Using a BPF-based approach for THP adjustment allows us to dynamically > enable or disable THP for running applications without causing any > disruption. This capability is particularly valuable in production > environments. The logic here is designed to achieve exactly that. > > > > > > > if (unlikely(!mmap_read_trylock(mm))) > > > goto breakouterloop_mmap_lock; > > > > > > @@ -2432,7 +2448,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result, > > > progress++; > > > break; > > > } > > > - if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) { > > > + if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER) || > > > + !get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_KHUGEPAGED, > > > + BIT(PMD_ORDER))) { > > > > Same various comments from above. > > will change it. > > > > > > skip: > > > progress++; > > > continue; > > > @@ -2769,6 +2787,10 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start, > > > if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER)) > > > return -EINVAL; > > > > > > + if (!get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_FORCED_COLLAPSE, > > > + BIT(PMD_ORDER))) > > > + return -EINVAL; > > > + > > > > Same various comments from above. > > will change it. 
> > > > > > cc = kmalloc(sizeof(*cc), GFP_KERNEL); > > > if (!cc) > > > return -ENOMEM; > > > diff --git a/mm/memory.c b/mm/memory.c > > > index d9de6c056179..0178857aa058 100644 > > > --- a/mm/memory.c > > > +++ b/mm/memory.c > > > @@ -4486,6 +4486,7 @@ static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset, > > > static struct folio *alloc_swap_folio(struct vm_fault *vmf) > > > { > > > struct vm_area_struct *vma = vmf->vma; > > > + int order, suggested_orders; > > > unsigned long orders; > > > struct folio *folio; > > > unsigned long addr; > > > @@ -4493,7 +4494,6 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf) > > > spinlock_t *ptl; > > > pte_t *pte; > > > gfp_t gfp; > > > - int order; > > > > > > /* > > > * If uffd is active for the vma we need per-page fault fidelity to > > > @@ -4510,13 +4510,18 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf) > > > if (!zswap_never_enabled()) > > > goto fallback; > > > > > > + suggested_orders = get_suggested_order(vma->vm_mm, vma, vma->vm_flags, > > > + TVA_PAGEFAULT, > > > + BIT(PMD_ORDER) - 1); > > > + if (!suggested_orders) > > > + goto fallback; > > (Thanks for all above! :) > > Wait, but below we have a bunch of fallbacks, now BPF overrides everything? > > When allocating high-order pages is not feasible, such as during > periods of high memory pressure, the system should immediately fall > back to using 4 KB pages. OK makes sense. > > > > > I know I'm repeating myself :P but can we just please put this into > > thp_vma_allowable_orders(), it's massively gross to just duplicate this check > > _everywhere_ with subtle differences. > > will change it. Thanks > > > > > > entry = pte_to_swp_entry(vmf->orig_pte); > > > /* > > > * Get a list of all the (large) orders below PMD_ORDER that are enabled > > > * and suitable for swapping THP. 
> > > */ > > > orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT, > > > - BIT(PMD_ORDER) - 1); > > > + suggested_orders); > > > orders = thp_vma_suitable_orders(vma, vmf->address, orders); > > > orders = thp_swap_suitable_orders(swp_offset(entry), > > > vmf->address, orders); > > > @@ -5044,12 +5049,12 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf) > > > { > > > struct vm_area_struct *vma = vmf->vma; > > > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > > > + int order, suggested_orders; > > > unsigned long orders; > > > struct folio *folio; > > > unsigned long addr; > > > pte_t *pte; > > > gfp_t gfp; > > > - int order; > > > > > > /* > > > * If uffd is active for the vma we need per-page fault fidelity to > > > @@ -5058,13 +5063,18 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf) > > > if (unlikely(userfaultfd_armed(vma))) > > > goto fallback; > > > > > > + suggested_orders = get_suggested_order(vma->vm_mm, vma, vma->vm_flags, > > > + TVA_PAGEFAULT, > > > + BIT(PMD_ORDER) - 1); > > > + if (!suggested_orders) > > > + goto fallback; > > > > Same comment as above. > > will change it. Thanks! > > > Thanks a lot for your comments. No problem, thanks for the series! I am generally excited about exploring this, so once we figure out details be good to see where this can go! > > > -- > Regards > > Yafang Cheers, Lorenzo