On 29 Jul 2025, at 5:18, Yafang Shao wrote:

> This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
> THP tuning. It includes a hook get_suggested_order() [0], allowing BPF
> programs to influence THP order selection based on factors such as:
> - Workload identity
>   For example, workloads running in specific containers or cgroups.
> - Allocation context
>   Whether the allocation occurs during a page fault, khugepaged, or other
>   paths.
> - System memory pressure
>   (May require new BPF helpers to accurately assess memory pressure.)
>
> Key Details:
> - Only one BPF program can be attached at a time, but it can be updated
>   dynamically to adjust the policy.
> - Supports automatic mTHP order selection and per-workload THP policies.
> - Only functional when THP is set to madvise or always.
>
> Experimental Status:
> - Requires CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION to enable. [1]
> - This feature is unstable and may evolve in future kernel versions.
>
> Link: https://lwn.net/ml/all/9bc57721-5287-416c-aa30-46932d605f63@xxxxxxxxxx/ [0]
> Link: https://lwn.net/ml/all/dda67ea5-2943-497c-a8e5-d81f0733047d@lucifer.local/ [1]
>
> Suggested-by: David Hildenbrand <david@xxxxxxxxxx>
> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>
> Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx>
> ---
>  include/linux/huge_mm.h    |  13 +++
>  include/linux/khugepaged.h |  12 ++-
>  mm/Kconfig                 |  12 +++
>  mm/Makefile                |   1 +
>  mm/bpf_thp.c               | 172 +++++++++++++++++++++++++++++++++++++
>  mm/huge_memory.c           |   9 ++
>  mm/khugepaged.c            |  18 +++-
>  mm/memory.c                |  14 ++-
>  8 files changed, 244 insertions(+), 7 deletions(-)
>  create mode 100644 mm/bpf_thp.c
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 2f190c90192d..5a1527b3b6f0 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -6,6 +6,8 @@
>
>  #include <linux/fs.h> /* only for vma_is_dax() */
>  #include <linux/kobject.h>
> +#include <linux/pgtable.h>
> +#include <linux/mm.h>
>
>  vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
>  int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> @@ -54,6 +56,7 @@ enum transparent_hugepage_flag {
>         TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
>         TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
>         TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
> +       TRANSPARENT_HUGEPAGE_BPF_ATTACHED,      /* BPF prog is attached */
>  };
>
>  struct kobject;
> @@ -190,6 +193,16 @@ static inline bool hugepage_global_always(void)
>                         (1<<TRANSPARENT_HUGEPAGE_FLAG);
>  }
>
> +#ifdef CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION
> +int get_suggested_order(struct mm_struct *mm, unsigned long tva_flags, int order);
> +#else
> +static inline int
> +get_suggested_order(struct mm_struct *mm, unsigned long tva_flags, int order)
> +{
> +       return order;
> +}
> +#endif
> +
>  static inline int highest_order(unsigned long orders)
>  {
>         return fls_long(orders) - 1;
> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> index b8d69cfbb58b..e0242968a020 100644
> --- a/include/linux/khugepaged.h
> +++ b/include/linux/khugepaged.h
> @@ -2,6 +2,8 @@
>  #ifndef _LINUX_KHUGEPAGED_H
>  #define _LINUX_KHUGEPAGED_H
>
> +#include <linux/huge_mm.h>
> +
>  extern unsigned int khugepaged_max_ptes_none __read_mostly;
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  extern struct attribute_group khugepaged_attr_group;
> @@ -20,7 +22,15 @@ extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>
>  static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
>  {
> -       if (test_bit(MMF_VM_HUGEPAGE, &oldmm->flags))
> +       /*
> +        * THP allocation policy can be dynamically modified via BPF. If a
> +        * long-lived task was previously allowed to allocate THP but is no
> +        * longer permitted under the new policy, we must ensure its forked
> +        * child processes also inherit this restriction.

The comment would probably read better as:

        THP allocation policy can be dynamically modified via BPF. Even if
        a task was allowed to allocate THPs, BPF can decide whether its
        forked child can allocate THPs. The MMF_VM_HUGEPAGE flag will be
        cleared by khugepaged.

because the code here only decides the forked child's mm flag; it has
nothing to do with the parent's THP policy.

> +        * The MMF_VM_HUGEPAGE flag will be cleared by khugepaged.
> +        */
> +       if (test_bit(MMF_VM_HUGEPAGE, &oldmm->flags) &&
> +           get_suggested_order(mm, 0, PMD_ORDER) == PMD_ORDER)

Will this work for mTHPs? Nico is adding mTHP support to khugepaged [1].
What if a BPF program wants khugepaged to work on some mTHP orders? Maybe
get_suggested_order() should accept a bitmask of all allowed orders and
return a bitmask as well; khugepaged would be skipped only if the returned
bitmask is 0.

[1] https://lore.kernel.org/linux-mm/20250714003207.113275-1-npache@xxxxxxxxxx/
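
Roughly what I have in mind, as a sketch only (get_suggested_orders() is a
made-up name, and the use of THP_ORDERS_ALL_ANON here is illustrative, not
part of this patch):

        /*
         * Illustrative only: a bitmask-based variant of the hook. The BPF
         * program receives the set of orders the caller is willing to use
         * and returns the subset it allows; 0 means no THP order is allowed.
         */
        unsigned long get_suggested_orders(struct mm_struct *mm,
                                           unsigned long tva_flags,
                                           unsigned long orders);

        static inline void khugepaged_fork(struct mm_struct *mm,
                                           struct mm_struct *oldmm)
        {
                /* Enter khugepaged only if at least one order remains allowed. */
                if (test_bit(MMF_VM_HUGEPAGE, &oldmm->flags) &&
                    get_suggested_orders(mm, 0, THP_ORDERS_ALL_ANON))
                        __khugepaged_enter(mm);
        }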
>                 __khugepaged_enter(mm);
>  }
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 781be3240e21..5d05a537ecde 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -908,6 +908,18 @@ config NO_PAGE_MAPCOUNT
>
>           EXPERIMENTAL because the impact of some changes is still unclear.
>
> +config EXPERIMENTAL_BPF_ORDER_SELECTION
> +       bool "BPF-based THP order selection (EXPERIMENTAL)"
> +       depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL
> +
> +       help
> +         Enable dynamic THP order selection using BPF programs. This
> +         experimental feature allows custom BPF logic to determine optimal
> +         transparent hugepage allocation sizes at runtime.
> +
> +         Warning: This feature is unstable and may change in future kernel
> +         versions.
> +
>  endif # TRANSPARENT_HUGEPAGE
>
>  # simple helper to make the code a bit easier to read
> diff --git a/mm/Makefile b/mm/Makefile
> index 1a7a11d4933d..562525e6a28a 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -99,6 +99,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
>  obj-$(CONFIG_NUMA) += memory-tiers.o
>  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> +obj-$(CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION) += bpf_thp.o
>  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
>  obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
>  obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
> diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
> new file mode 100644
> index 000000000000..10b486dd8bc4
> --- /dev/null
> +++ b/mm/bpf_thp.c
> @@ -0,0 +1,172 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/bpf.h>
> +#include <linux/btf.h>
> +#include <linux/huge_mm.h>
> +#include <linux/khugepaged.h>
> +
> +struct bpf_thp_ops {
> +       /**
> +        * @get_suggested_order: Get the suggested highest THP order for allocation
> +        * @mm: mm_struct associated with the THP allocation
> +        * @tva_flags: TVA flags for current context
> +        *             %TVA_IN_PF: Set when in page fault context
> +        *             Other flags: Reserved for future use
> +        * @order: The highest order being considered for this THP allocation.
> +        *         %PUD_ORDER for PUD-mapped allocations

As I mentioned in the cover letter, PMD_ORDER is the highest order mm
currently supports. I also wonder whether it would be better to use a
bitmask of orders here, to better support mTHP.

> +        *         %PMD_ORDER for PMD-mapped allocations
> +        *         %PMD_ORDER - 1 for mTHP allocations
> +        *
> +        * Return: Suggested highest THP order to use for allocation. The returned
> +        * order will never exceed the input @order value.
> +        */
> +       int (*get_suggested_order)(struct mm_struct *mm, unsigned long tva_flags, int order) __rcu;
> +};
> +
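
As a data point for how this would be consumed, a policy program against
this struct_ops could look roughly like the sketch below. This is untested
and assumes the usual libbpf struct_ops conventions, that struct
bpf_thp_ops is visible through vmlinux.h, and that the TVA_IN_PF value
matches the kernel's definition:

        // SPDX-License-Identifier: GPL-2.0
        /* Sketch of a THP policy: allow THP only in page-fault context,
         * otherwise suggest base pages (order 0). Untested.
         */
        #include "vmlinux.h"
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_tracing.h>

        /* Assumed to match TVA_IN_PF in include/linux/huge_mm.h; macros are
         * not emitted into vmlinux.h, so it has to be redefined here.
         */
        #define TVA_IN_PF      (1 << 1)

        char _license[] SEC("license") = "GPL";

        SEC("struct_ops/get_suggested_order")
        int BPF_PROG(suggest_order, struct mm_struct *mm,
                     unsigned long tva_flags, int order)
        {
                /* Outside page faults (e.g. khugepaged), fall back to base pages. */
                if (!(tva_flags & TVA_IN_PF))
                        return 0;

                /* Otherwise accept the caller's proposal; never exceed @order. */
                return order;
        }

        SEC(".struct_ops.link")
        struct bpf_thp_ops thp_policy = {
                .get_suggested_order = (void *)suggest_order,
        };

Loading this with libbpf and attaching it via bpf_map__attach_struct_ops()
would then install the policy; per the cover letter, only one such program
can be attached at a time.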

Best Regards,
Yan, Zi