On Tue, Jul 29, 2025 at 11:32 PM Zi Yan <ziy@xxxxxxxxxx> wrote:
>
> On 29 Jul 2025, at 5:18, Yafang Shao wrote:
>
> > This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
> > THP tuning. It includes a hook get_suggested_order() [0], allowing BPF
> > programs to influence THP order selection based on factors such as:
> > - Workload identity
> >   For example, workloads running in specific containers or cgroups.
> > - Allocation context
> >   Whether the allocation occurs during a page fault, khugepaged, or other
> >   paths.
> > - System memory pressure
> >   (May require new BPF helpers to accurately assess memory pressure.)
> >
> > Key Details:
> > - Only one BPF program can be attached at a time, but it can be updated
> >   dynamically to adjust the policy.
> > - Supports automatic mTHP order selection and per-workload THP policies.
> > - Only functional when THP is set to madvise or always.
> >
> > Experimental Status:
> > - Requires CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION to enable. [1]
> > - This feature is unstable and may evolve in future kernel versions.
> >
> > Link: https://lwn.net/ml/all/9bc57721-5287-416c-aa30-46932d605f63@xxxxxxxxxx/ [0]
> > Link: https://lwn.net/ml/all/dda67ea5-2943-497c-a8e5-d81f0733047d@lucifer.local/ [1]
> >
> > Suggested-by: David Hildenbrand <david@xxxxxxxxxx>
> > Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>
> > Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx>
> > ---
> >  include/linux/huge_mm.h    |  13 +++
> >  include/linux/khugepaged.h |  12 ++-
> >  mm/Kconfig                 |  12 +++
> >  mm/Makefile                |   1 +
> >  mm/bpf_thp.c               | 172 +++++++++++++++++++++++++++++++++++++
> >  mm/huge_memory.c           |   9 ++
> >  mm/khugepaged.c            |  18 +++-
> >  mm/memory.c                |  14 ++-
> >  8 files changed, 244 insertions(+), 7 deletions(-)
> >  create mode 100644 mm/bpf_thp.c
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 2f190c90192d..5a1527b3b6f0 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -6,6 +6,8 @@
> >
> >  #include <linux/fs.h> /* only for vma_is_dax() */
> >  #include <linux/kobject.h>
> > +#include <linux/pgtable.h>
> > +#include <linux/mm.h>
> >
> >  vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
> >  int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > @@ -54,6 +56,7 @@ enum transparent_hugepage_flag {
> >  	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
> >  	TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
> >  	TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
> > +	TRANSPARENT_HUGEPAGE_BPF_ATTACHED, /* BPF prog is attached */
> >  };
> >
> >  struct kobject;
> > @@ -190,6 +193,16 @@ static inline bool hugepage_global_always(void)
> >  			(1<<TRANSPARENT_HUGEPAGE_FLAG);
> >  }
> >
> > +#ifdef CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION
> > +int get_suggested_order(struct mm_struct *mm, unsigned long tva_flags, int order);
> > +#else
> > +static inline int
> > +get_suggested_order(struct mm_struct *mm, unsigned long tva_flags, int order)
> > +{
> > +	return order;
> > +}
> > +#endif
> > +
> >  static inline int highest_order(unsigned long orders)
> >  {
> >  	return fls_long(orders) - 1;
> > diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> > index b8d69cfbb58b..e0242968a020 100644
> > --- a/include/linux/khugepaged.h
> > +++ b/include/linux/khugepaged.h
> > @@ -2,6 +2,8 @@
> >  #ifndef _LINUX_KHUGEPAGED_H
> >  #define _LINUX_KHUGEPAGED_H
> >
> > +#include <linux/huge_mm.h>
> > +
> >  extern unsigned int khugepaged_max_ptes_none __read_mostly;
> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >  extern struct attribute_group khugepaged_attr_group;
> > @@ -20,7 +22,15 @@ extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> >
> >  static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> >  {
> > -	if (test_bit(MMF_VM_HUGEPAGE, &oldmm->flags))
> > +	/*
> > +	 * THP allocation policy can be dynamically modified via BPF. If a
> > +	 * long-lived task was previously allowed to allocate THP but is no
> > +	 * longer permitted under the new policy, we must ensure its forked
> > +	 * child processes also inherit this restriction.
>
> The comment is probably better to be:
>
> THP allocation policy can be dynamically modified via BPF. Even if a task
> was allowed to allocate THPs, BPF can decide whether its forked child
> can allocate THPs.
>
> The MMF_VM_HUGEPAGE flag will be cleared by khugepaged.
>
> Because the code here just wants to change a forked child’s mm flag. It has
> nothing to do with its parent THP policy.

Thanks for the improvement. I will change it.

>
> > +	 * The MMF_VM_HUGEPAGE flag will be cleared by khugepaged.
> > +	 */
> > +	if (test_bit(MMF_VM_HUGEPAGE, &oldmm->flags) &&
> > +	    get_suggested_order(mm, 0, PMD_ORDER) == PMD_ORDER)
>
> Will it work for mTHPs? Nico is adding mTHP support for khugepaged[1].
> What if a BPF program wants khugepaged to work on some mTHP orders?
>
> Maybe get_suggested_order() should accept a bitmask of all allowed
> orders and return a bitmask as well. Only if the returned bitmask
> is 0, khugepaged is not entered.
>
> [1] https://lore.kernel.org/linux-mm/20250714003207.113275-1-npache@xxxxxxxxxx/

Thanks for the information. It seems extending this to use a bitmask
would better accommodate future changes. I’ll give it some thought.

>
> >  		__khugepaged_enter(mm);
> >  }
> >
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 781be3240e21..5d05a537ecde 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -908,6 +908,18 @@ config NO_PAGE_MAPCOUNT
> >
> >  	  EXPERIMENTAL because the impact of some changes is still unclear.
> >
> > +config EXPERIMENTAL_BPF_ORDER_SELECTION
> > +	bool "BPF-based THP order selection (EXPERIMENTAL)"
> > +	depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL
> > +
> > +	help
> > +	  Enable dynamic THP order selection using BPF programs. This
> > +	  experimental feature allows custom BPF logic to determine optimal
> > +	  transparent hugepage allocation sizes at runtime.
> > +
> > +	  Warning: This feature is unstable and may change in future kernel
> > +	  versions.
> > +
> >  endif # TRANSPARENT_HUGEPAGE
> >
> >  # simple helper to make the code a bit easier to read
> > diff --git a/mm/Makefile b/mm/Makefile
> > index 1a7a11d4933d..562525e6a28a 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -99,6 +99,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
> >  obj-$(CONFIG_NUMA) += memory-tiers.o
> >  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
> >  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> > +obj-$(CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION) += bpf_thp.o
> >  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> >  obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
> >  obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
> > diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
> > new file mode 100644
> > index 000000000000..10b486dd8bc4
> > --- /dev/null
> > +++ b/mm/bpf_thp.c
> > @@ -0,0 +1,172 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +#include <linux/bpf.h>
> > +#include <linux/btf.h>
> > +#include <linux/huge_mm.h>
> > +#include <linux/khugepaged.h>
> > +
> > +struct bpf_thp_ops {
> > +	/**
> > +	 * @get_suggested_order: Get the suggested highest THP order for allocation
> > +	 * @mm: mm_struct associated with the THP allocation
> > +	 * @tva_flags: TVA flags for current context
> > +	 *             %TVA_IN_PF: Set when in page fault context
> > +	 *             Other flags: Reserved for future use
> > +	 * @order: The highest order being considered for this THP allocation.
> > +	 *         %PUD_ORDER for PUD-mapped allocations
>
> Like I mentioned in the cover letter, PMD_ORDER is the highest order
> mm currently supports. I wonder if it is better to be a bitmask of orders
> to better support mTHP.

I’ll look into it.

Regards
Yafang
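
For reference, the bitmask interface suggested in the review above could look
roughly like the userspace C sketch below. This is only an illustration under
stated assumptions, not code from the patch: every sketch_* name is
hypothetical, SKETCH_PMD_ORDER assumes 4 KiB base pages with 2 MiB PMDs, and
the policy callbacks merely stand in for an attached BPF program.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 4 KiB base pages and 2 MiB PMDs, so the PMD order is 9. */
#define SKETCH_PMD_ORDER 9

/*
 * Hypothetical policy callback standing in for an attached BPF program.
 * Bit N set in @orders means "order N is allowed"; the return value is
 * the subset of orders the policy accepts.
 */
typedef unsigned long (*sketch_policy_fn)(unsigned long orders);

/* Example policy: accept only orders 0..4. */
static unsigned long sketch_policy_small_only(unsigned long orders)
{
	return orders & ((1UL << 5) - 1);
}

/* Example policy: reject every order. */
static unsigned long sketch_policy_deny(unsigned long orders)
{
	(void)orders;
	return 0;
}

/*
 * Bitmask variant of the hook: with no policy attached the caller's mask
 * is returned unchanged, much like the !CONFIG stub in the patch returns
 * @order unchanged.
 */
static unsigned long sketch_suggested_orders(sketch_policy_fn policy,
					     unsigned long orders)
{
	unsigned long result;

	if (!policy)
		return orders;

	/* A policy may only narrow the offered mask, never widen it. */
	result = policy(orders) & orders;
	assert((result & ~orders) == 0);
	return result;
}

/* Userspace stand-in for highest_order(): highest set bit, -1 if none. */
static int sketch_highest_order(unsigned long orders)
{
	int order = -1;

	while (orders) {
		order++;
		orders >>= 1;
	}
	return order;
}

/* khugepaged is skipped only when the returned bitmask is 0. */
static int sketch_should_enter_khugepaged(sketch_policy_fn policy,
					  unsigned long orders)
{
	return sketch_suggested_orders(policy, orders) != 0;
}
```

With orders 2, 4 and 9 offered, sketch_policy_small_only leaves orders 2 and
4, so the highest remaining order is 4; sketch_policy_deny returns 0 and
khugepaged would not be entered.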