On 29 Jul 2025, at 5:18, Yafang Shao wrote:

> This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
> THP tuning. It includes a hook get_suggested_order() [0], allowing BPF
> programs to influence THP order selection based on factors such as:
> - Workload identity
>   For example, workloads running in specific containers or cgroups.
> - Allocation context
>   Whether the allocation occurs during a page fault, khugepaged, or other
>   paths.
> - System memory pressure
>   (May require new BPF helpers to accurately assess memory pressure.)
>
> Key Details:
> - Only one BPF program can be attached at a time, but it can be updated
>   dynamically to adjust the policy.
> - Supports automatic mTHP order selection and per-workload THP policies.
> - Only functional when THP is set to madvise or always.
>
> Experimental Status:
> - Requires CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION to enable. [1]
> - This feature is unstable and may evolve in future kernel versions.
>
> Link: https://lwn.net/ml/all/9bc57721-5287-416c-aa30-46932d605f63@xxxxxxxxxx/ [0]
> Link: https://lwn.net/ml/all/dda67ea5-2943-497c-a8e5-d81f0733047d@lucifer.local/ [1]
>
> Suggested-by: David Hildenbrand <david@xxxxxxxxxx>
> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>
> Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx>
> ---
>  include/linux/huge_mm.h    |  13 +++
>  include/linux/khugepaged.h |  12 ++-
>  mm/Kconfig                 |  12 +++
>  mm/Makefile                |   1 +
>  mm/bpf_thp.c               | 172 +++++++++++++++++++++++++++++++++++++
>  mm/huge_memory.c           |   9 ++
>  mm/khugepaged.c            |  18 +++-
>  mm/memory.c                |  14 ++-
>  8 files changed, 244 insertions(+), 7 deletions(-)
>  create mode 100644 mm/bpf_thp.c
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 2f190c90192d..5a1527b3b6f0 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -6,6 +6,8 @@
>
>  #include <linux/fs.h> /* only for vma_is_dax() */
>  #include <linux/kobject.h>
> +#include <linux/pgtable.h>
> +#include <linux/mm.h>
>
>  vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
>  int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> @@ -54,6 +56,7 @@ enum transparent_hugepage_flag {
>         TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
>         TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
>         TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
> +       TRANSPARENT_HUGEPAGE_BPF_ATTACHED,      /* BPF prog is attached */
>  };
>
>  struct kobject;
> @@ -190,6 +193,16 @@ static inline bool hugepage_global_always(void)
>                         (1<<TRANSPARENT_HUGEPAGE_FLAG);
>  }
>
> +#ifdef CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION
> +int get_suggested_order(struct mm_struct *mm, unsigned long tva_flags, int order);
> +#else
> +static inline int
> +get_suggested_order(struct mm_struct *mm, unsigned long tva_flags, int order)
> +{
> +       return order;
> +}
> +#endif
> +
>  static inline int highest_order(unsigned long orders)
>  {
>         return fls_long(orders) - 1;
> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> index b8d69cfbb58b..e0242968a020 100644
> --- a/include/linux/khugepaged.h
> +++ b/include/linux/khugepaged.h
> @@ -2,6 +2,8 @@
>  #ifndef _LINUX_KHUGEPAGED_H
>  #define _LINUX_KHUGEPAGED_H
>
> +#include <linux/huge_mm.h>
> +
>  extern unsigned int khugepaged_max_ptes_none __read_mostly;
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  extern struct attribute_group khugepaged_attr_group;
> @@ -20,7 +22,15 @@ extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>
>  static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
>  {
> -       if (test_bit(MMF_VM_HUGEPAGE, &oldmm->flags))
> +       /*
> +        * THP allocation policy can be dynamically modified via BPF. If a
> +        * long-lived task was previously allowed to allocate THP but is no
> +        * longer permitted under the new policy, we must ensure its forked
> +        * child processes also inherit this restriction.

The comment would probably read better as:

        THP allocation policy can be dynamically modified via BPF. Even if
        a task was allowed to allocate THPs, BPF can decide whether its
        forked child can allocate THPs. The MMF_VM_HUGEPAGE flag will be
        cleared by khugepaged.

because the code here only decides the forked child's mm flag; it has
nothing to do with the parent's THP policy.

> +        * The MMF_VM_HUGEPAGE flag will be cleared by khugepaged.
> +        */
> +       if (test_bit(MMF_VM_HUGEPAGE, &oldmm->flags) &&
> +           get_suggested_order(mm, 0, PMD_ORDER) == PMD_ORDER)

Will this work for mTHPs? Nico is adding mTHP support to khugepaged [1].
What if a BPF program wants khugepaged to work on some mTHP orders? Maybe
get_suggested_order() should accept a bitmask of all allowed orders and
return a bitmask as well; khugepaged would be skipped only if the returned
bitmask is 0.

[1] https://lore.kernel.org/linux-mm/20250714003207.113275-1-npache@xxxxxxxxxx/
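
Roughly what I have in mind, as a sketch only (get_suggested_orders() is a
made-up name, and the use of THP_ORDERS_ALL_ANON here is illustrative, not
part of this patch):

        /*
         * Illustrative only: a bitmask-based variant of the hook. The BPF
         * program receives the set of orders the caller is willing to use
         * and returns the subset it allows; 0 means no THP order is allowed.
         */
        unsigned long get_suggested_orders(struct mm_struct *mm,
                                           unsigned long tva_flags,
                                           unsigned long orders);

        static inline void khugepaged_fork(struct mm_struct *mm,
                                           struct mm_struct *oldmm)
        {
                /* Enter khugepaged only if at least one order remains allowed. */
                if (test_bit(MMF_VM_HUGEPAGE, &oldmm->flags) &&
                    get_suggested_orders(mm, 0, THP_ORDERS_ALL_ANON))
                        __khugepaged_enter(mm);
        }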
>                 __khugepaged_enter(mm);
>  }
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 781be3240e21..5d05a537ecde 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -908,6 +908,18 @@ config NO_PAGE_MAPCOUNT
>
>           EXPERIMENTAL because the impact of some changes is still unclear.
>
> +config EXPERIMENTAL_BPF_ORDER_SELECTION
> +       bool "BPF-based THP order selection (EXPERIMENTAL)"
> +       depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL
> +
> +       help
> +         Enable dynamic THP order selection using BPF programs. This
> +         experimental feature allows custom BPF logic to determine optimal
> +         transparent hugepage allocation sizes at runtime.
> +
> +         Warning: This feature is unstable and may change in future kernel
> +         versions.
> +
>  endif # TRANSPARENT_HUGEPAGE
>
>  # simple helper to make the code a bit easier to read
> diff --git a/mm/Makefile b/mm/Makefile
> index 1a7a11d4933d..562525e6a28a 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -99,6 +99,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
>  obj-$(CONFIG_NUMA) += memory-tiers.o
>  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> +obj-$(CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION) += bpf_thp.o
>  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
>  obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
>  obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
> diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
> new file mode 100644
> index 000000000000..10b486dd8bc4
> --- /dev/null
> +++ b/mm/bpf_thp.c
> @@ -0,0 +1,172 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/bpf.h>
> +#include <linux/btf.h>
> +#include <linux/huge_mm.h>
> +#include <linux/khugepaged.h>
> +
> +struct bpf_thp_ops {
> +       /**
> +        * @get_suggested_order: Get the suggested highest THP order for allocation
> +        * @mm: mm_struct associated with the THP allocation
> +        * @tva_flags: TVA flags for current context
> +        *             %TVA_IN_PF: Set when in page fault context
> +        *             Other flags: Reserved for future use
> +        * @order: The highest order being considered for this THP allocation.
> +        *         %PUD_ORDER for PUD-mapped allocations

As I mentioned in the cover letter, PMD_ORDER is the highest order mm
currently supports. I also wonder whether it would be better to use a
bitmask of orders here, to better support mTHP.

> +        *         %PMD_ORDER for PMD-mapped allocations
> +        *         %PMD_ORDER - 1 for mTHP allocations
> +        *
> +        * Return: Suggested highest THP order to use for allocation. The returned
> +        * order will never exceed the input @order value.
> +        */
> +       int (*get_suggested_order)(struct mm_struct *mm, unsigned long tva_flags, int order) __rcu;
> +};
> +
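
As a data point for how this would be consumed, a policy program against
this struct_ops could look roughly like the sketch below. This is untested
and assumes the usual libbpf struct_ops conventions, that struct
bpf_thp_ops is visible through vmlinux.h, and that the TVA_IN_PF value
matches the kernel's definition:

        // SPDX-License-Identifier: GPL-2.0
        /* Sketch of a THP policy: allow THP only in page-fault context,
         * otherwise suggest base pages (order 0). Untested.
         */
        #include "vmlinux.h"
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_tracing.h>

        /* Assumed to match TVA_IN_PF in include/linux/huge_mm.h; macros are
         * not emitted into vmlinux.h, so it has to be redefined here.
         */
        #define TVA_IN_PF      (1 << 1)

        char _license[] SEC("license") = "GPL";

        SEC("struct_ops/get_suggested_order")
        int BPF_PROG(suggest_order, struct mm_struct *mm,
                     unsigned long tva_flags, int order)
        {
                /* Outside page faults (e.g. khugepaged), fall back to base pages. */
                if (!(tva_flags & TVA_IN_PF))
                        return 0;

                /* Otherwise accept the caller's proposal; never exceed @order. */
                return order;
        }

        SEC(".struct_ops.link")
        struct bpf_thp_ops thp_policy = {
                .get_suggested_order = (void *)suggest_order,
        };

Loading this with libbpf and attaching it via bpf_map__attach_struct_ops()
would then install the policy; per the cover letter, only one such program
can be attached at a time.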

Best Regards,
Yan, Zi