On Wed, Sep 10, 2025 at 8:42 PM Lance Yang <lance.yang@xxxxxxxxx> wrote: > > Hey Yafang, > > On Wed, Sep 10, 2025 at 10:53 AM Yafang Shao <laoar.shao@xxxxxxxxx> wrote: > > > > This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic > > THP tuning. It includes a hook bpf_hook_thp_get_order(), allowing BPF > > programs to influence THP order selection based on factors such as: > > - Workload identity > > For example, workloads running in specific containers or cgroups. > > - Allocation context > > Whether the allocation occurs during a page fault, khugepaged, swap or > > other paths. > > - VMA's memory advice settings > > MADV_HUGEPAGE or MADV_NOHUGEPAGE > > - Memory pressure > > PSI system data or associated cgroup PSI metrics > > > > The kernel API of this new BPF hook is as follows, > > > > /** > > * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation > > * @vma: vm_area_struct associated with the THP allocation > > * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set > > * BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if > > * neither is set. > > * @tva_type: TVA type for current @vma > > * @orders: Bitmask of requested THP orders for this allocation > > * - PMD-mapped allocation if PMD_ORDER is set > > * - mTHP allocation otherwise > > * > > * Return: The suggested THP order from the BPF program for allocation. It will > > * not exceed the highest requested order in @orders. Return -1 to > > * indicate that the original requested @orders should remain unchanged. > > */ > > typedef int thp_order_fn_t(struct vm_area_struct *vma, > > enum bpf_thp_vma_type vma_type, > > enum tva_type tva_type, > > unsigned long orders); > > > > Only a single BPF program can be attached at any given time, though it can > > be dynamically updated to adjust the policy. The implementation supports > > anonymous THP, shmem THP, and mTHP, with future extensions planned for > > file-backed THP. > > > > This functionality is only active when system-wide THP is configured to > > madvise or always mode. It remains disabled in never mode. Additionally, > > if THP is explicitly disabled for a specific task via prctl(), this BPF > > functionality will also be unavailable for that task. > > > > This feature requires CONFIG_BPF_GET_THP_ORDER (marked EXPERIMENTAL) to be > > enabled. Note that this capability is currently unstable and may undergo > > significant changes—including potential removal—in future kernel versions. > > > > Suggested-by: David Hildenbrand <david@xxxxxxxxxx> > > Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> > > Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx> > > --- > [...] > > diff --git a/mm/huge_memory_bpf.c b/mm/huge_memory_bpf.c > > new file mode 100644 > > index 000000000000..525ee22ab598 > > --- /dev/null > > +++ b/mm/huge_memory_bpf.c > > @@ -0,0 +1,243 @@ > > +// SPDX-License-Identifier: GPL-2.0 > > +/* > > + * BPF-based THP policy management > > + * > > + * Author: Yafang Shao <laoar.shao@xxxxxxxxx> > > + */ > > + > > +#include <linux/bpf.h> > > +#include <linux/btf.h> > > +#include <linux/huge_mm.h> > > +#include <linux/khugepaged.h> > > + > > +enum bpf_thp_vma_type { > > + BPF_THP_VM_NONE = 0, > > + BPF_THP_VM_HUGEPAGE, /* VM_HUGEPAGE */ > > + BPF_THP_VM_NOHUGEPAGE, /* VM_NOHUGEPAGE */ > > +}; > > + > > +/** > > + * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation > > + * @vma: vm_area_struct associated with the THP allocation > > + * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set > > + * BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if > > + * neither is set. > > + * @tva_type: TVA type for current @vma > > + * @orders: Bitmask of requested THP orders for this allocation > > + * - PMD-mapped allocation if PMD_ORDER is set > > + * - mTHP allocation otherwise > > + * > > + * Return: The suggested THP order from the BPF program for allocation. It will > > + * not exceed the highest requested order in @orders. Return -1 to > > + * indicate that the original requested @orders should remain unchanged. > > A minor documentation nit: the comment says "Return -1 to indicate that the > original requested @orders should remain unchanged". It might be slightly > clearer to say "Return a negative value to fall back to the original > behavior". This would cover all error codes as well ;) > > > + */ > > +typedef int thp_order_fn_t(struct vm_area_struct *vma, > > + enum bpf_thp_vma_type vma_type, > > + enum tva_type tva_type, > > + unsigned long orders); > > Sorry if I'm missing some context here since I haven't tracked the whole > series closely. > > Regarding the return value for thp_order_fn_t: right now it returns a > single int order. I was thinking, what if we let it return an unsigned > long bitmask of orders instead? This seems like it would be more flexible > down the road, especially if we get more mTHP sizes to choose from. It > would also make the API more consistent, as bpf_hook_thp_get_orders() > itself returns an unsigned long ;) I just realized a flaw in my previous suggestion :( Changing the return type of thp_order_fn_t to unsigned long for consistency and flexibility. However, I completely overlooked that this would prevent the BPF program from returning negative error codes ... Thanks, Lance > > Also, for future extensions, it might be a good idea to add a reserved > flags argument to the thp_order_fn_t signature. > > For example thp_order_fn_t(..., unsigned long flags). > > This would give us aforward-compatible way to add new semantics later > without breaking the ABI and needing a v2. We could just require it to be > 0 for now. > > Thanks for the great work! > Lance