On Wed, Sep 10, 2025 at 9:57 PM Lance Yang <lance.yang@xxxxxxxxx> wrote:
>
>
>
> On 2025/9/10 20:54, Lance Yang wrote:
> > On Wed, Sep 10, 2025 at 8:42 PM Lance Yang <lance.yang@xxxxxxxxx> wrote:
> >>
> >> Hey Yafang,
> >>
> >> On Wed, Sep 10, 2025 at 10:53 AM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> >>>
> >>> This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
> >>> THP tuning. It includes a hook bpf_hook_thp_get_order(), allowing BPF
> >>> programs to influence THP order selection based on factors such as:
> >>> - Workload identity
> >>>   For example, workloads running in specific containers or cgroups.
> >>> - Allocation context
> >>>   Whether the allocation occurs during a page fault, khugepaged, swap or
> >>>   other paths.
> >>> - VMA's memory advice settings
> >>>   MADV_HUGEPAGE or MADV_NOHUGEPAGE
> >>> - Memory pressure
> >>>   PSI system data or associated cgroup PSI metrics
> >>>
> >>> The kernel API of this new BPF hook is as follows,
> >>>
> >>> /**
> >>>  * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
> >>>  * @vma: vm_area_struct associated with the THP allocation
> >>>  * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set
> >>>  *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
> >>>  *            neither is set.
> >>>  * @tva_type: TVA type for current @vma
> >>>  * @orders: Bitmask of requested THP orders for this allocation
> >>>  *          - PMD-mapped allocation if PMD_ORDER is set
> >>>  *          - mTHP allocation otherwise
> >>>  *
> >>>  * Return: The suggested THP order from the BPF program for allocation. It will
> >>>  *         not exceed the highest requested order in @orders. Return -1 to
> >>>  *         indicate that the original requested @orders should remain unchanged.
> >>>  */
> >>> typedef int thp_order_fn_t(struct vm_area_struct *vma,
> >>>                            enum bpf_thp_vma_type vma_type,
> >>>                            enum tva_type tva_type,
> >>>                            unsigned long orders);
> >>>
> >>> Only a single BPF program can be attached at any given time, though it can
> >>> be dynamically updated to adjust the policy. The implementation supports
> >>> anonymous THP, shmem THP, and mTHP, with future extensions planned for
> >>> file-backed THP.
> >>>
> >>> This functionality is only active when system-wide THP is configured to
> >>> madvise or always mode. It remains disabled in never mode. Additionally,
> >>> if THP is explicitly disabled for a specific task via prctl(), this BPF
> >>> functionality will also be unavailable for that task.
> >>>
> >>> This feature requires CONFIG_BPF_GET_THP_ORDER (marked EXPERIMENTAL) to be
> >>> enabled. Note that this capability is currently unstable and may undergo
> >>> significant changes, including potential removal, in future kernel versions.
> >>>
> >>> Suggested-by: David Hildenbrand <david@xxxxxxxxxx>
> >>> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>
> >>> Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx>
> >>> ---
> >> [...]
> >>> diff --git a/mm/huge_memory_bpf.c b/mm/huge_memory_bpf.c
> >>> new file mode 100644
> >>> index 000000000000..525ee22ab598
> >>> --- /dev/null
> >>> +++ b/mm/huge_memory_bpf.c
> >>> @@ -0,0 +1,243 @@
> >>> +// SPDX-License-Identifier: GPL-2.0
> >>> +/*
> >>> + * BPF-based THP policy management
> >>> + *
> >>> + * Author: Yafang Shao <laoar.shao@xxxxxxxxx>
> >>> + */
> >>> +
> >>> +#include <linux/bpf.h>
> >>> +#include <linux/btf.h>
> >>> +#include <linux/huge_mm.h>
> >>> +#include <linux/khugepaged.h>
> >>> +
> >>> +enum bpf_thp_vma_type {
> >>> +        BPF_THP_VM_NONE = 0,
> >>> +        BPF_THP_VM_HUGEPAGE,    /* VM_HUGEPAGE */
> >>> +        BPF_THP_VM_NOHUGEPAGE,  /* VM_NOHUGEPAGE */
> >>> +};
> >>> +
> >>> +/**
> >>> + * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
> >>> + * @vma: vm_area_struct associated with the THP allocation
> >>> + * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set
> >>> + *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
> >>> + *            neither is set.
> >>> + * @tva_type: TVA type for current @vma
> >>> + * @orders: Bitmask of requested THP orders for this allocation
> >>> + *          - PMD-mapped allocation if PMD_ORDER is set
> >>> + *          - mTHP allocation otherwise
> >>> + *
> >>> + * Return: The suggested THP order from the BPF program for allocation. It will
> >>> + *         not exceed the highest requested order in @orders. Return -1 to
> >>> + *         indicate that the original requested @orders should remain unchanged.
> >>
> >> A minor documentation nit: the comment says "Return -1 to indicate that the
> >> original requested @orders should remain unchanged". It might be slightly
> >> clearer to say "Return a negative value to fall back to the original
> >> behavior". This would cover all error codes as well ;)

will change it.

> >>
> >>> + */
> >>> +typedef int thp_order_fn_t(struct vm_area_struct *vma,
> >>> +                           enum bpf_thp_vma_type vma_type,
> >>> +                           enum tva_type tva_type,
> >>> +                           unsigned long orders);
> >>
> >> Sorry if I'm missing some context here since I haven't tracked the whole
> >> series closely.
> >>
> >> Regarding the return value for thp_order_fn_t: right now it returns a
> >> single int order. I was thinking, what if we let it return an unsigned
> >> long bitmask of orders instead? This seems like it would be more flexible
> >> down the road, especially if we get more mTHP sizes to choose from. It
> >> would also make the API more consistent, as bpf_hook_thp_get_orders()
> >> itself returns an unsigned long ;)
> >
> > I just realized a flaw in my previous suggestion :(
> >
> > Changing the return type of thp_order_fn_t to unsigned long for consistency
> > and flexibility. However, I completely overlooked that this would prevent
> > the BPF program from returning negative error codes ...
> >
> > Thanks,
> > Lance
> >
> >>
> >> Also, for future extensions, it might be a good idea to add a reserved
> >> flags argument to the thp_order_fn_t signature.
> >>
> >> For example thp_order_fn_t(..., unsigned long flags).
> >>
> >> This would give us a forward-compatible way to add new semantics later
> >> without breaking the ABI and needing a v2. We could just require it to be
> >> 0 for now.

That makes sense. However, as Lorenzo mentioned previously, we should
keep the interface as minimal as possible.

> >>
> >> Thanks for the great work!
> >> Lance
> >
>
> Forgot to add:
>
> Noticed that if the hook returns 0, bpf_hook_thp_get_orders() falls
> back to 'orders', preventing us from dynamically disabling mTHP
> allocations.

Could you please clarify what you mean by that?

+        thp_order = bpf_hook_thp_get_order(vma, vma_type, tva_type, orders);
+        if (thp_order < 0)
+                goto out;

In my implementation, it only falls back to @orders if the return
value is negative.
If the return value is 0, it uses BIT(0):

+        if (thp_order <= highest_order(orders))
+                thp_orders = BIT(thp_order);

>
> Honoring a return of 0 is critical for our use case, which is to
> dynamically disable mTHP for low-priority containers when memory gets
> low in mixed workloads.
>
> And then re-enable it for them when memory is back above the low
> watermark.

Thank you for detailing your use case; that context is very helpful.

--
Regards
Yafang
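
For readers following the thread, the return-value handling discussed
above can be modelled in plain userspace C. This is only a sketch built
from the two quoted snippets, not the kernel code: resolve_orders() and
low_prio_policy() are hypothetical names, highest_order() is a local
stand-in for the kernel helper, and the behaviour for a suggested order
above the highest requested one (keep @orders) is an assumption, since
that path is not shown in this mail.

```c
#include <assert.h>
#include <stdbool.h>

#define BIT(n) (1UL << (n))

/* Userspace stand-in for the kernel's highest_order(): index of the top set bit. */
static int highest_order(unsigned long orders)
{
	int order = -1;

	assert(orders != 0);
	while (orders) {
		orders >>= 1;
		order++;
	}
	return order;
}

/*
 * Hypothetical model of how the hook's return value becomes an order bitmask:
 *   negative              -> keep the requested @orders unchanged
 *   0 .. highest_order()  -> exactly that order, i.e. BIT(thp_order)
 *   above the highest     -> ASSUMPTION: keep @orders (not shown in the thread)
 */
static unsigned long resolve_orders(int thp_order, unsigned long orders)
{
	if (thp_order < 0 || orders == 0)
		return orders;
	if (thp_order <= highest_order(orders))
		return BIT(thp_order);
	return orders;
}

/*
 * Hypothetical policy for the use case described in the thread: suggest
 * order 0 for low-priority workloads while memory is below the low
 * watermark, otherwise return -1 to leave the requested orders alone.
 */
static int low_prio_policy(bool low_priority, bool below_low_watermark)
{
	if (low_priority && below_low_watermark)
		return 0;
	return -1;
}
```

With orders = BIT(9) | BIT(4), a return of 0 resolves to BIT(0) rather
than to the original mask, which is the point made in the reply: only a
negative return falls back to @orders.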