On Thu, Sep 11, 2025 at 10:42:26PM +0800, Lance Yang wrote: > > > On 2025/9/11 22:02, Lorenzo Stoakes wrote: > > On Wed, Sep 10, 2025 at 08:42:37PM +0800, Lance Yang wrote: > > > Hey Yafang, > > > > > > On Wed, Sep 10, 2025 at 10:53 AM Yafang Shao <laoar.shao@xxxxxxxxx> wrote: > > > > > > > > This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic > > > > THP tuning. It includes a hook bpf_hook_thp_get_order(), allowing BPF > > > > programs to influence THP order selection based on factors such as: > > > > - Workload identity > > > > For example, workloads running in specific containers or cgroups. > > > > - Allocation context > > > > Whether the allocation occurs during a page fault, khugepaged, swap or > > > > other paths. > > > > - VMA's memory advice settings > > > > MADV_HUGEPAGE or MADV_NOHUGEPAGE > > > > - Memory pressure > > > > PSI system data or associated cgroup PSI metrics > > > > > > > > The kernel API of this new BPF hook is as follows, > > > > > > > > /** > > > > * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation > > > > * @vma: vm_area_struct associated with the THP allocation > > > > * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set > > > > * BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if > > > > * neither is set. > > > > * @tva_type: TVA type for current @vma > > > > * @orders: Bitmask of requested THP orders for this allocation > > > > * - PMD-mapped allocation if PMD_ORDER is set > > > > * - mTHP allocation otherwise > > > > * > > > > * Return: The suggested THP order from the BPF program for allocation. It will > > > > * not exceed the highest requested order in @orders. Return -1 to > > > > * indicate that the original requested @orders should remain unchanged. > > > > */ > > > > typedef int thp_order_fn_t(struct vm_area_struct *vma, > > > > enum bpf_thp_vma_type vma_type, > > > > enum tva_type tva_type, > > > > unsigned long orders); > > > > > > > > Only a single BPF program can be attached at any given time, though it can > > > > be dynamically updated to adjust the policy. The implementation supports > > > > anonymous THP, shmem THP, and mTHP, with future extensions planned for > > > > file-backed THP. > > > > > > > > This functionality is only active when system-wide THP is configured to > > > > madvise or always mode. It remains disabled in never mode. Additionally, > > > > if THP is explicitly disabled for a specific task via prctl(), this BPF > > > > functionality will also be unavailable for that task. > > > > > > > > This feature requires CONFIG_BPF_GET_THP_ORDER (marked EXPERIMENTAL) to be > > > > enabled. Note that this capability is currently unstable and may undergo > > > > significant changes—including potential removal—in future kernel versions. > > > > > > > > Suggested-by: David Hildenbrand <david@xxxxxxxxxx> > > > > Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> > > > > Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx> > > > > --- > > > [...] > > > > diff --git a/mm/huge_memory_bpf.c b/mm/huge_memory_bpf.c > > > > new file mode 100644 > > > > index 000000000000..525ee22ab598 > > > > --- /dev/null > > > > +++ b/mm/huge_memory_bpf.c > > > > @@ -0,0 +1,243 @@ > > > > +// SPDX-License-Identifier: GPL-2.0 > > > > +/* > > > > + * BPF-based THP policy management > > > > + * > > > > + * Author: Yafang Shao <laoar.shao@xxxxxxxxx> > > > > + */ > > > > + > > > > +#include <linux/bpf.h> > > > > +#include <linux/btf.h> > > > > +#include <linux/huge_mm.h> > > > > +#include <linux/khugepaged.h> > > > > + > > > > +enum bpf_thp_vma_type { > > > > + BPF_THP_VM_NONE = 0, > > > > + BPF_THP_VM_HUGEPAGE, /* VM_HUGEPAGE */ > > > > + BPF_THP_VM_NOHUGEPAGE, /* VM_NOHUGEPAGE */ > > > > +}; > > > > + > > > > +/** > > > > + * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation > > > > + * @vma: vm_area_struct associated with the THP allocation > > > > + * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set > > > > + * BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if > > > > + * neither is set. > > > > + * @tva_type: TVA type for current @vma > > > > + * @orders: Bitmask of requested THP orders for this allocation > > > > + * - PMD-mapped allocation if PMD_ORDER is set > > > > + * - mTHP allocation otherwise > > > > + * > > > > + * Return: The suggested THP order from the BPF program for allocation. It will > > > > + * not exceed the highest requested order in @orders. Return -1 to > > > > + * indicate that the original requested @orders should remain unchanged. > > > > > > A minor documentation nit: the comment says "Return -1 to indicate that the > > > original requested @orders should remain unchanged". It might be slightly > > > clearer to say "Return a negative value to fall back to the original > > > behavior". This would cover all error codes as well ;) > > > > > > > + */ > > > > +typedef int thp_order_fn_t(struct vm_area_struct *vma, > > > > + enum bpf_thp_vma_type vma_type, > > > > + enum tva_type tva_type, > > > > + unsigned long orders); > > > > > > Sorry if I'm missing some context here since I haven't tracked the whole > > > series closely. > > > > > > Regarding the return value for thp_order_fn_t: right now it returns a > > > single int order. I was thinking, what if we let it return an unsigned > > > long bitmask of orders instead? This seems like it would be more flexible > > > down the road, especially if we get more mTHP sizes to choose from. It > > > would also make the API more consistent, as bpf_hook_thp_get_orders() > > > itself returns an unsigned long ;) > > > > I think that adds confusion - as in how an order might be chosen from > > those. Also we have _received_ a bitmap of available orders - and the intent > > here is to select _which one we should use_. > > Yep. Makes sense to me ;) Thanks :) > > > > > And this is an experimental feature, behind a flag explicitly labelled as > > experimental (and thus subject to change) so if we found we needed to change > > things in the future we can. > > You're right, I didn't pay enough attention to the fact that this is > an experimental feature. So my suggestions were based on a lack of > context ... It's fine, don't worry :) these are sensible suggestions - it to me highlights that we haven't been clear enough perhaps. > > > > > > > > > Also, for future extensions, it might be a good idea to add a reserved > > > flags argument to the thp_order_fn_t signature. > > > > We don't need to do anything like this, as we are behind an experimental flag > > and in no way guarantee that this will be used this way going forwards. > > > > > > For example thp_order_fn_t(..., unsigned long flags). > > > > > > This would give us aforward-compatible way to add new semantics later > > > without breaking the ABI and needing a v2. We could just require it to be > > > 0 for now. > > > > There is no ABI. > > > > I mean again to emphasise, this is an _experimental_ feature not to be relied > > upon in production. > > > > > > > > Thanks for the great work! > > > Lance > > > > Perhaps we need to put a 'EXPERIMENTAL_' prefix on the config flag too to really > > bring this home, as it's perhaps not all that clear :) > > No need for a 'EXPERIMENTAL_' prefix, it was just me missing > the background. Appreciate you clarifying this! Don't worry about it, but also it suggests that we probably need to be ultra-super clear to users in general. So I think an _EXPERIMENTAL suffix is probably pretty valid here just to _hammer home_ that - hey - we might break you! :) > > Cheers, > Lance > Cheers, Lorenzo