Re: [PATCH v7 mm-new 02/10] mm: thp: add support for BPF based THP order selection

On Wed, Sep 10, 2025 at 9:57 PM Lance Yang <lance.yang@xxxxxxxxx> wrote:
>
>
>
> On 2025/9/10 20:54, Lance Yang wrote:
> > On Wed, Sep 10, 2025 at 8:42 PM Lance Yang <lance.yang@xxxxxxxxx> wrote:
> >>
> >> Hey Yafang,
> >>
> >> On Wed, Sep 10, 2025 at 10:53 AM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> >>>
> >>> This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
> >>> THP tuning. It includes a hook bpf_hook_thp_get_order(), allowing BPF
> >>> programs to influence THP order selection based on factors such as:
> >>> - Workload identity
> >>>    For example, workloads running in specific containers or cgroups.
> >>> - Allocation context
> >>>    Whether the allocation occurs during a page fault, khugepaged, swap or
> >>>    other paths.
> >>> - VMA's memory advice settings
> >>>    MADV_HUGEPAGE or MADV_NOHUGEPAGE
> >>> - Memory pressure
> >>>    PSI system data or associated cgroup PSI metrics
> >>>
> >>> The kernel API of this new BPF hook is as follows,
> >>>
> >>> /**
> >>>   * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
> >>>   * @vma: vm_area_struct associated with the THP allocation
> >>>   * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set
> >>>   *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
> >>>   *            neither is set.
> >>>   * @tva_type: TVA type for current @vma
> >>>   * @orders: Bitmask of requested THP orders for this allocation
> >>>   *          - PMD-mapped allocation if PMD_ORDER is set
> >>>   *          - mTHP allocation otherwise
> >>>   *
> >>>   * Return: The suggested THP order from the BPF program for allocation. It will
> >>>   *         not exceed the highest requested order in @orders. Return -1 to
> >>>   *         indicate that the original requested @orders should remain unchanged.
> >>>   */
> >>> typedef int thp_order_fn_t(struct vm_area_struct *vma,
> >>>                             enum bpf_thp_vma_type vma_type,
> >>>                             enum tva_type tva_type,
> >>>                             unsigned long orders);
> >>>
> >>> Only a single BPF program can be attached at any given time, though it can
> >>> be dynamically updated to adjust the policy. The implementation supports
> >>> anonymous THP, shmem THP, and mTHP, with future extensions planned for
> >>> file-backed THP.
> >>>
> >>> This functionality is only active when system-wide THP is configured to
> >>> madvise or always mode. It remains disabled in never mode. Additionally,
> >>> if THP is explicitly disabled for a specific task via prctl(), this BPF
> >>> functionality will also be unavailable for that task.
> >>>
> >>> This feature requires CONFIG_BPF_GET_THP_ORDER (marked EXPERIMENTAL) to be
> >>> enabled. Note that this capability is currently unstable and may undergo
> >>> significant changes—including potential removal—in future kernel versions.
> >>>
> >>> Suggested-by: David Hildenbrand <david@xxxxxxxxxx>
> >>> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>
> >>> Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx>
> >>> ---
> >> [...]
> >>> diff --git a/mm/huge_memory_bpf.c b/mm/huge_memory_bpf.c
> >>> new file mode 100644
> >>> index 000000000000..525ee22ab598
> >>> --- /dev/null
> >>> +++ b/mm/huge_memory_bpf.c
> >>> @@ -0,0 +1,243 @@
> >>> +// SPDX-License-Identifier: GPL-2.0
> >>> +/*
> >>> + * BPF-based THP policy management
> >>> + *
> >>> + * Author: Yafang Shao <laoar.shao@xxxxxxxxx>
> >>> + */
> >>> +
> >>> +#include <linux/bpf.h>
> >>> +#include <linux/btf.h>
> >>> +#include <linux/huge_mm.h>
> >>> +#include <linux/khugepaged.h>
> >>> +
> >>> +enum bpf_thp_vma_type {
> >>> +       BPF_THP_VM_NONE = 0,
> >>> +       BPF_THP_VM_HUGEPAGE,    /* VM_HUGEPAGE */
> >>> +       BPF_THP_VM_NOHUGEPAGE,  /* VM_NOHUGEPAGE */
> >>> +};
> >>> +
> >>> +/**
> >>> + * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
> >>> + * @vma: vm_area_struct associated with the THP allocation
> >>> + * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set
> >>> + *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
> >>> + *            neither is set.
> >>> + * @tva_type: TVA type for current @vma
> >>> + * @orders: Bitmask of requested THP orders for this allocation
> >>> + *          - PMD-mapped allocation if PMD_ORDER is set
> >>> + *          - mTHP allocation otherwise
> >>> + *
> >>> + * Return: The suggested THP order from the BPF program for allocation. It will
> >>> + *         not exceed the highest requested order in @orders. Return -1 to
> >>> + *         indicate that the original requested @orders should remain unchanged.
> >>
> >> A minor documentation nit: the comment says "Return -1 to indicate that the
> >> original requested @orders should remain unchanged". It might be slightly
> >> clearer to say "Return a negative value to fall back to the original
> >> behavior". This would cover all error codes as well ;)

Will change it.

> >>
> >>> + */
> >>> +typedef int thp_order_fn_t(struct vm_area_struct *vma,
> >>> +                          enum bpf_thp_vma_type vma_type,
> >>> +                          enum tva_type tva_type,
> >>> +                          unsigned long orders);
> >>
> >> Sorry if I'm missing some context here since I haven't tracked the whole
> >> series closely.
> >>
> >> Regarding the return value for thp_order_fn_t: right now it returns a
> >> single int order. I was thinking, what if we let it return an unsigned
> >> long bitmask of orders instead? This seems like it would be more flexible
> >> down the road, especially if we get more mTHP sizes to choose from. It
> >> would also make the API more consistent, as bpf_hook_thp_get_orders()
> >> itself returns an unsigned long ;)
> >
> > I just realized a flaw in my previous suggestion :(
> >
> > I suggested changing the return type of thp_order_fn_t to unsigned long for
> > consistency and flexibility. However, I completely overlooked that this
> > would prevent the BPF program from returning negative error codes ...
> >
> > Thanks,
> > Lance
> >
> >>
> >> Also, for future extensions, it might be a good idea to add a reserved
> >> flags argument to the thp_order_fn_t signature.
> >>
> >> For example thp_order_fn_t(..., unsigned long flags).
> >>
> >> This would give us a forward-compatible way to add new semantics later
> >> without breaking the ABI and needing a v2. We could just require it to be
> >> 0 for now.

That makes sense. However, as Lorenzo mentioned previously, we should
keep the interface as minimal as possible.

> >>
> >> Thanks for the great work!
> >> Lance
>
>
> Forgot to add:
>
> Noticed that if the hook returns 0, bpf_hook_thp_get_orders() falls
> back to 'orders', preventing us from dynamically disabling mTHP
> allocations.

Could you please clarify what you mean by that?

+       thp_order = bpf_hook_thp_get_order(vma, vma_type, tva_type, orders);
+       if (thp_order < 0)
+               goto out;

In my implementation, it only falls back to @orders if the return
value is negative. If the return value is 0, it uses BIT(0):

+       if (thp_order <= highest_order(orders))
+               thp_orders = BIT(thp_order);
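
To make the mapping concrete, here is a minimal userspace sketch of it, not the kernel code itself. BIT() and highest_order() are re-implemented as stand-ins for the kernel helpers, and select_orders() is a hypothetical name. The case where the returned order exceeds the highest requested order is not shown in the lines quoted above; the sketch assumes @orders is left unchanged there.

```c
#include <stdio.h>

#define BIT(n) (1UL << (n))

/* Userspace stand-in for highest_order(): index of the top set bit,
 * or -1 when no order is requested. */
static int highest_order(unsigned long orders)
{
	int order = -1;

	while (orders) {
		order++;
		orders >>= 1;
	}
	return order;
}

/* hook_ret models the value returned by the BPF hook (hypothetical wrapper). */
static unsigned long select_orders(int hook_ret, unsigned long orders)
{
	/* Negative return: keep the originally requested orders. */
	if (hook_ret < 0)
		return orders;

	/* In-range return, including 0: use exactly that order. */
	if (hook_ret <= highest_order(orders))
		return BIT(hook_ret);

	/* Assumption: an out-of-range order leaves @orders unchanged. */
	return orders;
}
```

With orders = BIT(9) | BIT(4) | BIT(2), a hook return of 0 gives BIT(0) and a return of -1 gives the original mask back.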

>
> Honoring a return of 0 is critical for our use case, which is to
> dynamically disable mTHP for low-priority containers when memory gets
> low in mixed workloads.
>
> And then re-enable it for them when memory is back above the low
> watermark.

Thank you for detailing your use case; that context is very helpful.
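
If I read the use case correctly, the policy decision would look roughly like the sketch below. This is plain C that models the decision only, not an actual struct_ops program: struct policy_ctx, its fields, and policy_get_order() are hypothetical names, and how a BPF program would obtain the priority and memory state is left out. It also assumes that selecting order 0 is what "disabling mTHP" means for that allocation.

```c
#include <stdbool.h>

/* Hypothetical inputs; a real BPF program would have to derive these,
 * e.g. from the task's cgroup and from PSI or watermark data. */
struct policy_ctx {
	bool low_priority;	/* allocation belongs to a low-priority container */
	bool memory_low;	/* free memory is below the low watermark */
};

/* Returns 0 to request order 0, or -1 to keep the requested orders. */
static int policy_get_order(const struct policy_ctx *ctx)
{
	if (ctx->low_priority && ctx->memory_low)
		return 0;

	/* Everything else falls back to the original @orders. */
	return -1;
}
```

Once memory is back above the low watermark, memory_low becomes false and the same program returns -1 again, so the low-priority containers get their requested orders back without the program being reattached.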

-- 
Regards
Yafang




