Re: [PATCH v7 mm-new 07/10] selftests/bpf: add a simple BPF based THP policy

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Wed, 10 Sep 2025 13:44:39 -0700

On Tue, Sep 9, 2025 at 7:46 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
>
> +/* Detecting whether a task can successfully allocate THP is unreliable because
> + * it may be influenced by system memory pressure. Instead of making the result
> + * dependent on unpredictable factors, we should simply check
> + * bpf_hook_thp_get_orders()'s return value, which is deterministic.
> + */
> +SEC("fexit/bpf_hook_thp_get_orders")
> +int BPF_PROG(thp_run, struct vm_area_struct *vma, u64 vma_flags, enum tva_type tva_type,
> +            unsigned long orders, int retval)
> +{

...

> +SEC("struct_ops/thp_get_order")
> +int BPF_PROG(alloc_in_khugepaged, struct vm_area_struct *vma, enum bpf_thp_vma_type vma_type,
> +            enum tva_type tva_type, unsigned long orders)
> +{

This is a bad idea to mix struct_ops logic with fentry/fexit style.
struct_ops hook will not be affected by compiler optimizations,
while fentry depends on a whim of compilers.
struct_ops can be scoped, while fentry is always global.
sched-ext already struggles with the later, since some scheds
need tracing data from other parts of the kernel and they cannot
be grouped together. All sorts of workarounds were proposed, but
no good solution in sight. So don't go this route for THP.
Make everything you need to be struct_ops based and/or pass
whatever extra data into these ops.

Also think of scoping for bpf-thp from the start.
Currently st_ops/thp_get_order is only one and it's global.
It's ok for prototypes and experiments, but not ok for landing upstream.
I think cgroup would a natural scope and different cgroups might
want their own bpf based THP hints. Once you do that, think through
how delegation of suggested order will propagate through hierarchy.

bpf-oom seems to be aligning toward the same design principles,
so don't reinvent the wheel.