Re: [PATCH v7 mm-new 07/10] selftests/bpf: add a simple BPF based THP policy

Yafang Shao <laoar.shao@xxxxxxxxx> · Thu, 11 Sep 2025 10:31:02 +0800

On Thu, Sep 11, 2025 at 4:44 AM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Tue, Sep 9, 2025 at 7:46 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> >
> > +/* Detecting whether a task can successfully allocate THP is unreliable because
> > + * it may be influenced by system memory pressure. Instead of making the result
> > + * dependent on unpredictable factors, we should simply check
> > + * bpf_hook_thp_get_orders()'s return value, which is deterministic.
> > + */
> > +SEC("fexit/bpf_hook_thp_get_orders")
> > +int BPF_PROG(thp_run, struct vm_area_struct *vma, u64 vma_flags, enum tva_type tva_type,
> > +            unsigned long orders, int retval)
> > +{
>
> ...
>
> > +SEC("struct_ops/thp_get_order")
> > +int BPF_PROG(alloc_in_khugepaged, struct vm_area_struct *vma, enum bpf_thp_vma_type vma_type,
> > +            enum tva_type tva_type, unsigned long orders)
> > +{
>
> This is a bad idea to mix struct_ops logic with fentry/fexit style.
> struct_ops hook will not be affected by compiler optimizations,
> while fentry depends on a whim of compilers.
> struct_ops can be scoped, while fentry is always global.
> sched-ext already struggles with the later, since some scheds
> need tracing data from other parts of the kernel and they cannot
> be grouped together. All sorts of workarounds were proposed, but
> no good solution in sight. So don't go this route for THP.
> Make everything you need to be struct_ops based and/or pass
> whatever extra data into these ops.

will change it.

>
> Also think of scoping for bpf-thp from the start.
> Currently st_ops/thp_get_order is only one and it's global.
> It's ok for prototypes and experiments, but not ok for landing upstream.
> I think cgroup would a natural scope and different cgroups might
> want their own bpf based THP hints. Once you do that, think through
> how delegation of suggested order will propagate through hierarchy.

+ Tejun

As Johannes Weiner previously explained [[0]], cgroups are designed as
nested hierarchies for partitioning resources. They are a poor fit for
enforcing arbitrary, non-hierarchical policies.

: Cgroups are for nested trees dividing up resources. They're not a good
: fit for arbitrary, non-hierarchical policy settings.

[0] https://lore.kernel.org/linux-mm/20250430175954.GD2020@xxxxxxxxxxx/

The THP  policy is a quintessential example of such an arbitrary
setting. Even within a single cgroup, it is often necessary to enable
THP for performance-critical tasks while disabling it for others to
avoid latency spikes. Implementing this policy through a cgroup
interface that propagates hierarchically would eliminate the crucial
ability to configure it on a per-task basis.

While the bpf-thp mechanism has a global scope, this does not limit
its application to a single system-wide policy. In contrast to a
hierarchical cgroup-based setting, bpf-thp offers the flexibility to
set policies per-task, per-cgroup, or globally. Fundamentally, it is a
more powerful variant of prctl(), not a variant of cgroup interface
file.

>
> bpf-oom seems to be aligning toward the same design principles,
> so don't reinvent the wheel.

Since bpf-oom's role is to select a task to kill from within **a
defined group of tasks**, it is inherently well-suited for
cgroup-based management.

-- 
Regards
Yafang