On 17.07.25 05:09, Yafang Shao wrote:
On Wed, Jul 16, 2025 at 6:42 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
On 08.06.25 09:35, Yafang Shao wrote:
Sorry for not replying earlier, I was caught up with all other stuff.
I still consider this a very interesting approach, although I think we
should think more about what a reasonable policy would look like
medium-term (in particular, multiple THP sizes, and not always falling
back to small pages if that means splitting excessively in the buddy
allocator, etc.)
I find it difficult to understand why we introduced the mTHP sysfs
knobs instead of implementing automatic THP size switching within the
kernel. I'm skeptical about their practical utility in real-world
workloads.
In contrast, XFS large folios (a.k.a. file THP) can automatically select
orders between 0 and 9. Based on our verification, this feature has
proven genuinely useful for certain workloads, though it's not yet
perfect.
I suggest you do some digging about the history of these toggles and the
plans for the future (automatic), there has been plenty of talk about
all that.
[...]
- THP allocator
int (*allocator)(unsigned long vm_flags, unsigned long tva_flags);
The BPF program returns either THP_ALLOC_CURRENT or THP_ALLOC_KHUGEPAGED,
indicating whether THP allocation should be performed synchronously
(current task) or asynchronously (khugepaged).
The decision is based on the current task context, VMA flags, and TVA
flags.
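To make the proposed interface concrete, here is a minimal user-space sketch of what such an allocator policy could look like. Everything here is hypothetical: the enum values, the flag encodings (`VM_HUGEPAGE_HINT`, `TVA_IN_PF`), and the decision rule are illustrative stand-ins, not the real kernel definitions.

```c
#include <assert.h>

/* Hypothetical return values mirroring the proposed interface. */
enum thp_alloc_mode {
	THP_ALLOC_CURRENT,    /* allocate synchronously in the faulting task */
	THP_ALLOC_KHUGEPAGED, /* defer the collapse to khugepaged */
};

/* Illustrative flag bits; the real vm_flags/tva_flags encodings differ. */
#define VM_HUGEPAGE_HINT  0x1UL  /* e.g. the VMA was madvise(MADV_HUGEPAGE)'d */
#define TVA_IN_PF         0x2UL  /* we are in the page-fault path */

/*
 * Toy policy: only pay the synchronous allocation cost when the
 * application explicitly asked for huge pages and we are in a fault;
 * otherwise let khugepaged collapse the range in the background.
 */
static int thp_allocator(unsigned long vm_flags, unsigned long tva_flags)
{
	if ((vm_flags & VM_HUGEPAGE_HINT) && (tva_flags & TVA_IN_PF))
		return THP_ALLOC_CURRENT;
	return THP_ALLOC_KHUGEPAGED;
}
```

A real BPF program would additionally consult task context (cgroup, latency class) before choosing.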
I think we should go one step further and actually get advice about the
orders (THP sizes) to use. It might be helpful if the program had access
to system stats, to make an educated decision.
Given page fault information and system information, the program could
then decide which orders to try to allocate.
Yes, that aligns with my thoughts as well. For instance, we could
automate the decision-making process based on factors like PSI, memory
fragmentation, and other metrics. However, this logic could be
implemented within BPF programs; all we'd need is to extend the feature
by introducing a few kfuncs (also known as BPF helpers).
We discussed this yesterday at a THP upstream meeting, and what we
should look into is:
(1) Having a callback like
unsigned int (*get_suggested_order)(.., bool in_pagefault);
Where we can provide some information about the fault (vma
size/flags/anon_name), and whether we are in the page fault (or in
khugepaged).
Maybe we want a bitmap of orders to try (fallback), not sure yet.
(2) Having some way to tag these callbacks as "this is absolutely
unstable for now and can be changed as we please".
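The "bitmap of orders" idea in (1) could look something like the sketch below. All of it is speculative: the signature, the `PMD_ORDER` constant, and the rule for which orders to offer are assumptions made for illustration, matching only the callback shape floated above.

```c
#include <assert.h>
#include <stdbool.h>

#define PMD_ORDER 9	/* 2 MiB THP with 4 KiB base pages */

/*
 * Hypothetical callback: instead of a single order, return a bitmap of
 * orders the caller may try, where bit n set means "order n is worth
 * trying". The caller walks it from highest to lowest as a fallback
 * chain.
 */
static unsigned int get_suggested_order(unsigned long vma_size_pages,
					bool in_pagefault)
{
	unsigned int orders = 1u << 0;	/* order-0 is always a valid fallback */
	unsigned int order;

	if (!in_pagefault)		/* khugepaged: only full PMDs are useful */
		return 1u << PMD_ORDER;

	/* In a fault, offer every order that still fits the VMA. */
	for (order = 1; order <= PMD_ORDER; order++)
		if ((1UL << order) <= vma_size_pages)
			orders |= 1u << order;
	return orders;
}
```

A policy program could further mask out high orders here when PSI or fragmentation metrics look bad, which is exactly where the kfuncs mentioned above would come in.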
One idea would be to use this mechanism as a way to easily prototype
policies and, once we know that a policy works, start moving it into the
core.
In general, the core, without a BPF program, should be able to continue
providing a sane default behavior.
That means one would query, during page faults and in khugepaged, which
order to try -- compared to our current approach of "start with the
largest order that is enabled and fits".
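The difference between the two approaches can be sketched as a fallback loop on the caller's side. This is a toy model, not kernel code: `try_alloc()` stands in for the actual folio allocation, and `fault_pick_order()` shows how a fault path might consume the suggested-order bitmap.

```c
#include <assert.h>

/* Stand-in for folio allocation: succeeds up to max_ok_order. */
static int try_alloc(int order, int max_ok_order)
{
	return order <= max_ok_order;
}

/*
 * Walk a bitmap of suggested orders from highest to lowest and return
 * the first order whose allocation succeeds, or -1 if nothing (not
 * even order-0) is allocatable.
 */
static int fault_pick_order(unsigned int suggested_orders, int max_ok_order)
{
	int order;

	for (order = 31; order >= 0; order--) {
		if (!(suggested_orders & (1u << order)))
			continue;
		if (try_alloc(order, max_ok_order))
			return order;
	}
	return -1;
}
```

The current kernel behavior corresponds to a bitmap with every enabled-and-fitting order set; a BPF policy would simply hand back a narrower bitmap.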
- THP reclaimer
int (*reclaimer)(bool vma_madvised);
The BPF program returns either RECLAIMER_CURRENT or RECLAIMER_KSWAPD,
determining whether memory reclamation is handled by the current task or
kswapd.
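For completeness, the proposed reclaimer hook is even simpler to sketch. Again, the enum names mirror the constants quoted above, but the policy body is a made-up example.

```c
#include <assert.h>
#include <stdbool.h>

enum thp_reclaim_mode {
	RECLAIMER_CURRENT, /* reclaim/compact synchronously in the caller */
	RECLAIMER_KSWAPD,  /* wake kswapd and let it do the work */
};

/*
 * Toy policy: only madvise(MADV_HUGEPAGE)'d VMAs justify stalling the
 * caller on direct reclaim; everything else defers to kswapd.
 */
static int thp_reclaimer(bool vma_madvised)
{
	return vma_madvised ? RECLAIMER_CURRENT : RECLAIMER_KSWAPD;
}
```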
Not sure about that, will have to look into the details.
Some workloads allocate all their memory during initialization and do
not require THP at runtime. For such cases, aggressively attempting
THP allocation is beneficial. However, other workloads may dynamically
allocate THP during execution; if these are latency-sensitive, we must
avoid introducing long allocation delays.
Given these differing requirements, the global
/sys/kernel/mm/transparent_hugepage/defrag setting is insufficient.
Instead, we should implement per-workload defrag policies to better
optimize performance based on individual application behavior.
We'll be very careful about the callbacks we will offer. Maybe the
get_suggested_order() callback could itself make a decision and not
suggest a high order if the allocation would require compaction.
Initially, we should keep it simple and see what other callbacks to add
/ how to extend get_suggested_order(), to cover these cases.
--
Cheers,
David / dhildenb