On Thu, Jul 17, 2025 at 10:52:12AM +0200, David Hildenbrand wrote:
> On 17.07.25 05:09, Yafang Shao wrote:
> > On Wed, Jul 16, 2025 at 6:42 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
> > > >
> > > > - THP allocator
> > > >
> > > >    int (*allocator)(unsigned long vm_flags, unsigned long tva_flags);
> > > >
> > > >    The BPF program returns either THP_ALLOC_CURRENT or THP_ALLOC_KHUGEPAGED,
> > > >    indicating whether THP allocation should be performed synchronously
> > > >    (current task) or asynchronously (khugepaged).
> > > >
> > > >    The decision is based on the current task context, VMA flags, and TVA
> > > >    flags.
> > >
> > > I think we should go one step further and actually get advice about the
> > > orders (THP sizes) to use. It might be helpful if the program would have
> > > access to system stats, to make an educated decision.
> > >
> > > Given page fault information and system information, the program could
> > > then decide which orders to try to allocate.
> >
> > Yes, that aligns with my thoughts as well. For instance, we could
> > automate the decision-making process based on factors like PSI, memory
> > fragmentation, and other metrics. However, this logic could be
> > implemented within BPF programs—all we’d need is to extend the feature
> > by introducing a few kfuncs (also known as BPF helpers).
>
> We discussed this yesterday at a THP upstream meeting, and what we should
> look into is:
>
> (1) Having a callback like
>
>     unsigned int (*get_suggested_order)(.., bool in_pagefault);
>
>     Where we can provide some information about the fault (vma
>     size/flags/anon_name), and whether we are in the page fault (or in
>     khugepaged).
>
>     Maybe we want a bitmap of orders to try (fallback), not sure yet.

Ah I mentioned fallback below then noticed you mentioned here :)

>
> (2) Having some way to tag these callbacks as "this is absolutely unstable
>     for now and can be changed as we please.".
>
> One idea will be to use this mechanism as a way to easily prototype
> policies, and once we know that a policy works, start moving it into the
> core.
>
> In general, the core, without a BPF program, should be able to continue
> providing a sane default behavior.

I have warmed to this approach overall, and I think one clearly positive thing
that came out of the call was the idea that we can rapidly prototype different
ideas.

I think a key to all this is ensuring that we:

- Mark this interface very clearly unstable to begin with.
- Keep the interface as simple as possible.

I think perhaps the more challenging thing here will be providing the right
amount of information to the caller to make decisions.

There is also the question of precisely how we use the result - obviously we
need to be _trying_ to allocate at the requested order and, should that fail,
allocate something smaller, but precisely how we do the fallback is something
to think about.

I think generally this is the current best way forward before an automagic
world... which is a very long-term project.

>
> >
> > >
> > > That means, one would query during page faults and during khugepaged,
> > > which order one should try -- compared to our current approach of "start
> > > with the largest order that is enabled and fits".
> > >
> > > >
> > > > - THP reclaimer
> > > >
> > > >    int (*reclaimer)(bool vma_madvised);
> > > >
> > > >    The BPF program returns either RECLAIMER_CURRENT or RECLAIMER_KSWAPD,
> > > >    determining whether memory reclamation is handled by the current task or
> > > >    kswapd.
> > >
> > > Not sure about that, will have to look into the details.
> >
> > Some workloads allocate all their memory during initialization and do
> > not require THP at runtime. For such cases, aggressively attempting
> > THP allocation is beneficial. However, other workloads may dynamically
> > allocate THP during execution—if these are latency-sensitive, we must
> > avoid introducing long allocation delays.
> >
> > Given these differing requirements, the global
> > /sys/kernel/mm/transparent_hugepage/defrag setting is insufficient.
> > Instead, we should implement per-workload defrag policies to better
> > optimize performance based on individual application behavior.
>
> We'll be very careful about the callbacks we will offer. Maybe the
> get_suggested_order() callback could itself make a decision and not suggest
> a high order if allocation would require compaction.
>
> Initially, we should keep it simple and see what other callbacks to add /
> how to extend get_suggested_order(), to cover these cases.

Yes, caution is vital here.

>
> --
> Cheers,
>
> David / dhildenb
>
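
To make the fallback question a bit more concrete, here is a rough userspace
sketch of the "bitmap of orders to try" variant. It is only an illustration of
the idea, not a proposal for the actual interface: every name in it
(suggest_orders_fn, try_alloc_fn, alloc_with_fallback, SKETCH_MAX_ORDER, the
example policy and the stub allocators) is hypothetical, and returning 0 when
no policy is attached is just a stand-in for "core default behavior".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cap on the orders this sketch considers. */
#define SKETCH_MAX_ORDER 9

/* Hypothetical policy hook: bit N set in the result means "try order N". */
typedef unsigned long (*suggest_orders_fn)(unsigned long vma_size,
					   bool in_pagefault);

/* Hypothetical allocation hook: true if an allocation of 'order' succeeds. */
typedef bool (*try_alloc_fn)(unsigned int order);

/*
 * Walk the suggested orders from largest to smallest and return the first
 * order that could be allocated. Returns 0 (base page) if no policy is
 * attached or every suggested order failed.
 */
static unsigned int alloc_with_fallback(suggest_orders_fn policy,
					try_alloc_fn try_alloc,
					unsigned long vma_size,
					bool in_pagefault)
{
	unsigned long orders;
	int order;

	if (!policy)
		return 0; /* placeholder for the core's default behavior */

	orders = policy(vma_size, in_pagefault);

	for (order = SKETCH_MAX_ORDER; order > 0; order--) {
		if (!(orders & (1UL << order)))
			continue; /* policy did not suggest this order */
		if (try_alloc((unsigned int)order))
			return (unsigned int)order;
	}

	return 0;
}

/*
 * Example policy (assumption): suggest only orders 4 and 2 in the fault
 * path to bound latency, and additionally order 9 from khugepaged.
 */
static unsigned long example_policy(unsigned long vma_size, bool in_pagefault)
{
	unsigned long orders = (1UL << 4) | (1UL << 2);

	(void)vma_size; /* unused in this sketch */
	if (!in_pagefault)
		orders |= 1UL << 9;
	return orders;
}

/* Stub allocator: pretend memory is fragmented, orders above 2 fail. */
static bool fragmented_alloc(unsigned int order)
{
	return order <= 2;
}

/* Stub allocator: pretend every order can be satisfied. */
static bool plentiful_alloc(unsigned int order)
{
	(void)order;
	return true;
}
```

With the fragmented stub, a fault-path call falls back from order 4 to order
2; with the plentiful stub it gets order 4 in the fault path and order 9 from
khugepaged. Whether the fallback walk should live in the core like this, or
whether the program should be asked again after each failure, is exactly the
open question.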