On Tue, May 20, 2025 at 08:06:21PM +0800, Yafang Shao wrote:
> On Tue, May 20, 2025 at 5:49 PM Lorenzo Stoakes
> <lorenzo.stoakes@xxxxxxxxxx> wrote:
> >
> > On Tue, May 20, 2025 at 11:43:11AM +0200, David Hildenbrand wrote:
> > > > Conclusion
> > > > ----------
> > > >
> > > > Introducing a new "bpf" mode for BPF-based per-task THP adjustments is the
> > > > most effective solution for our requirements. This approach represents a
> > > > small but meaningful step toward making THP truly usable - and manageable - in
> > > > production environments.
> > > A new "bpf" mode sounds way too special.
> > >
> > > We currently have:
> > >
> > > never   -> never
> > > madvise -> MADV_HUGEPAGE, except PR_SET_THP_DISABLE
> > > always  -> always, except PR_SET_THP_DISABLE and MADV_NOHUGEPAGE
> > >
> > > Whatever new mode we add, it should honor PR_SET_THP_DISABLE +
> > > MADV_NOHUGEPAGE.
> > >
> > > So, if we want another way to enable things, it would live between "never"
> > > and "madvise".
> > >
> > > I'm wondering how we could make that generic: likely we want this new
> > > mechanism to *not* be triggerable by the process itself (madvise).
> > >
> > > I am not convinced bpf is the answer here ...
> >
> > Agreed.
> >
> > I am also very concerned with us inserting BPF bits here - are we not then
> > ensuring that we cannot in any way move towards a future where we
> > 'automagically' determine what to do?
> >
> > I don't know what is claimed about BPF, but it strikes me that we're
> > establishing a permanent uABI (uAPI?) if we do that and essentially
> > promising that THP will continue to operate in a fashion similar to how it
> > does now.
> >
> > While BPF is a wonderful technology, I think we have to be very very careful
> > about inserting it in places that consist of -implementation details- that
> > we in mm already are planning to move away from.
> >
> > It's one thing adding BPF in the OOM killer (simple interface, unlikely to
> > change, doesn't really constrain us) or the scheduler (again the hooks are
> > by nature reasonably stable), it's quite another sticking it in the heart
> > of a part of mm that is undergoing _constant_ change, partly as evidenced
> > by the sheer number of series related to THP that are currently on-list.
> >
> > So while BPF may be the best solution for your needs _right now_, we need
> > to be concerned with how things affect the kernel in the future.
> >
> > I think we really do have to tread very carefully here.
>
> I totally agree with you that the key point here is how to define the
> API. As I replied to David, I believe we have two fundamental
> principles to adjust the THP policies:
> 1. Selective Benefit: Some tasks benefit from THP, while others do not.
> 2. Conditional Safety: THP allocation is safe under certain conditions
>    but not others.
>
> Therefore, I believe we can define these APIs based on the established
> principles - everything else constitutes implementation details, even
> if core MM internals need to change.

But if we're looking to make the concept of THP go away, we really need to
go further than this.

The second we have 'bpf program that figures out whether THP should be
used' we are permanently tied to the idea of THP on/off being a thing.

I mean any future stuff that makes THP more automagic will probably involve
having new modes for the legacy THP
/sys/kernel/mm/transparent_hugepage/enabled and
/sys/kernel/mm/transparent_hugepage/hugepages-xxkB/enabled

But if people are super reliant on this stuff it's potentially really
limiting.

I think you said in another post here that you were toying with the notion
of somehow exposing the madvise() interface and having that be the 'stable
API' of sorts?

That definitely sounds more sensible than something that very explicitly
interacts with THP.
Of course we have Usama's series and my proposed series for extending
process_madvise() along those lines also.

>
> --
> Regards
> Yafang
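
For reference, the mode table David quoted earlier in this thread can be read
as a small decision function. The following is only a minimal sketch of that
table in Python, not kernel code: the names `Advice` and `thp_allowed` are
made up for illustration, and it deliberately ignores everything the table
does not mention (per-size hugepages-xxkB knobs, khugepaged, alignment and so
on).

```python
from enum import Enum


class Advice(Enum):
    """Per-range madvise() state (hypothetical model, not a kernel type)."""
    NONE = "none"
    HUGEPAGE = "MADV_HUGEPAGE"
    NOHUGEPAGE = "MADV_NOHUGEPAGE"


def thp_allowed(mode: str, advice: Advice = Advice.NONE,
                thp_disabled: bool = False) -> bool:
    """Model of the quoted table.

    mode         -- global setting: "never", "madvise" or "always"
    advice       -- madvise() hint applied to the range, if any
    thp_disabled -- True if the process set PR_SET_THP_DISABLE
    """
    if mode == "never":
        # never -> never
        return False
    if thp_disabled:
        # Both remaining modes honor PR_SET_THP_DISABLE.
        return False
    if mode == "madvise":
        # madvise -> only ranges marked MADV_HUGEPAGE
        return advice is Advice.HUGEPAGE
    if mode == "always":
        # always -> everything except ranges marked MADV_NOHUGEPAGE
        return advice is not Advice.NOHUGEPAGE
    raise ValueError(f"unknown mode: {mode!r}")
```

Read this way, the point made in the thread is that any new mode would need
to keep the `thp_disabled` and `MADV_NOHUGEPAGE` checks intact, and would sit
between the "never" and "madvise" branches.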