On Tue, May 20, 2025 at 5:44 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> > Conclusion
> > ----------
> >
> > Introducing a new "bpf" mode for BPF-based per-task THP adjustments is the
> > most effective solution for our requirements. This approach represents a
> > small but meaningful step toward making THP truly usable and manageable in
> > production environments.
>
> A new "bpf" mode sounds way too special.

Alternatively, we could simply hook 'madvise' to define a BPF-based policy.

>
> We currently have:
>
> never -> never
> madvise -> MADV_HUGEPAGE, except PR_SET_THP_DISABLE
> always -> always, except PR_SET_THP_DISABLE and MADV_NOHUGEPAGE

If BPF had been invented before THP, we likely would have only three
modes, without PR_SET_THP_DISABLE, MADV_NOHUGEPAGE, or MADV_HUGEPAGE ;-)

never  -> never
user   -> user-defined per-task or per-vma THP mode selector, based on BPF.
          We can select "never" or "always" for a specific task or vma.
          The API is as follows:
              bpf->per_task_mode_selector(task);
              bpf->per_vma_mode_selector(vma);
always -> always

However, it's not too late to introduce a new BPF-based mode for THP,
especially since future adjustments to THP policies are still expected.

Regardless of the specific policy, two fundamental principles apply:

1. Selective Benefit: Some tasks benefit from THP, while others do not.
2. Conditional Safety: THP allocation is safe under certain conditions
   but not others.

Given these constraints, we could abstract stable APIs that allow users
to define custom THP policies tailored to their needs.

>
> Whatever new mode we add, it should honor PR_SET_THP_DISABLE +
> MADV_NOHUGEPAGE.

Yes, BPF only selects different THP modes for different tasks; nothing
else will change.

>
> So, if we want another way to enable things, it would live between
> "never" and "madvise".

Yes. BPF only selects the appropriate THP mode for each task and
modifies nothing else.

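To make the selector idea a bit more concrete, here is a rough user-space
model of how the decision could be resolved. It is only a sketch under
stated assumptions, not kernel code and not a proposed patch: the structs
`task` and `vma`, the `thp_policy_ops` table, `thp_allowed()` and
`demo_task_selector()` are all hypothetical names, the two hooks merely
mirror the per_task_mode_selector/per_vma_mode_selector calls mentioned
above, and the rule for combining the two hooks is my own assumption.
PR_SET_THP_DISABLE and MADV_NOHUGEPAGE are checked first, so no policy can
override them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Global modes from the thread; THP_USER is the hypothetical BPF-driven mode. */
enum thp_mode { THP_NEVER, THP_USER, THP_ALWAYS };

/* Hypothetical stand-ins, not the kernel's task_struct / vm_area_struct. */
struct task {
    int pid;
    bool thp_disabled;   /* models PR_SET_THP_DISABLE */
};

struct vma {
    bool nohugepage;     /* models MADV_NOHUGEPAGE */
};

/* Hypothetical policy hooks; each returns THP_NEVER or THP_ALWAYS.
 * A NULL hook means "no opinion at this level". */
struct thp_policy_ops {
    enum thp_mode (*per_task_mode_selector)(const struct task *task);
    enum thp_mode (*per_vma_mode_selector)(const struct vma *vma);
};

/* Decide whether THP may be used for (task, vma) under the global mode. */
bool thp_allowed(enum thp_mode global, const struct thp_policy_ops *ops,
                 const struct task *task, const struct vma *vma)
{
    assert(task != NULL && vma != NULL);

    /* Existing opt-outs are honored before any policy is consulted. */
    if (task->thp_disabled || vma->nohugepage)
        return false;

    if (global == THP_NEVER)
        return false;
    if (global == THP_ALWAYS)
        return true;

    /* THP_USER with no policy, or no hooks at all: stay conservative. */
    if (ops == NULL ||
        (ops->per_task_mode_selector == NULL &&
         ops->per_vma_mode_selector == NULL))
        return false;

    /* Assumption: every installed hook must answer THP_ALWAYS. */
    if (ops->per_task_mode_selector != NULL &&
        ops->per_task_mode_selector(task) != THP_ALWAYS)
        return false;
    if (ops->per_vma_mode_selector != NULL &&
        ops->per_vma_mode_selector(vma) != THP_ALWAYS)
        return false;

    return true;
}

/* Example policy: enable THP only for one (arbitrary) pid. */
enum thp_mode demo_task_selector(const struct task *task)
{
    return task->pid == 42 ? THP_ALWAYS : THP_NEVER;
}
```

With `demo_task_selector` installed as the only hook, `thp_allowed(THP_USER,
...)` is true for pid 42 and false for every other task, and it is false for
pid 42 as well once the vma carries the MADV_NOHUGEPAGE flag.
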
>
> I'm wondering how we could make that generic: likely we want this new
> mechanism to *not* be triggerable by the process itself (madvise).
>
> I am not convinced bpf is the answer here ...

I believe the key insight is that we should define a generic, stable API
for BPF-based THP mode selection.

--
Regards
Yafang