Re: [RFC PATCH v5 mm-new 0/5] mm, bpf: BPF based THP order selection

On Tue, Aug 19, 2025 at 6:44 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>
>
>
> On 19/08/2025 03:41, Yafang Shao wrote:
> > On Mon, Aug 18, 2025 at 10:35 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> >>
> >>
> >>
> >> On 18/08/2025 06:55, Yafang Shao wrote:
> >>> Background
> >>> ----------
> >>>
> >>> Our production servers consistently configure THP to "never" due to
> >>> historical incidents caused by its behavior. Key issues include:
> >>> - Increased Memory Consumption
> >>>   THP significantly raises overall memory usage, reducing available memory
> >>>   for workloads.
> >>>
> >>> - Latency Spikes
> >>>   Random latency spikes occur due to frequent memory compaction triggered
> >>>   by THP.
> >>>
> >>> - Lack of Fine-Grained Control
> >>>   THP tuning is globally configured, making it unsuitable for containerized
> >>>   environments. When multiple workloads share a host, enabling THP without
> >>>   per-workload control leads to unpredictable behavior.
> >>>
> >>> Due to these issues, administrators avoid switching to madvise or always
> >>> modes—unless per-workload THP control is implemented.
> >>>
> >>> To address this, we propose BPF-based THP policy for flexible adjustment.
> >>> Additionally, as David mentioned [0], this mechanism can also serve as a
> >>> policy prototyping tool (test policies via BPF before upstreaming them).
> >>
> >> Hi Yafang,
> >>
> >> A few points:
> >>
> >> The link [0] is referenced a couple of times in the cover letter, but the
> >> actual link doesn't seem to be anywhere in it.
> >
> > Oops, my bad.
> >
> >>
> >> I am probably missing something over here, but the current version won't accomplish
> >> the use case you have described at the start of the cover letter and are aiming for, right?
> >> i.e. THP global policy "never", but get hugepages on an madvise or always basis.
> >
> > In "never" mode, THP allocation is entirely disabled (except via
> > MADV_COLLAPSE). However, we can achieve the same behavior—and
> > more—using a BPF program, even in "madvise" or "always" mode. Instead
> > of introducing a new THP mode, we dynamically enforce policy via BPF.
> >
> > Deployment steps on our production servers:
> >
> > 1. Initial Setup:
> > - Set THP mode to "never" (disabling THP by default).
> > - Attach the BPF program and pin the BPF maps and links.
> > - Pinning ensures persistence (like a kernel module), preventing
> > disruption under system pressure.
> > - A THP whitelist map tracks allowed cgroups (initially empty → no THP
> > allocations).
> >
> > 2. Enable THP Control:
> > - Switch THP mode to "always" or "madvise" (BPF now governs actual allocations).
>
>
> Ah ok, so I was missing this part. With this solution you will still have to change
> the system policy to madvise or always, and then basically disable THP for everyone apart
> from the cgroups that want it?

Right.
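
To make the idea more concrete, here is a rough sketch of what the BPF side
of such a whitelist policy could look like. The map name, the
"struct_ops/thp_get_order" section and the return convention are purely
illustrative, not necessarily the exact interface added by this series:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

char _license[] SEC("license") = "GPL";

/* cgroup IDs that are allowed to use THP, and the max order they may get */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 4096);
	__type(key, __u64);	/* cgroup ID */
	__type(value, __u32);	/* max THP order for this cgroup */
} thp_whitelist SEC(".maps");

/* Hypothetical hook: return the max THP order for the current task.
 * 0 means base pages only, so an empty map disables THP for everyone. */
SEC("struct_ops/thp_get_order")
int thp_get_order(void)
{
	__u64 cgid = bpf_get_current_cgroup_id();
	__u32 *order = bpf_map_lookup_elem(&thp_whitelist, &cgid);

	return order ? *order : 0;
}

The map and the link are then pinned under bpffs so the policy stays in
place even if the loading process exits.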

>
> >
> > 3. Dynamic Management:
> > - To permit THP for a cgroup, add its ID to the whitelist map.
> > - To revoke permission, remove the cgroup ID from the map.
> > - The BPF program can be updated live (policy adjustments require no
> > task interruption).
> >
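
Since the whitelist is just a pinned map, the add/remove in step 3 can be
done from a small userspace helper (or bpftool). The pin path
"/sys/fs/bpf/thp_whitelist" and the value layout below are assumptions for
illustration only:

#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <bpf/bpf.h>

/* Usage: thp-ctl allow|deny <cgroup-id> */
int main(int argc, char **argv)
{
	int fd = bpf_obj_get("/sys/fs/bpf/thp_whitelist");
	uint64_t cgid;
	uint32_t order = 9;	/* e.g. allow up to PMD order (2MB on x86-64) */

	if (fd < 0 || argc != 3)
		return 1;

	cgid = strtoull(argv[2], NULL, 0);

	if (!strcmp(argv[1], "allow"))
		return bpf_map_update_elem(fd, &cgid, &order, BPF_ANY) ? 1 : 0;

	return bpf_map_delete_elem(fd, &cgid) ? 1 : 0;
}

The cgroup ID can be taken from the inode number of the cgroup's directory
on cgroup2.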
> >> I think there was a new THP mode introduced in some earlier revision where you can switch to it
> >> from "never" and then you can use bpf programs with it, but it's not in this revision?
> >> It might be useful to add your specific usecase as a selftest.
> >>
> >> Do we have some numbers on what the overhead of calling the bpf program is in the
> >> page fault path, as it's a critical path?
> >
> > In our current implementation, THP allocation does occur in the page
> > fault path, but I have not yet measured the overhead for this specific
> > case.
> > The overhead is expected to be workload-dependent, primarily influenced by:
> > - Memory availability: The presence (or absence) of higher-order free pages
> > - System pressure: Contention for memory compaction, NUMA balancing,
> > or direct reclaim
> >
>
> Yes, I think it might be worth seeing if perf indicates that you are spending more time
> in __handle_mm_fault with this series + bpf program attached compared to without?

I will test it.
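
One simple way to compare would be to record the same workload with and
without the BPF program attached and then diff the profiles, e.g. something
along these lines:

  perf record -g -o before.data -p <workload-pid> -- sleep 30
  # attach the BPF program, then:
  perf record -g -o after.data -p <workload-pid> -- sleep 30
  perf diff before.data after.data

and check whether the share of cycles in __handle_mm_fault changes
noticeably.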

-- 
Regards
Yafang




