On Mon, Aug 18, 2025 at 10:35 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>
>
> On 18/08/2025 06:55, Yafang Shao wrote:
> > Background
> > ----------
> >
> > Our production servers consistently configure THP to "never" due to
> > historical incidents caused by its behavior. Key issues include:
> > - Increased Memory Consumption
> >   THP significantly raises overall memory usage, reducing available
> >   memory for workloads.
> >
> > - Latency Spikes
> >   Random latency spikes occur due to frequent memory compaction
> >   triggered by THP.
> >
> > - Lack of Fine-Grained Control
> >   THP tuning is globally configured, making it unsuitable for
> >   containerized environments. When multiple workloads share a host,
> >   enabling THP without per-workload control leads to unpredictable
> >   behavior.
> >
> > Due to these issues, administrators avoid switching to madvise or
> > always modes unless per-workload THP control is implemented.
> >
> > To address this, we propose a BPF-based THP policy for flexible
> > adjustment. Additionally, as David mentioned [0], this mechanism can
> > also serve as a policy prototyping tool (test policies via BPF before
> > upstreaming them).
>
> Hi Yafang,
>
> A few points:
>
> The link [0] is mentioned a couple of times in the coverletter, but it
> doesn't seem to be anywhere in the coverletter.

Oops, my bad.

> I am probably missing something over here, but the current version
> won't accomplish the usecase you have described at the start of the
> coverletter and are aiming for, right? i.e. THP global policy "never",
> but get hugepages on an madvise or always basis.

In "never" mode, THP allocation is entirely disabled (except via
MADV_COLLAPSE). However, with a BPF program we can achieve the same
behavior, and more, even in "madvise" or "always" mode. Instead of
introducing a new THP mode, we dynamically enforce the policy via BPF.

Deployment steps on our production servers:

1. Initial Setup:
   - Set THP mode to "never" (disabling THP by default).
   - Attach the BPF program and pin the BPF maps and links.
   - Pinning ensures persistence (like a kernel module), preventing
     disruption under system pressure.
   - A THP whitelist map tracks allowed cgroups (initially empty → no
     THP allocations).

2. Enable THP Control:
   - Switch THP mode to "always" or "madvise" (BPF now governs actual
     allocations).

3. Dynamic Management:
   - To permit THP for a cgroup, add its ID to the whitelist map.
   - To revoke permission, remove the cgroup ID from the map.
   - The BPF program can be updated live; policy adjustments require no
     task interruption.
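To make the flow above more concrete, a rough sketch of the BPF side
could look like the following. This is illustrative only: the
get_suggested_order name comes from this series, but the exact hook
signature, the struct_ops type name (bpf_thp_ops), and the map layout
shown here are assumptions and may not match the actual interface.

/*
 * Illustrative sketch only. The hook signature and the bpf_thp_ops
 * struct_ops type are assumptions; the real definitions come from the
 * patched kernel's vmlinux.h and may differ.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* Whitelist of cgroup IDs that are allowed to receive THP. */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, u64);	/* cgroup ID */
	__type(value, u8);	/* presence is all that matters */
} thp_whitelist SEC(".maps");

/*
 * Suggest order 0 (base pages only) unless the faulting task's cgroup
 * has been added to the whitelist, in which case the order requested
 * by the core mm code is allowed.
 */
SEC("struct_ops/get_suggested_order")
int BPF_PROG(get_suggested_order, struct mm_struct *mm, u64 vma_flags,
	     u64 tva_flags, int order)
{
	/* Assumes this helper is usable from the hook's context. */
	u64 cgrp_id = bpf_get_current_cgroup_id();

	if (bpf_map_lookup_elem(&thp_whitelist, &cgrp_id))
		return order;	/* whitelisted cgroup: allow THP */

	return 0;		/* everyone else gets base pages only */
}

SEC(".struct_ops.link")
struct bpf_thp_ops thp_policy = {
	.get_suggested_order = (void *)get_suggested_order,
};

char LICENSE[] SEC("license") = "GPL";

A userspace agent would then grant or revoke THP for a cgroup by adding
or removing its ID in thp_whitelist (e.g. via bpf_map_update_elem() /
bpf_map_delete_elem() or bpftool), without detaching the program.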
> I think there was a new THP mode introduced in some earlier revision
> where you can switch to it from "never" and then you can use bpf
> programs with it, but it's not in this revision?
> It might be useful to add your specific usecase as a selftest.
>
> Do we have some numbers on what the overhead of calling the bpf
> program is in the pagefault path as it's a critical path?

In our current implementation, THP allocation (and therefore the call
into the BPF program) occurs during the page fault path, but I have not
yet evaluated the overhead for this specific case. The overhead is
expected to be workload-dependent, primarily influenced by:

- Memory availability: the presence (or absence) of higher-order free
  pages
- System pressure: contention for memory compaction, NUMA balancing,
  or direct reclaim

> I remember there was a discussion on this in the earlier revisions,
> and I have mentioned this in patch 1 as well, but I think making this
> feature experimental with warnings might not be a great idea.

The experimental status of this feature was requested by David and
Lorenzo, who likely have specific technical considerations behind this
requirement.

> It could lead to 2 paths:
> - people don't deploy this in their fleet because it's marked as
>   experimental and they don't want their machines to break once they
>   upgrade the kernel and this is changed. We will have a difficult
>   time improving upon this as this is just going to be used for
>   prototyping and won't be driven by production data.
> - people are careless and deploy it on their production machines, and
>   you get reports that this has broken after kernel upgrades (despite
>   being marked as experimental :)).
> This is just my opinion (which can be wrong :)), but I think we should
> try and have this merged as a stable interface that won't change.
> There might be bugs reported down the line, but I am hoping we can get
> the interface of get_suggested_order right in the first implementation
> that gets merged?

We may eventually remove the experimental status or deprecate this
feature entirely, depending on its adoption. However, the first
critical step is to make it available for broader usage and evaluation.

--
Regards
Yafang