Re: [RFC PATCH v4 0/4] mm, bpf: BPF based THP order selection

Yafang Shao <laoar.shao@xxxxxxxxx> · Wed, 30 Jul 2025 10:31:37 +0800

On Tue, Jul 29, 2025 at 11:08 PM Zi Yan <ziy@xxxxxxxxxx> wrote:
>
> On 29 Jul 2025, at 5:18, Yafang Shao wrote:
>
> > Background
> > ----------
> >
> > Our production servers consistently configure THP to "never" due to
> > historical incidents caused by its behavior. Key issues include:
> > - Increased Memory Consumption
> >   THP significantly raises overall memory usage, reducing available memory
> >   for workloads.
> >
> > - Latency Spikes
> >   Random latency spikes occur due to frequent memory compaction triggered
> >   by THP.
> >
> > - Lack of Fine-Grained Control
> >   THP tuning is globally configured, making it unsuitable for containerized
> >   environments. When multiple workloads share a host, enabling THP without
> >   per-workload control leads to unpredictable behavior.
> >
> > Due to these issues, administrators avoid switching to madvise or always
> > modes unless per-workload THP control is implemented.
> >
> > To address this, we propose BPF-based THP policy for flexible adjustment.
> > Additionally, as David mentioned [0], this mechanism can also serve as a
>
> The link to [0] is missing. :)

I forgot to add it:
https://lwn.net/ml/all/9bc57721-5287-416c-aa30-46932d605f63@xxxxxxxxxx/

>
> > policy prototyping tool (test policies via BPF before upstreaming them).
> >
> > Proposed Solution
> > -----------------
> >
> > As suggested by David [0], we introduce a new BPF interface:
> >
> > /**
> >  * @get_suggested_order: Get the suggested highest THP order for allocation
> >  * @mm: mm_struct associated with the THP allocation
> >  * @tva_flags: TVA flags for current context
> >  *             %TVA_IN_PF: Set when in page fault context
> >  *             Other flags: Reserved for future use
> >  * @order: The highest order being considered for this THP allocation.
> >  *         %PUD_ORDER for PUD-mapped allocations
>
> There is no PUD THP yet and the highest THP order is PMD_ORDER. It is better
> to remove the line above to avoid confusion.

Thanks for catching that. I’ll remove it.

>
> >  *         %PMD_ORDER for PMD-mapped allocations
> >  *         %PMD_ORDER - 1 for mTHP allocations
> >  *
> >  * Return: Suggested highest THP order to use for allocation. The returned
> >  * order will never exceed the input @order value.
> >  */
> > int (*get_suggested_order)(struct mm_struct *mm, unsigned long tva_flags, int order);
> >
> > This interface:
> > - Supports both use cases (per-workload tuning + policy prototyping).
> > - Can be extended with BPF helpers (e.g., for memory pressure awareness).
>
> IIRC, your initial RFC works at VMA level, but this patch targets mm level.
> Is mm sufficient for your use case?

Yes, mm is sufficient for our use cases.
We've already deployed a variant of this patchset in our production
environment, and it has been performing well under our workloads.

> Are you planning to extend the
> BPF interface to VMA in the future? Just curious.

Our use cases don’t currently require VMA-level control.
We can add it later if a clear need arises.
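
For illustration only, here is a minimal userspace sketch of how an
mm-level policy could sit behind the get_suggested_order() contract
quoted above. It is not the code from this series: the mock mm_struct
fields, the TVA_IN_PF bit value, PMD_ORDER == 9, DEMO_PF_MAX_ORDER,
demo_policy() and thp_pick_order() are all assumptions made up for the
example. It only shows the "never exceed the input @order" rule applied
on the caller side.

```c
#include <stddef.h>

/* Hypothetical stand-ins; the real kernel definitions differ. */
#define PMD_ORDER          9            /* assumed: 4 KiB base pages, 2 MiB PMD */
#define TVA_IN_PF          (1UL << 0)   /* assumed bit value, illustration only */
#define DEMO_PF_MAX_ORDER  4            /* made-up cap used by the demo policy */

/* Mock of the per-process state; not the kernel's struct mm_struct. */
struct mm_struct {
	int thp_allowed;                /* hypothetical per-workload opt-in */
};

/* Same callback signature as the interface quoted in this thread. */
struct thp_policy_ops {
	int (*get_suggested_order)(struct mm_struct *mm,
				   unsigned long tva_flags, int order);
};

/*
 * Example policy: workloads that did not opt in get order 0 (no THP);
 * in page fault context the order is capped at DEMO_PF_MAX_ORDER;
 * otherwise the order under consideration is accepted unchanged.
 */
static int demo_policy(struct mm_struct *mm, unsigned long tva_flags,
		       int order)
{
	if (!mm->thp_allowed)
		return 0;
	if (tva_flags & TVA_IN_PF)
		return order < DEMO_PF_MAX_ORDER ? order : DEMO_PF_MAX_ORDER;
	return order;
}

/*
 * Caller-side wrapper: with no policy attached, keep the default order;
 * otherwise clamp the suggestion to the range [0, order] so a policy can
 * only lower the order, never raise it.
 */
static int thp_pick_order(const struct thp_policy_ops *ops,
			  struct mm_struct *mm, unsigned long tva_flags,
			  int order)
{
	int suggested;

	if (!ops || !ops->get_suggested_order)
		return order;

	suggested = ops->get_suggested_order(mm, tva_flags, order);
	if (suggested < 0)
		return 0;
	return suggested > order ? order : suggested;
}
```

In this sketch, a workload that has not opted in always ends up at
order 0, which mirrors the per-workload control described in the
Background section, while the clamp keeps a misbehaving policy from
asking for more than the kernel was already considering.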

--
Regards

Yafang