Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment

Johannes Weiner <hannes@xxxxxxxxxxx> · Wed, 30 Apr 2025 13:45:21 -0400

On Thu, May 01, 2025 at 12:06:31AM +0800, Yafang Shao wrote:
> > > > If it isn't, can you state why?
> > > >
> > > > The main difference is that you are saying it's in a container that you
> > > > don't control.  Your plan is to violate the control the internal
> > > > applications have over THP because you know better.  I'm not sure how
> > > > people might feel about you messing with workloads,
> > >
> > > It’s not a mess. They have the option to deploy their services on
> > > dedicated servers, but they would need to pay more for that choice.
> > > This is a two-way decision.
> >
> > This implies you want a container-level way of controlling the setting
> > and not a system service-level?
> 
> Right. We want to control the THP per container.

This does strike me as a reasonable usecase.

I think there is consensus that in the long-term we want this stuff to
just work and truly be transparent to userspace.

In the short-to-medium term, however, there are still quite a few
caveats. thp=always can significantly increase the memory footprint of
sparse virtual regions. Huge allocations are not as cheap and reliable
as we would like them to be, which for real production systems means
having to make workload-specific choices and tradeoffs.

There is ongoing work in these areas, but we do have a bit of a
chicken-and-egg problem: on the one hand, huge page adoption is slow
due to limitations in how they can be deployed. For example, we can't
do thp=always on a DC node that runs arbitrary combinations of jobs
from a wide array of services. Some might benefit, some might hurt.

Yet, it's much easier to improve the kernel based on exactly such
production experience and data from real-world usecases. We can't
improve the THP shrinker if we can't run THP.

So I don't see it as overriding whoever wrote the software running
inside the container. They don't know, and they shouldn't have to care
about page sizes. It's about letting admins and kernel teams get
started on using and experimenting with this stuff, given the very
real constraints right now, so we can get the feedback necessary to
improve the situation.