On Fri, May 2, 2025 at 3:36 AM Gutierrez Asier
<gutierrez.asier@xxxxxxxxxxxxxxxxxxx> wrote:
>
>
> On 4/30/2025 8:53 PM, Zi Yan wrote:
> > On 30 Apr 2025, at 13:45, Johannes Weiner wrote:
> >
> >> On Thu, May 01, 2025 at 12:06:31AM +0800, Yafang Shao wrote:
> >>>>>> If it isn't, can you state why?
> >>>>>>
> >>>>>> The main difference is that you are saying it's in a container that you
> >>>>>> don't control. Your plan is to violate the control the internal
> >>>>>> applications have over THP because you know better. I'm not sure how
> >>>>>> people might feel about you messing with workloads,
> >>>>>
> >>>>> It's not a mess. They have the option to deploy their services on
> >>>>> dedicated servers, but they would need to pay more for that choice.
> >>>>> This is a two-way decision.
> >>>>
> >>>> This implies you want a container-level way of controlling the setting
> >>>> and not a system service-level?
> >>>
> >>> Right. We want to control the THP per container.
> >>
> >> This does strike me as a reasonable usecase.
> >>
> >> I think there is consensus that in the long-term we want this stuff to
> >> just work and truly be transparent to userspace.
> >>
> >> In the short-to-medium term, however, there are still quite a few
> >> caveats. thp=always can significantly increase the memory footprint of
> >> sparse virtual regions. Huge allocations are not as cheap and reliable
> >> as we would like them to be, which for real production systems means
> >> having to make workload-specific choices and tradeoffs.
> >>
> >> There is ongoing work in these areas, but we do have a bit of a
> >> chicken-and-egg problem: on the one hand, huge page adoption is slow
> >> due to limitations in how they can be deployed. For example, we can't
> >> do thp=always on a DC node that runs arbitrary combinations of jobs
> >> from a wide array of services. Some might benefit, some might hurt.
> >>
> >> Yet, it's much easier to improve the kernel based on exactly such
> >> production experience and data from real-world usecases. We can't
> >> improve the THP shrinker if we can't run THP.
> >>
> >> So I don't see it as overriding whoever wrote the software running
> >> inside the container. They don't know, and they shouldn't have to care
> >> about page sizes. It's about letting admins and kernel teams get
> >> started on using and experimenting with this stuff, given the very
> >> real constraints right now, so we can get the feedback necessary to
> >> improve the situation.
> >
> > Since you think it is reasonable to control THP at the container level,
> > namely per-cgroup, should we reconsider cgroup-based THP control[1]?
> > (Asier cc'd)
> >
> > In this patchset, Yafang uses BPF to adjust THP global configs based
> > on VMA, which does not look like a good approach to me. WDYT?
> >
> >
> > [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@xxxxxxxxxxxxxxxxxxx/
> >
> > --
> > Best Regards,
> > Yan, Zi
>
> Hi,
>
> I believe cgroup is a better approach for containers, since this
> approach can be easily integrated with the user space stack like
> containerd and Kubernetes, which use cgroup to control system resources.

The integration of BPF with containerd and Kubernetes is emerging as a
clear trend.

>
> However, as I pointed out earlier, the approach I suggested has some
> flaws:
> 1. Potential pollution of cgroup with a big number of knobs

Right, the memcg maintainers once told me that introducing a new cgroup
file means committing to maintaining it indefinitely, as these interface
files are treated as part of the ABI. In contrast, BPF kfuncs are
considered an unstable API, giving you the flexibility to modify them
later if needed.
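To make that comparison a bit more concrete, here is a rough sketch of
the shape a BPF-based policy could take. To be clear, the struct_ops
type, hook name and arguments below are placeholders I made up for
illustration; they are not the interface proposed in this series. The
point is only that the policy decision lives in a BPF program keyed by
cgroup id and driven from user space, rather than in new per-cgroup
interface files:

/*
 * Illustrative sketch only: the struct_ops type, hook name and arguments
 * here are made-up placeholders, not the interface in this series. It
 * just shows the rough shape: the kernel consults a loaded program, and
 * the program answers per cgroup whether THP should be used.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

/* cgroup ids of containers that opted in to THP, filled in from user space */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, __u64);	/* cgroup id */
	__type(value, __u8);	/* 1 = allow THP for this cgroup */
} thp_allow SEC(".maps");

/* hypothetical hook: may the current task's mm use THP for this VMA? */
SEC("struct_ops/thp_allowed")
int BPF_PROG(thp_allowed, struct mm_struct *mm, unsigned long vm_flags)
{
	__u64 cgid = bpf_get_current_cgroup_id();
	__u8 *allow = bpf_map_lookup_elem(&thp_allow, &cgid);

	/* cgroups not in the map just follow the global THP setting */
	if (!allow)
		return -1;	/* no opinion */
	return *allow;
}

SEC(".struct_ops.link")
struct thp_policy_ops thp_policy = {	/* hypothetical struct_ops type */
	.thp_allowed = (void *)thp_allowed,
};

Because the map is populated by the management stack, containerd or
Kubernetes could drive the same policy without adding new cgroup files,
and the hook itself can change between kernel versions without breaking
an ABI.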
> 2. Requires configuration by the admin
>
> Ideally, as Matthew W. mentioned, there should be an automatic system.

Take Matthew's XFS large folio feature as an example: it was enabled
automatically. A few years ago, when we upgraded to the 6.1.y stable
kernel, we noticed this new feature. Since it was enabled by default, we
assumed the author was confident in its stability. Unfortunately, it led
to severe issues in our production environment: servers crashed randomly,
and in some cases we experienced data loss without understanding the root
cause.

We began disabling various kernel configurations in an attempt to isolate
the issue, and eventually the problem disappeared after disabling
CONFIG_TRANSPARENT_HUGEPAGE. As a result, we released a new kernel version
with THP disabled and had to restart hundreds of thousands of production
servers. It was a nightmare for both us and our sysadmins.

Last year, we discovered that the initial issue had been resolved by this
patch:
https://lore.kernel.org/stable/20241001210625.95825-1-ryncsn@xxxxxxxxx/.
We backported the fix and re-enabled XFS large folios, only to face a new
nightmare. One of our services began crashing sporadically with core
dumps. It took us several months to trace the issue back to the re-enabled
XFS large folio feature. Fortunately, we were able to disable it using
livepatch, avoiding another round of mass server restarts.

To this day, the root cause remains unknown. The good news is that the
issue appears to be resolved in the 6.12.y stable kernel. We're still
trying to bisect which commit fixed it, though progress is slow because
the issue is not reliably reproducible.

In theory, new features should be enabled automatically. But in practice,
every new feature should come with a tunable knob. That's a lesson we
learned the hard way from this experience, and perhaps Matthew did too.

>
> Anyway, regarding containers, I believe cgroup is a good approach
> given that the admin or the container management system uses cgroups
> to set up the containers.
>
> --
> Asier Gutierrez
> Huawei
>

--
Regards
Yafang