Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment

On Fri, May 2, 2025 at 9:04 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 02.05.25 14:18, Yafang Shao wrote:
> > On Fri, May 2, 2025 at 8:00 PM Zi Yan <ziy@xxxxxxxxxx> wrote:
> >>
> >> On 2 May 2025, at 1:48, Yafang Shao wrote:
> >>
> >>> On Fri, May 2, 2025 at 3:36 AM Gutierrez Asier
> >>> <gutierrez.asier@xxxxxxxxxxxxxxxxxxx> wrote:
> >>>>
> >>>>
> >>>> On 4/30/2025 8:53 PM, Zi Yan wrote:
> >>>>> On 30 Apr 2025, at 13:45, Johannes Weiner wrote:
> >>>>>
> >>>>>> On Thu, May 01, 2025 at 12:06:31AM +0800, Yafang Shao wrote:
> >>>>>>>>>> If it isn't, can you state why?
> >>>>>>>>>>
> >>>>>>>>>> The main difference is that you are saying it's in a container that you
> >>>>>>>>>> don't control.  Your plan is to violate the control the internal
> >>>>>>>>>> applications have over THP because you know better.  I'm not sure how
> >>>>>>>>>> people might feel about you messing with workloads,
> >>>>>>>>>
> >>>>>>>>> It’s not a mess. They have the option to deploy their services on
> >>>>>>>>> dedicated servers, but they would need to pay more for that choice.
> >>>>>>>>> This is a two-way decision.
> >>>>>>>>
> >>>>>>>> This implies you want a container-level way of controlling the setting
> >>>>>>>> and not a system service-level?
> >>>>>>>
> >>>>>>> Right. We want to control the THP per container.
> >>>>>>
> >>>>>> This does strike me as a reasonable usecase.
> >>>>>>
> >>>>>> I think there is consensus that in the long-term we want this stuff to
> >>>>>> just work and truly be transparent to userspace.
> >>>>>>
> >>>>>> In the short-to-medium term, however, there are still quite a few
> >>>>>> caveats. thp=always can significantly increase the memory footprint of
> >>>>>> sparse virtual regions. Huge allocations are not as cheap and reliable
> >>>>>> as we would like them to be, which for real production systems means
> >>>>>> having to make workload-specific choices and tradeoffs.
> >>>>>>
> >>>>>> There is ongoing work in these areas, but we do have a bit of a
> >>>>>> chicken-and-egg problem: on the one hand, huge page adoption is slow
> >>>>>> due to limitations in how they can be deployed. For example, we can't
> >>>>>> do thp=always on a DC node that runs arbitrary combinations of jobs
> >>>>>> from a wide array of services. Some might benefit, some might hurt.
> >>>>>>
> >>>>>> Yet, it's much easier to improve the kernel based on exactly such
> >>>>>> production experience and data from real-world usecases. We can't
> >>>>>> improve the THP shrinker if we can't run THP.
> >>>>>>
> >>>>>> So I don't see it as overriding whoever wrote the software running
> >>>>>> inside the container. They don't know, and they shouldn't have to care
> >>>>>> about page sizes. It's about letting admins and kernel teams get
> >>>>>> started on using and experimenting with this stuff, given the very
> >>>>>> real constraints right now, so we can get the feedback necessary to
> >>>>>> improve the situation.
> >>>>>
> >>>>> Since you think it is reasonable to control THP at container-level,
> >>>>> namely per-cgroup. Should we reconsider cgroup-based THP control[1]?
> >>>>> (Asier cc'd)
> >>>>>
> >>>>> In this patchset, Yafang uses BPF to adjust THP global configs based
> >>>>> on VMA, which does not look like a good approach to me. WDYT?
> >>>>>
> >>>>>
> >>>>> [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@xxxxxxxxxxxxxxxxxxx/
> >>>>>
> >>>>> --
> >>>>> Best Regards,
> >>>>> Yan, Zi
> >>>>
> >>>> Hi,
> >>>>
> >>>> I believe cgroup is a better approach for containers, since this
> >>>> approach can be easily integrated with the user space stack like
> >>>> containerd and Kubernetes, which use cgroups to control system resources.
> >>>
> >>> The integration of BPF with containerd and Kubernetes is emerging as a
> >>> clear trend.
> >>>
> >>>>
> >>>> However, as I pointed out earlier, the approach I suggested has some
> >>>> flaws:
> >>>> 1. Potential pollution of cgroup with a large number of knobs
> >>>
> >>> Right, the memcg maintainers once told me that introducing a new
> >>> cgroup file means committing to maintaining it indefinitely, as these
> >>> interface files are treated as part of the ABI.
> >>> In contrast, BPF kfuncs are considered an unstable API, giving you the
> >>> flexibility to modify them later if needed.
> >>>
> >>>> 2. Requires configuration by the admin
> >>>>
> >>>> Ideally, as Matthew W. mentioned, there should be an automatic system.
> >>>
> >>> Take Matthew’s XFS large folio feature as an example—it was enabled
> >>> automatically. A few years ago, when we upgraded to the 6.1.y stable
> >>> kernel, we noticed this new feature. Since it was enabled by default,
> >>> we assumed the author was confident in its stability. Unfortunately,
> >>> it led to severe issues in our production environment: servers crashed
> >>> randomly, and in some cases, we experienced data loss without
> >>> understanding the root cause.
> >>>
> >>> We began disabling various kernel configurations in an attempt to
> >>> isolate the issue, and eventually, the problem disappeared after
> >>> disabling CONFIG_TRANSPARENT_HUGEPAGE. As a result, we released a new
> >>> kernel version with THP disabled and had to restart hundreds of
> >>> thousands of production servers. It was a nightmare for both us and
> >>> our sysadmins.
> >>>
> >>> Last year, we discovered that the initial issue had been resolved by this patch:
> >>> https://lore.kernel.org/stable/20241001210625.95825-1-ryncsn@xxxxxxxxx/.
> >>> We backported the fix and re-enabled XFS large folios—only to face a
> >>> new nightmare. One of our services began crashing sporadically with
> >>> core dumps. It took us several months to trace the issue back to the
> >>> re-enabled XFS large folio feature. Fortunately, we were able to
> >>> disable it using livepatch, avoiding another round of mass server
> >>> restarts. To this day, the root cause remains unknown. The good news
> >>> is that the issue appears to be resolved in the 6.12.y stable kernel.
> >>> We're still trying to bisect which commit fixed it, though progress is
> >>> slow because the issue is not reliably reproducible.
> >>
> >> This is a very wrong attitude towards open source projects. You sounded
> >> as if, whether intended or not, the Linux community should provide
> >> issue-free kernels and is responsible for fixing all issues. But that is
> >> wrong. Since you are using the kernel, you could help improve it like
> >> Kairong is doing instead of waiting for others to fix the issue.
> >>
> >>>
> >>> In theory, new features should be enabled automatically. But in
> >>> practice, every new feature should come with a tunable knob. That’s a
> >>> lesson we learned the hard way from this experience—and perhaps
> >>> Matthew did too.
> >>
> >> That means new features will not get enough testing. People like you
> >> will simply disable all new features and wait until they are stable.
> >> That stability would never come without testing and bug fixes.
>

Hello David,

Thank you for your reply.

> We do have the concept of EXPERIMENTAL kernel configs, that are either
> expected get removed completely ("always enabled") or get turned into
> actual long-term kernel options. But yeah, it's always tricky what we
> actually want to put behind such options.
>
> I mean, READ_ONLY_THP_FOR_FS is still around and still EXPERIMENTAL ...

READ_ONLY_THP_FOR_FS is not enabled in our 6.1 kernel, as we are
cautious about enabling any EXPERIMENTAL feature. XFS large folio
support operates independently of READ_ONLY_THP_FOR_FS; it is
enabled automatically whenever CONFIG_TRANSPARENT_HUGEPAGE is set, as
can be seen from mapping_large_folio_support() in the 6.1.y stable
kernel.
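
For reference, the gating looks roughly like the snippet below, which is
paraphrased from include/linux/pagemap.h in the 6.1 tree (details may
differ between stable releases):

	/* A filesystem opts in per address_space ... */
	static inline void mapping_set_large_folios(struct address_space *mapping)
	{
		__set_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
	}

	/* ... but the check is additionally gated on CONFIG_TRANSPARENT_HUGEPAGE */
	static inline bool mapping_large_folio_support(struct address_space *mapping)
	{
		return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
			test_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
	}

So any filesystem that calls mapping_set_large_folios(), as XFS does,
gets large folios as soon as CONFIG_TRANSPARENT_HUGEPAGE is set, with no
separate knob of its own.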

>
> Distro kernels are usually very careful about what to backport and what
> to support. Once we (working for a distro) do backport + test, we
> usually find some additional things that upstream hasn't spotted yet: in
> particular, because some workloads are only run in that form on distro
> kernels. We also ran into some issues with large folios (e.g., me
> personally with s390x KVM guests) and trying our best to fix them.

We also worked on this. As you may recall, I previously fixed a large
folio bug, which was merged into the 6.1.y stable kernel [0].

[0]. https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-6.1.y&id=a3f8ee15228c89ce3713ee7e8e82f6d8a13fdb4b

>
> It can be quite time consuming, so I can understand that not everybody
> has the time to invest into heavy debugging, especially if it's
> extremely hard to reproduce (or even corrupts data :( ).

Correct. If the vmcore is incomplete, it is nearly impossible to
reliably determine the root cause. The best approach is to isolate the
issue as quickly as possible.

>
> I agree that adding a toggle after the fact to work around issues is
> not the right approach. Introducing an EXPERIMENTAL toggle early because
> one suspects complicated interactions is a different story. It's
> absolutely not trivial to make that decision.

In this patchset, we are not introducing a toggle as a workaround.
Rather, the change reflects the fact that some workloads benefit from
THP while others are negatively impacted. Therefore, it makes sense to
enable THP selectively based on workload characteristics.
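
To make the intended usage a bit more concrete, a per-workload policy
program could look roughly like the sketch below. This is only an
illustration: the SEC() attach point name "thp_policy" and the mode
encoding are placeholders rather than the interface proposed in this
series; only the map definition, bpf_get_current_cgroup_id() and
bpf_map_lookup_elem() are existing BPF APIs.

	#include "vmlinux.h"
	#include <bpf/bpf_helpers.h>
	#include <bpf/bpf_tracing.h>

	/* cgroup id -> desired THP mode, filled in by the container manager.
	 * 0 = never, 1 = madvise, 2 = always (placeholder encoding).
	 */
	struct {
		__uint(type, BPF_MAP_TYPE_HASH);
		__uint(max_entries, 4096);
		__type(key, __u64);
		__type(value, __u32);
	} thp_mode SEC(".maps");

	SEC("struct_ops/thp_policy")	/* placeholder attach point */
	int BPF_PROG(thp_mode_for_current)
	{
		__u64 cgid = bpf_get_current_cgroup_id();
		__u32 *mode = bpf_map_lookup_elem(&thp_mode, &cgid);

		/* Fall back to the global default for unlisted cgroups. */
		return mode ? *mode : 1;
	}

	char LICENSE[] SEC("license") = "GPL";

The point is that the decision is made per cgroup of the faulting task,
so a Kubernetes agent only has to populate the map for the containers it
manages instead of flipping the global sysfs knob.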

>
> >
> > Pardon me?
> > This discussion has taken such an unexpected turn that I don’t feel
> > the need to explain what I’ve contributed to the Linux community over
> > the past few years.
>
> I'm sure Zi Yan didn't mean to insult you. I would have phrased it as:
>
> "It's difficult to decide which toggles make sense. There is a fine line
> between adding a toggle and not getting people actually testing it to
> stabilize it vs. not adding a toggle and forcing people to test it and
> fix it/report issues."
>
> Ideally, we'd find most issues in the RC phase or at least shortly after.
>
> You've been active in the kernel for a long time, please don't feel like
> the community is not appreciating that.

Thank you for the clarification. I truly appreciate your patience and
thoroughness.

-- 
Regards
Yafang




