On 2 May 2025, at 1:48, Yafang Shao wrote:
> On Fri, May 2, 2025 at 3:36 AM Gutierrez Asier
> <gutierrez.asier@xxxxxxxxxxxxxxxxxxx> wrote:
>>
>>
>> On 4/30/2025 8:53 PM, Zi Yan wrote:
>>> On 30 Apr 2025, at 13:45, Johannes Weiner wrote:
>>>
>>>> On Thu, May 01, 2025 at 12:06:31AM +0800, Yafang Shao wrote:
>>>>>>>> If it isn't, can you state why?
>>>>>>>>
>>>>>>>> The main difference is that you are saying it's in a container that you
>>>>>>>> don't control. Your plan is to violate the control the internal
>>>>>>>> applications have over THP because you know better. I'm not sure how
>>>>>>>> people might feel about you messing with workloads,
>>>>>>>
>>>>>>> It's not a mess. They have the option to deploy their services on
>>>>>>> dedicated servers, but they would need to pay more for that choice.
>>>>>>> This is a two-way decision.
>>>>>>
>>>>>> This implies you want a container-level way of controlling the setting
>>>>>> and not a system service-level?
>>>>>
>>>>> Right. We want to control THP per container.
>>>>
>>>> This does strike me as a reasonable use case.
>>>>
>>>> I think there is consensus that in the long term we want this stuff to
>>>> just work and be truly transparent to userspace.
>>>>
>>>> In the short-to-medium term, however, there are still quite a few
>>>> caveats. thp=always can significantly increase the memory footprint of
>>>> sparse virtual regions. Huge allocations are not as cheap and reliable
>>>> as we would like them to be, which for real production systems means
>>>> having to make workload-specific choices and tradeoffs.
>>>>
>>>> There is ongoing work in these areas, but we do have a bit of a
>>>> chicken-and-egg problem: on the one hand, huge page adoption is slow
>>>> due to limitations in how they can be deployed. For example, we can't
>>>> do thp=always on a DC node that runs arbitrary combinations of jobs
>>>> from a wide array of services. Some might benefit, some might hurt.
>>>>
>>>> Yet it's much easier to improve the kernel based on exactly such
>>>> production experience and data from real-world use cases. We can't
>>>> improve the THP shrinker if we can't run THP.
>>>>
>>>> So I don't see it as overriding whoever wrote the software running
>>>> inside the container. They don't know, and they shouldn't have to care
>>>> about page sizes. It's about letting admins and kernel teams get
>>>> started on using and experimenting with this stuff, given the very
>>>> real constraints right now, so we can get the feedback necessary to
>>>> improve the situation.
>>>
>>> Since you think it is reasonable to control THP at the container level,
>>> namely per cgroup, should we reconsider cgroup-based THP control [1]?
>>> (Asier cc'd)
>>>
>>> In this patchset, Yafang uses BPF to adjust the THP global configs based
>>> on the VMA, which does not look like a good approach to me. WDYT?
>>>
>>>
>>> [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@xxxxxxxxxxxxxxxxxxx/
>>>
>>> --
>>> Best Regards,
>>> Yan, Zi
>>
>> Hi,
>>
>> I believe cgroup is a better approach for containers, since it can be
>> easily integrated with the user-space stack, such as containerd and
>> Kubernetes, which use cgroup to control system resources.
>
> The integration of BPF with containerd and Kubernetes is emerging as a
> clear trend.
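
For anyone following along, the knobs that exist today sit at global or
process scope only: the system-wide policy in
/sys/kernel/mm/transparent_hugepage/enabled (always/madvise/never),
per-VMA hints via madvise(MADV_HUGEPAGE)/madvise(MADV_NOHUGEPAGE), and a
per-process opt-out via prctl(PR_SET_THP_DISABLE). Nothing sits at cgroup
scope, which is the gap both the cgroup knob and the BPF hook are trying
to fill. A rough sketch of the process-level surface, purely illustrative
and not tied to either patchset:

/*
 * thp-hints.c: the per-process / per-VMA THP control that exists today.
 * Shown only to illustrate the current control points; the two calls
 * below pull in opposite directions and would not normally be combined.
 *
 * Build: gcc -O2 -o thp-hints thp-hints.c
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/prctl.h>

int main(void)
{
	size_t len = 16UL << 20;	/* 16 MiB anonymous region */

	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/*
	 * Per-VMA opt-in: allow THP for this mapping even when the
	 * global policy is "madvise" rather than "always".
	 */
	if (madvise(buf, len, MADV_HUGEPAGE))
		perror("madvise(MADV_HUGEPAGE)");

	memset(buf, 0, len);		/* fault the region in */

	/*
	 * Per-process opt-out: disable THP for this process (and any
	 * children forked after this point), regardless of the global
	 * policy.
	 */
	if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0))
		perror("prctl(PR_SET_THP_DISABLE)");

	munmap(buf, len);
	return 0;
}

The global policy itself is still whatever the admin writes into sysfs,
e.g. "echo madvise > /sys/kernel/mm/transparent_hugepage/enabled", which
is exactly why a multi-tenant node cannot pick one setting that suits
every container.
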
>
>>
>> However, as I pointed out earlier, the approach I suggested has some
>> flaws:
>> 1. Potential pollution of cgroup with a big number of knobs
>
> Right, the memcg maintainers once told me that introducing a new
> cgroup file means committing to maintaining it indefinitely, as these
> interface files are treated as part of the ABI.
> In contrast, BPF kfuncs are considered an unstable API, giving you the
> flexibility to modify them later if needed.
>
>> 2. Requires configuration by the admin
>>
>> Ideally, as Matthew W. mentioned, there should be an automatic system.
>
> Take Matthew's XFS large folio feature as an example: it was enabled
> automatically. A few years ago, when we upgraded to the 6.1.y stable
> kernel, we noticed this new feature. Since it was enabled by default,
> we assumed the author was confident in its stability. Unfortunately,
> it led to severe issues in our production environment: servers crashed
> randomly, and in some cases we experienced data loss without
> understanding the root cause.
>
> We began disabling various kernel configurations in an attempt to
> isolate the issue, and eventually the problem disappeared after
> disabling CONFIG_TRANSPARENT_HUGEPAGE. As a result, we released a new
> kernel version with THP disabled and had to restart hundreds of
> thousands of production servers. It was a nightmare for both us and
> our sysadmins.
>
> Last year, we discovered that the initial issue had been resolved by this patch:
> https://lore.kernel.org/stable/20241001210625.95825-1-ryncsn@xxxxxxxxx/
> We backported the fix and re-enabled XFS large folios, only to face a
> new nightmare. One of our services began crashing sporadically with
> core dumps. It took us several months to trace the issue back to the
> re-enabled XFS large folio feature. Fortunately, we were able to
> disable it using a livepatch, avoiding another round of mass server
> restarts. To this day, the root cause remains unknown. The good news
> is that the issue appears to be resolved in the 6.12.y stable kernel.
> We're still trying to bisect which commit fixed it, though progress is
> slow because the issue is not reliably reproducible.

This is a very wrong attitude towards open source projects. You sound
as if, whether you intend it or not, the Linux community should provide
issue-free kernels and is responsible for fixing all issues. But that is
wrong. Since you are using the kernel, you could help improve it, as
Kairong is doing, instead of waiting for others to fix the issue.

> In theory, new features should be enabled automatically. But in
> practice, every new feature should come with a tunable knob. That's a
> lesson we learned the hard way from this experience, and perhaps
> Matthew did too.

That means new features will not get enough testing. People like you
will simply disable all new features and wait until they are stable.
That stability will never come without testing and bug fixes.

--
Best Regards,
Yan, Zi