Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment

Zi Yan <ziy@xxxxxxxxxx> · Wed, 30 Apr 2025 13:53:41 -0400

On 30 Apr 2025, at 13:45, Johannes Weiner wrote:

> On Thu, May 01, 2025 at 12:06:31AM +0800, Yafang Shao wrote:
>>>>> If it isn't, can you state why?
>>>>>
>>>>> The main difference is that you are saying it's in a container that you
>>>>> don't control.  Your plan is to violate the control the internal
>>>>> applications have over THP because you know better.  I'm not sure how
>>>>> people might feel about you messing with workloads,
>>>>
>>>> It’s not a mess. They have the option to deploy their services on
>>>> dedicated servers, but they would need to pay more for that choice.
>>>> This is a two-way decision.
>>>
>>> This implies you want a container-level way of controlling the setting
>>> and not a system service-level?
>>
>> Right. We want to control the THP per container.
>
> This does strike me as a reasonable usecase.
>
> I think there is consensus that in the long-term we want this stuff to
> just work and truly be transparent to userspace.
>
> In the short-to-medium term, however, there are still quite a few
> caveats. thp=always can significantly increase the memory footprint of
> sparse virtual regions. Huge allocations are not as cheap and reliable
> as we would like them to be, which for real production systems means
> having to make workload-specific choices and tradeoffs.
>
> There is ongoing work in these areas, but we do have a bit of a
> chicken-and-egg problem: on the one hand, huge page adoption is slow
> due to limitations in how they can be deployed. For example, we can't
> do thp=always on a DC node that runs arbitrary combinations of jobs
> from a wide array of services. Some might benefit, some might hurt.
>
> Yet, it's much easier to improve the kernel based on exactly such
> production experience and data from real-world usecases. We can't
> improve the THP shrinker if we can't run THP.
>
> So I don't see it as overriding whoever wrote the software running
> inside the container. They don't know, and they shouldn't have to care
> about page sizes. It's about letting admins and kernel teams get
> started on using and experimenting with this stuff, given the very
> real constraints right now, so we can get the feedback necessary to
> improve the situation.

Since you think it is reasonable to control THP at the container level,
namely per-cgroup, should we reconsider cgroup-based THP control[1]?
(Asier cc'd)

In this patchset, Yafang uses BPF to adjust THP global configs based
on VMA, which does not look like a good approach to me. WDYT?

[1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@xxxxxxxxxxxxxxxxxxx/

--
Best Regards,
Yan, Zi