On 4/30/2025 8:53 PM, Zi Yan wrote:
> On 30 Apr 2025, at 13:45, Johannes Weiner wrote:
>
>> On Thu, May 01, 2025 at 12:06:31AM +0800, Yafang Shao wrote:
>>>>>> If it isn't, can you state why?
>>>>>>
>>>>>> The main difference is that you are saying it's in a container that you
>>>>>> don't control. Your plan is to violate the control the internal
>>>>>> applications have over THP because you know better. I'm not sure how
>>>>>> people might feel about you messing with workloads,
>>>>>
>>>>> It's not a mess. They have the option to deploy their services on
>>>>> dedicated servers, but they would need to pay more for that choice.
>>>>> This is a two-way decision.
>>>>
>>>> This implies you want a container-level way of controlling the setting
>>>> and not a system service-level?
>>>
>>> Right. We want to control the THP per container.
>>
>> This does strike me as a reasonable usecase.
>>
>> I think there is consensus that in the long term we want this stuff to
>> just work and truly be transparent to userspace.
>>
>> In the short-to-medium term, however, there are still quite a few
>> caveats. thp=always can significantly increase the memory footprint of
>> sparse virtual regions. Huge allocations are not as cheap and reliable
>> as we would like them to be, which for real production systems means
>> having to make workload-specific choices and tradeoffs.
>>
>> There is ongoing work in these areas, but we do have a bit of a
>> chicken-and-egg problem: on the one hand, huge page adoption is slow
>> due to limitations in how they can be deployed. For example, we can't
>> do thp=always on a DC node that runs arbitrary combinations of jobs
>> from a wide array of services. Some might benefit, some might hurt.
>>
>> Yet, it's much easier to improve the kernel based on exactly such
>> production experience and data from real-world usecases. We can't
>> improve the THP shrinker if we can't run THP.
>>
>> So I don't see it as overriding whoever wrote the software running
>> inside the container. They don't know, and they shouldn't have to care
>> about page sizes. It's about letting admins and kernel teams get
>> started on using and experimenting with this stuff, given the very
>> real constraints right now, so we can get the feedback necessary to
>> improve the situation.
>
> Since you think it is reasonable to control THP at the container level,
> namely per-cgroup, should we reconsider cgroup-based THP control [1]?
> (Asier cc'd)
>
> In this patchset, Yafang uses BPF to adjust THP global configs based
> on the VMA, which does not look like a good approach to me. WDYT?
>
> [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@xxxxxxxxxxxxxxxxxxx/
>
> --
> Best Regards,
> Yan, Zi

Hi,

I believe cgroup is a better approach for containers, since it can be
easily integrated with the user-space stack, such as containerd and
Kubernetes, which use cgroups to control system resources.

However, as I pointed out earlier, the approach I suggested has some
flaws:

1. Potential pollution of cgroup with a large number of knobs
2. It requires configuration by the admin

Ideally, as Matthew W. mentioned, there should be an automatic system.
Anyway, regarding containers, I believe cgroup is a good approach,
given that the admin or the container management system uses cgroups
to set up the containers.

--
Asier Gutierrez
Huawei