On 4/30/2025 8:53 PM, Zi Yan wrote:
> On 30 Apr 2025, at 13:45, Johannes Weiner wrote:
>
>> On Thu, May 01, 2025 at 12:06:31AM +0800, Yafang Shao wrote:
>>>>>> If it isn't, can you state why?
>>>>>>
>>>>>> The main difference is that you are saying it's in a container that you
>>>>>> don't control. Your plan is to violate the control the internal
>>>>>> applications have over THP because you know better. I'm not sure how
>>>>>> people might feel about you messing with workloads,
>>>>>
>>>>> It's not a mess. They have the option to deploy their services on
>>>>> dedicated servers, but they would need to pay more for that choice.
>>>>> This is a two-way decision.
>>>>
>>>> This implies you want a container-level way of controlling the setting
>>>> and not a system service-level?
>>>
>>> Right. We want to control the THP per container.
>>
>> This does strike me as a reasonable usecase.
>>
>> I think there is consensus that in the long term we want this stuff to
>> just work and truly be transparent to userspace.
>>
>> In the short-to-medium term, however, there are still quite a few
>> caveats. thp=always can significantly increase the memory footprint of
>> sparse virtual regions. Huge allocations are not as cheap and reliable
>> as we would like them to be, which for real production systems means
>> having to make workload-specific choices and tradeoffs.
>>
>> There is ongoing work in these areas, but we do have a bit of a
>> chicken-and-egg problem: on the one hand, huge page adoption is slow
>> due to limitations in how they can be deployed. For example, we can't
>> do thp=always on a DC node that runs arbitrary combinations of jobs
>> from a wide array of services. Some might benefit, some might hurt.
>>
>> Yet, it's much easier to improve the kernel based on exactly such
>> production experience and data from real-world usecases. We can't
>> improve the THP shrinker if we can't run THP.
>>
>> So I don't see it as overriding whoever wrote the software running
>> inside the container. They don't know, and they shouldn't have to care
>> about page sizes. It's about letting admins and kernel teams get
>> started on using and experimenting with this stuff, given the very
>> real constraints right now, so we can get the feedback necessary to
>> improve the situation.
>
> Since you think it is reasonable to control THP at the container level,
> namely per-cgroup, should we reconsider cgroup-based THP control [1]?
> (Asier cc'd)
>
> In this patchset, Yafang uses BPF to adjust THP global configs based
> on the VMA, which does not look like a good approach to me. WDYT?
>
> [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@xxxxxxxxxxxxxxxxxxx/
>
> --
> Best Regards,
> Yan, Zi

Hi,

I believe cgroup is a better approach for containers, since it can be
easily integrated with the user-space stack, such as containerd and
Kubernetes, which use cgroups to control system resources.

However, as I pointed out earlier, the approach I suggested has some
flaws:

1. Potential pollution of cgroup with a large number of knobs
2. It requires configuration by the admin

Ideally, as Matthew W. mentioned, there should be an automatic system.
Anyway, regarding containers, I believe cgroup is a good approach,
given that the admin or the container management system uses cgroups
to set up the containers.

--
Asier Gutierrez
Huawei