On 30 Apr 2025, at 13:45, Johannes Weiner wrote:

> On Thu, May 01, 2025 at 12:06:31AM +0800, Yafang Shao wrote:
>>>>> If it isn't, can you state why?
>>>>>
>>>>> The main difference is that you are saying it's in a container that you
>>>>> don't control. Your plan is to violate the control the internal
>>>>> applications have over THP because you know better. I'm not sure how
>>>>> people might feel about you messing with workloads,
>>>>
>>>> It’s not a mess. They have the option to deploy their services on
>>>> dedicated servers, but they would need to pay more for that choice.
>>>> This is a two-way decision.
>>>
>>> This implies you want a container-level way of controlling the setting
>>> and not a system service-level?
>>
>> Right. We want to control the THP per container.
>
> This does strike me as a reasonable usecase.
>
> I think there is consensus that in the long-term we want this stuff to
> just work and truly be transparent to userspace.
>
> In the short-to-medium term, however, there are still quite a few
> caveats. thp=always can significantly increase the memory footprint of
> sparse virtual regions. Huge allocations are not as cheap and reliable
> as we would like them to be, which for real production systems means
> having to make workload-specific choices and tradeoffs.
>
> There is ongoing work in these areas, but we do have a bit of a
> chicken-and-egg problem: on the one hand, huge page adoption is slow
> due to limitations in how they can be deployed. For example, we can't
> do thp=always on a DC node that runs arbitrary combinations of jobs
> from a wide array of services. Some might benefit, some might hurt.
>
> Yet, it's much easier to improve the kernel based on exactly such
> production experience and data from real-world usecases. We can't
> improve the THP shrinker if we can't run THP.
>
> So I don't see it as overriding whoever wrote the software running
> inside the container. They don't know, and they shouldn't have to care
> about page sizes. It's about letting admins and kernel teams get
> started on using and experimenting with this stuff, given the very
> real constraints right now, so we can get the feedback necessary to
> improve the situation.

Since you think it is reasonable to control THP at the container level,
namely per cgroup, should we reconsider cgroup-based THP control[1]?
(Asier cc'd)

In this patchset, Yafang uses BPF to adjust THP global configs based
on VMA, which does not look like a good approach to me. WDYT?

[1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@xxxxxxxxxxxxxxxxxxx/

--
Best Regards,
Yan, Zi