On Mon, May 5, 2025 at 5:11 PM Gutierrez Asier
<gutierrez.asier@xxxxxxxxxxxxxxxxxxx> wrote:
>
>
>
> On 5/2/2025 8:48 AM, Yafang Shao wrote:
> > On Fri, May 2, 2025 at 3:36 AM Gutierrez Asier
> > <gutierrez.asier@xxxxxxxxxxxxxxxxxxx> wrote:
> >>
> >>
> >> On 4/30/2025 8:53 PM, Zi Yan wrote:
> >>> On 30 Apr 2025, at 13:45, Johannes Weiner wrote:
> >>>
> >>>> On Thu, May 01, 2025 at 12:06:31AM +0800, Yafang Shao wrote:
> >>>>>>>> If it isn't, can you state why?
> >>>>>>>>
> >>>>>>>> The main difference is that you are saying it's in a container that you
> >>>>>>>> don't control. Your plan is to violate the control the internal
> >>>>>>>> applications have over THP because you know better. I'm not sure how
> >>>>>>>> people might feel about you messing with workloads,
> >>>>>>>
> >>>>>>> It’s not a mess. They have the option to deploy their services on
> >>>>>>> dedicated servers, but they would need to pay more for that choice.
> >>>>>>> This is a two-way decision.
> >>>>>>
> >>>>>> This implies you want a container-level way of controlling the setting
> >>>>>> and not a system service-level?
> >>>>>
> >>>>> Right. We want to control the THP per container.
> >>>>
> >>>> This does strike me as a reasonable usecase.
> >>>>
> >>>> I think there is consensus that in the long-term we want this stuff to
> >>>> just work and truly be transparent to userspace.
> >>>>
> >>>> In the short-to-medium term, however, there are still quite a few
> >>>> caveats. thp=always can significantly increase the memory footprint of
> >>>> sparse virtual regions. Huge allocations are not as cheap and reliable
> >>>> as we would like them to be, which for real production systems means
> >>>> having to make workload-specifcic choices and tradeoffs.
> >>>>
> >>>> There is ongoing work in these areas, but we do have a bit of a
> >>>> chicken-and-egg problem: on the one hand, huge page adoption is slow
> >>>> due to limitations in how they can be deployed. For example, we can't
> >>>> do thp=always on a DC node that runs arbitary combinations of jobs
> >>>> from a wide array of services. Some might benefit, some might hurt.
> >>>>
> >>>> Yet, it's much easier to improve the kernel based on exactly such
> >>>> production experience and data from real-world usecases. We can't
> >>>> improve the THP shrinker if we can't run THP.
> >>>>
> >>>> So I don't see it as overriding whoever wrote the software running
> >>>> inside the container. They don't know, and they shouldn't have to care
> >>>> about page sizes. It's about letting admins and kernel teams get
> >>>> started on using and experimenting with this stuff, given the very
> >>>> real constraints right now, so we can get the feedback necessary to
> >>>> improve the situation.
> >>>
> >>> Since you think it is reasonable to control THP at container-level,
> >>> namely per-cgroup. Should we reconsider cgroup-based THP control[1]?
> >>> (Asier cc'd)
> >>>
> >>> In this patchset, Yafang uses BPF to adjust THP global configs based
> >>> on VMA, which does not look a good approach to me. WDYT?
> >>>
> >>>
> >>> [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@xxxxxxxxxxxxxxxxxxx/
> >>>
> >>> --
> >>> Best Regards,
> >>> Yan, Zi
> >>
> >> Hi,
> >>
> >> I believe cgroup is a better approach for containers, since this
> >> approach can be easily integrated with the user space stack like
> >> containerd and kubernets, which use cgroup to control system resources.
> >
> > The integration of BPF with containerd and Kubernetes is emerging as a
> > clear trend.
> >
>
> No, eBPF is not used for resource management, it is mainly used by the
> network stack (CNI), monitoring and security.

That is certainly the best-known use of BPF in Kubernetes today, thanks
largely to Cilium, but it is not the only one.

> All the resource
> management by Kubernetes is done using cgroups.

The landscape has shifted, though. As Johannes (the memcg maintainer)
noted[0]: "Cgroups are for nested trees dividing up resources. They're
not a good fit for arbitrary, non-hierarchical policy settings."

[0]. https://lore.kernel.org/linux-mm/20250430175954.GD2020@xxxxxxxxxxx/

> You are very unlikely
> to convince the Kubernetes community to manage memory resources using
> eBPF.

Kubernetes already provides a native mechanism for this. As documented
in the Container Lifecycle Hooks guide[1], you can load BPF programs as
plugins through these hooks. This is exactly the approach we've
successfully implemented in our production environments. A rough sketch
of what such a hook can look like is appended below for illustration.

[1]. https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/

--
Regards
Yafang
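To make that concrete, here is a minimal sketch of a Pod whose postStart
lifecycle hook loads and pins a BPF object, written against the Kubernetes
Go API types (k8s.io/api, where LifecycleHandler is the current type name)
so it type-checks. Everything specific in it is an illustrative assumption
rather than our actual deployment: the image name, the object path
/opt/bpf/thp_policy.bpf.o and the bpffs pin path are invented, the exact
bpftool invocation depends on the program type and how it is attached, and
the container is assumed to run with bpffs mounted and enough privileges
(e.g. CAP_BPF/CAP_SYS_ADMIN) to load programs.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	pod := corev1.Pod{
		TypeMeta:   metav1.TypeMeta{Kind: "Pod", APIVersion: "v1"},
		ObjectMeta: metav1.ObjectMeta{Name: "thp-bpf-demo"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "app",
				Image: "example.com/app:latest", // placeholder image
				Lifecycle: &corev1.Lifecycle{
					// postStart runs right after the container is created;
					// here it loads a (hypothetical) THP policy object and
					// pins it on bpffs so it outlives the hook process.
					PostStart: &corev1.LifecycleHandler{
						Exec: &corev1.ExecAction{
							Command: []string{
								"/bin/sh", "-c",
								"bpftool prog load /opt/bpf/thp_policy.bpf.o /sys/fs/bpf/thp_policy",
							},
						},
					},
				},
			}},
		},
	}

	// Emit the equivalent YAML manifest, i.e. what would normally be
	// handed to kubectl or generated by a controller.
	out, err := yaml.Marshal(&pod)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}

The same thing can of course be written directly as a YAML manifest under
lifecycle.postStart.exec; the Go form is used here only because it can be
compiled and checked against the published API.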