On Fri, May 2, 2025 at 9:04 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 02.05.25 14:18, Yafang Shao wrote:
> > On Fri, May 2, 2025 at 8:00 PM Zi Yan <ziy@xxxxxxxxxx> wrote:
> >>
> >> On 2 May 2025, at 1:48, Yafang Shao wrote:
> >>
> >>> On Fri, May 2, 2025 at 3:36 AM Gutierrez Asier
> >>> <gutierrez.asier@xxxxxxxxxxxxxxxxxxx> wrote:
> >>>>
> >>>>
> >>>> On 4/30/2025 8:53 PM, Zi Yan wrote:
> >>>>> On 30 Apr 2025, at 13:45, Johannes Weiner wrote:
> >>>>>
> >>>>>> On Thu, May 01, 2025 at 12:06:31AM +0800, Yafang Shao wrote:
> >>>>>>>>>> If it isn't, can you state why?
> >>>>>>>>>>
> >>>>>>>>>> The main difference is that you are saying it's in a container that you
> >>>>>>>>>> don't control. Your plan is to violate the control the internal
> >>>>>>>>>> applications have over THP because you know better. I'm not sure how
> >>>>>>>>>> people might feel about you messing with workloads,
> >>>>>>>>>
> >>>>>>>>> It’s not a mess. They have the option to deploy their services on
> >>>>>>>>> dedicated servers, but they would need to pay more for that choice.
> >>>>>>>>> This is a two-way decision.
> >>>>>>>>
> >>>>>>>> This implies you want a container-level way of controlling the setting
> >>>>>>>> and not a system service-level?
> >>>>>>>
> >>>>>>> Right. We want to control the THP per container.
> >>>>>>
> >>>>>> This does strike me as a reasonable usecase.
> >>>>>>
> >>>>>> I think there is consensus that in the long-term we want this stuff to
> >>>>>> just work and truly be transparent to userspace.
> >>>>>>
> >>>>>> In the short-to-medium term, however, there are still quite a few
> >>>>>> caveats. thp=always can significantly increase the memory footprint of
> >>>>>> sparse virtual regions. Huge allocations are not as cheap and reliable
> >>>>>> as we would like them to be, which for real production systems means
> >>>>>> having to make workload-specifcic choices and tradeoffs.
> >>>>>>
> >>>>>> There is ongoing work in these areas, but we do have a bit of a
> >>>>>> chicken-and-egg problem: on the one hand, huge page adoption is slow
> >>>>>> due to limitations in how they can be deployed. For example, we can't
> >>>>>> do thp=always on a DC node that runs arbitary combinations of jobs
> >>>>>> from a wide array of services. Some might benefit, some might hurt.
> >>>>>>
> >>>>>> Yet, it's much easier to improve the kernel based on exactly such
> >>>>>> production experience and data from real-world usecases. We can't
> >>>>>> improve the THP shrinker if we can't run THP.
> >>>>>>
> >>>>>> So I don't see it as overriding whoever wrote the software running
> >>>>>> inside the container. They don't know, and they shouldn't have to care
> >>>>>> about page sizes. It's about letting admins and kernel teams get
> >>>>>> started on using and experimenting with this stuff, given the very
> >>>>>> real constraints right now, so we can get the feedback necessary to
> >>>>>> improve the situation.
> >>>>>
> >>>>> Since you think it is reasonable to control THP at container-level,
> >>>>> namely per-cgroup. Should we reconsider cgroup-based THP control[1]?
> >>>>> (Asier cc'd)
> >>>>>
> >>>>> In this patchset, Yafang uses BPF to adjust THP global configs based
> >>>>> on VMA, which does not look a good approach to me. WDYT?
> >>>>>
> >>>>>
> >>>>> [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@xxxxxxxxxxxxxxxxxxx/
> >>>>>
> >>>>> --
> >>>>> Best Regards,
> >>>>> Yan, Zi
> >>>>
> >>>> Hi,
> >>>>
> >>>> I believe cgroup is a better approach for containers, since this
> >>>> approach can be easily integrated with the user space stack like
> >>>> containerd and kubernets, which use cgroup to control system resources.
> >>>
> >>> The integration of BPF with containerd and Kubernetes is emerging as a
> >>> clear trend.
> >>>
> >>>>
> >>>> However, I pointed out earlier, the approach I suggested has some
> >>>> flaws:
> >>>> 1. Potential polution of cgroup with a big number of knobs
> >>>
> >>> Right, the memcg maintainers once told me that introducing a new
> >>> cgroup file means committing to maintaining it indefinitely, as these
> >>> interface files are treated as part of the ABI.
> >>> In contrast, BPF kfuncs are considered an unstable API, giving you the
> >>> flexibility to modify them later if needed.
> >>>
> >>>> 2. Requires configuration by the admin
> >>>>
> >>>> Ideally, as Matthew W. mentioned, there should be an automatic system.
> >>>
> >>> Take Matthew’s XFS large folio feature as an example—it was enabled
> >>> automatically. A few years ago, when we upgraded to the 6.1.y stable
> >>> kernel, we noticed this new feature. Since it was enabled by default,
> >>> we assumed the author was confident in its stability. Unfortunately,
> >>> it led to severe issues in our production environment: servers crashed
> >>> randomly, and in some cases, we experienced data loss without
> >>> understanding the root cause.
> >>>
> >>> We began disabling various kernel configurations in an attempt to
> >>> isolate the issue, and eventually, the problem disappeared after
> >>> disabling CONFIG_TRANSPARENT_HUGEPAGE. As a result, we released a new
> >>> kernel version with THP disabled and had to restart hundreds of
> >>> thousands of production servers. It was a nightmare for both us and
> >>> our sysadmins.
> >>>
> >>> Last year, we discovered that the initial issue had been resolved by this patch:
> >>> https://lore.kernel.org/stable/20241001210625.95825-1-ryncsn@xxxxxxxxx/.
> >>> We backported the fix and re-enabled XFS large folios—only to face a
> >>> new nightmare. One of our services began crashing sporadically with
> >>> core dumps. It took us several months to trace the issue back to the
> >>> re-enabled XFS large folio feature. Fortunately, we were able to
> >>> disable it using livepatch, avoiding another round of mass server
> >>> restarts. To this day, the root cause remains unknown. The good news
> >>> is that the issue appears to be resolved in the 6.12.y stable kernel.
> >>> We're still trying to bisect which commit fixed it, though progress is
> >>> slow because the issue is not reliably reproducible.
> >>
> >> This is a very wrong attitude towards open source projects. You sounded
> >> like, whether intended or not, Linux community should provide issue-free
> >> kernels and is responsible for fixing all issues. But that is wrong.
> >> Since you are using the kernel, you could help improve it like Kairong
> >> is doing instead of waiting for others to fix the issue.
> >>
> >>>
> >>> In theory, new features should be enabled automatically. But in
> >>> practice, every new feature should come with a tunable knob. That’s a
> >>> lesson we learned the hard way from this experience—and perhaps
> >>> Matthew did too.
> >>
> >> That means new features will not get enough testing.
> >> People like you will just simply disable all new features and wait
> >> for they are stable. It would never come without testing and bug fixes.
>

Hello David,

Thanks for your reply.

> We do have the concept of EXPERIMENTAL kernel configs, that are either
> expected get removed completely ("always enabled") or get turned into
> actual long-term kernel options. But yeah, it's always tricky what we
> actually want to put behind such options.
>
> I mean, READ_ONLY_THP_FOR_FS is still around and still EXPERIMENTAL ...

READ_ONLY_THP_FOR_FS is not enabled in our 6.1 kernel, as we are cautious
about enabling any EXPERIMENTAL feature. XFS large folio support operates
independently of READ_ONLY_THP_FOR_FS; however, it is enabled automatically
whenever CONFIG_TRANSPARENT_HUGEPAGE is set, as can be seen from
mapping_large_folio_support() in the 6.1.y stable kernel (see the P.S. below).

>
> Distro kernels are usually very careful about what to backport and what
> to support. Once we (working for a distro) do backport + test, we
> usually find some additional things that upstream hasn't spotted yet: in
> particular, because some workloads are only run in that form on distro
> kernels. We also ran into some issues with large folios (e.g., me
> personally with s390x KVM guests) and trying our best to fix them.

We have also worked on this. As you may recall, I previously fixed a large
folio bug, which was merged into the 6.1.y stable kernel [0].

[0] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-6.1.y&id=a3f8ee15228c89ce3713ee7e8e82f6d8a13fdb4b

>
> It can be quite time consuming, so I can understand that not everybody
> has the time to invest into heavy debugging, especially if it's
> extremely hard to reproduce (or even corrupts data :( ).

Correct. When the vmcore is incomplete, it is nearly impossible to determine
the root cause reliably; in that case the best we can do is isolate the issue
as quickly as possible.

>
> I agree that adding a toggle after the effects to work around issues is
> not the right approach. Introducing a EXPERIMENTAL toggle early because
> one suspects complicated interactions in a different story. It's
> absolutely not trivial to make that decision.

In this patchset, we are not introducing a toggle as a workaround. Rather,
it reflects the fact that some workloads benefit from THP while others are
hurt by it, so it makes sense to enable THP selectively, based on workload
characteristics.

>
> > Pardon me?
> > This discussion has taken such an unexpected turn that I don’t feel
> > the need to explain what I’ve contributed to the Linux community over
> > the past few years.
>
> I'm sure Zi Yan didn't mean to insult you. I would have phrased it as:
>
> "It's difficult to decide which toggles make sense. There is a fine line
> between adding a toggle and not getting people actually testing it to
> stabilize it vs. not adding a toggle and forcing people to test it and
> fix it/report issues."
>
> Ideally, we'd find most issue in the RC phase or at least shortly after.
>
> You've been active in the kernel for a long time, please don't feel like
> the community is not appreciating that.

Thank you for the clarification. I truly appreciate your patience and
thoroughness.

--
Regards
Yafang
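
P.S. Since I referred to mapping_large_folio_support() above, this is
roughly what the helper looks like in the 6.1.y stable kernel. I am quoting
it from memory, so please treat it as a sketch rather than the exact code.
If I read the XFS side correctly, xfs_setup_inode() opts its mappings in via
mapping_set_large_folios(), which leaves CONFIG_TRANSPARENT_HUGEPAGE as the
only gate:

	/* include/linux/pagemap.h (6.1.y, approximate) */
	static inline bool mapping_large_folio_support(struct address_space *mapping)
	{
		/*
		 * No runtime knob here: only the compile-time config option
		 * and the per-mapping flag set by the filesystem.
		 */
		return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
			test_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
	}

That is why, once CONFIG_TRANSPARENT_HUGEPAGE was set, the only way for us
to turn the feature off at runtime was a livepatch.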