On 31 Jul 2025, at 8:27, Usama Arif wrote: > From: David Hildenbrand <david@xxxxxxxxxx> > > People want to make use of more THPs, for example, moving from > the "never" system policy to "madvise", or from "madvise" to "always". > > While this is great news for every THP desperately waiting to get > allocated out there, apparently there are some workloads that require a > bit of care during that transition: individual processes may need to > opt-out from this behavior for various reasons, and this should be > permitted without needing to make all other workloads on the system > similarly opt-out. > > The following scenarios are imaginable: > > (1) Switch from "none" system policy to "madvise"/"always", but keep THPs > disabled for selected workloads. > > (2) Stay at "none" system policy, but enable THPs for selected > workloads, making only these workloads use the "madvise" or "always" > policy. > > (3) Switch from "madvise" system policy to "always", but keep the > "madvise" policy for selected workloads: allocate THPs only when > advised. > > (4) Stay at "madvise" system policy, but enable THPs even when not advised > for selected workloads -- "always" policy. > > Once can emulate (2) through (1), by setting the system policy to > "madvise"/"always" while disabling THPs for all processes that don't want > THPs. It requires configuring all workloads, but that is a user-space > problem to sort out. > > (4) can be emulated through (3) in a similar way. > > Back when (1) was relevant in the past, as people started enabling THPs, > we added PR_SET_THP_DISABLE, so relevant workloads that were not ready > yet (i.e., used by Redis) were able to just disable THPs completely. Redis > still implements the option to use this interface to disable THPs > completely. > > With PR_SET_THP_DISABLE, we added a way to force-disable THPs for a > workload -- a process, including fork+exec'ed process hierarchy. > That essentially made us support (1): simply disable THPs for all workloads > that are not ready for THPs yet, while still enabling THPs system-wide. > > The quest for handling (3) and (4) started, but current approaches > (completely new prctl, options to set other policies per process, > alternatives to prctl -- mctrl, cgroup handling) don't look particularly > promising. Likely, the future will use bpf or something similar to > implement better policies, in particular to also make better decisions > about THP sizes to use, but this will certainly take a while as that work > just started. > > Long story short: a simple enable/disable is not really suitable for the > future, so we're not willing to add completely new toggles. > > While we could emulate (3)+(4) through (1)+(2) by simply disabling THPs > completely for these processes, this is a step backwards, because these > processes can no longer allocate THPs in regions where THPs were > explicitly advised: regions flagged as VM_HUGEPAGE. Apparently, that > imposes a problem for relevant workloads, because "not THPs" is certainly > worse than "THPs only when advised". > > Could we simply relax PR_SET_THP_DISABLE, to "disable THPs unless not > explicitly advised by the app through MAD_HUGEPAGE"? *maybe*, but this > would change the documented semantics quite a bit, and the versatility > to use it for debugging purposes, so I am not 100% sure that is what we > want -- although it would certainly be much easier. > > So instead, as an easy way forward for (3) and (4), add an option to > make PR_SET_THP_DISABLE disable *less* THPs for a process. > > In essence, this patch: > > (A) Adds PR_THP_DISABLE_EXCEPT_ADVISED, to be used as a flag in arg3 > of prctl(PR_SET_THP_DISABLE) when disabling THPs (arg2 != 0). > > prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED). > > (B) Makes prctl(PR_GET_THP_DISABLE) return 3 if > PR_THP_DISABLE_EXCEPT_ADVISED was set while disabling. > > Previously, it would return 1 if THPs were disabled completely. Now > it returns the set flags as well: 3 if PR_THP_DISABLE_EXCEPT_ADVISED > was set. > > (C) Renames MMF_DISABLE_THP to MMF_DISABLE_THP_COMPLETELY, to express > the semantics clearly. > > Fortunately, there are only two instances outside of prctl() code. > > (D) Adds MMF_DISABLE_THP_EXCEPT_ADVISED to express "no THP except for VMAs > with VM_HUGEPAGE" -- essentially "thp=madvise" behavior > > Fortunately, we only have to extend vma_thp_disabled(). > > (E) Indicates "THP_enabled: 0" in /proc/pid/status only if THPs are > disabled completely > > Only indicating that THPs are disabled when they are really disabled > completely, not only partially. > > For now, we don't add another interface to obtained whether THPs > are disabled partially (PR_THP_DISABLE_EXCEPT_ADVISED was set). If > ever required, we could add a new entry. > > The documented semantics in the man page for PR_SET_THP_DISABLE > "is inherited by a child created via fork(2) and is preserved across > execve(2)" is maintained. This behavior, for example, allows for > disabling THPs for a workload through the launching process (e.g., > systemd where we fork() a helper process to then exec()). > > For now, MADV_COLLAPSE will *fail* in regions without VM_HUGEPAGE and > VM_NOHUGEPAGE. As MADV_COLLAPSE is a clear advise that user space > thinks a THP is a good idea, we'll enable that separately next > (requiring a bit of cleanup first). > > There is currently not way to prevent that a process will not issue > PR_SET_THP_DISABLE itself to re-enable THP. There are not really known > users for re-enabling it, and it's against the purpose of the original > interface. So if ever required, we could investigate just forbidding to > re-enable them, or make this somehow configurable. > > Acked-by: Usama Arif <usamaarif642@xxxxxxxxx> > Tested-by: Usama Arif <usamaarif642@xxxxxxxxx> > Cc: Jonathan Corbet <corbet@xxxxxxx> > Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> > Cc: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> > Cc: Zi Yan <ziy@xxxxxxxxxx> > Cc: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx> > Cc: "Liam R. Howlett" <Liam.Howlett@xxxxxxxxxx> > Cc: Nico Pache <npache@xxxxxxxxxx> > Cc: Ryan Roberts <ryan.roberts@xxxxxxx> > Cc: Dev Jain <dev.jain@xxxxxxx> > Cc: Barry Song <baohua@xxxxxxxxxx> > Cc: Vlastimil Babka <vbabka@xxxxxxx> > Cc: Mike Rapoport <rppt@xxxxxxxxxx> > Cc: Suren Baghdasaryan <surenb@xxxxxxxxxx> > Cc: Michal Hocko <mhocko@xxxxxxxx> > Cc: Usama Arif <usamaarif642@xxxxxxxxx> > Cc: SeongJae Park <sj@xxxxxxxxxx> > Cc: Jann Horn <jannh@xxxxxxxxxx> > Cc: Liam R. Howlett <Liam.Howlett@xxxxxxxxxx> > Cc: Yafang Shao <laoar.shao@xxxxxxxxx> > Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx> > Signed-off-by: David Hildenbrand <david@xxxxxxxxxx> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> > > --- > > At first, I thought of "why not simply relax PR_SET_THP_DISABLE", but I > think there might be real use cases where we want to disable any THPs -- > in particular also around debugging THP-related problems, and > "never" not meaning ... "never" anymore ever since we add MADV_COLLAPSE. > PR_SET_THP_DISABLE will also block MADV_COLLAPSE, which can be very > helpful for debugging purposes. Of course, I thought of having a > system-wide config option to modify PR_SET_THP_DISABLE behavior, but > I just don't like the semantics. > > "prctl: allow overriding system THP policy to always"[1] proposed > "overriding policies to always", which is just the wrong way around: we > should not add mechanisms to "enable more" when we already have an > interface/mechanism to "disable" them (PR_SET_THP_DISABLE). It all gets > weird otherwise. > > "[PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY"[2] proposed > setting the default of the VM_HUGEPAGE, which is similarly the wrong way > around I think now. > > The ideas explored by Lorenzo to extend process_madvise()[3] and mctrl()[4] > similarly were around the "default for VM_HUGEPAGE" idea, but after the > discussion, I think we should better leave VM_HUGEPAGE untouched. > > Happy to hear naming suggestions for "PR_THP_DISABLE_EXCEPT_ADVISED" where > we essentially want to say "leave advised regions alone" -- "keep THP > enabled for advised regions", > > The only thing I really dislike about this is using another MMF_* flag, > but well, no way around it -- and seems like we could easily support > more than 32 if we want to (most users already treat it like a proper > bitmap). > > I think this here (modifying an existing toggle) is the only prctl() > extension that we might be willing to accept. In general, I agree like > most others, that prctl() is a very bad interface for that -- but > PR_SET_THP_DISABLE is already there and is getting used. > > Long-term, I think the answer will be something based on bpf[5]. Maybe > in that context, I there could still be value in easily disabling THPs for > selected workloads (esp. debugging purposes). > > Jann raised valid concerns[6] about new flags that are persistent across > exec[6]. As this here is a relaxation to existing PR_SET_THP_DISABLE I > consider it having a similar security risk as our existing > PR_SET_THP_DISABLE, but devil is in the detail. > > [1] https://lore.kernel.org/r/20250507141132.2773275-1-usamaarif642@xxxxxxxxx > [2] https://lkml.kernel.org/r/20250515133519.2779639-2-usamaarif642@xxxxxxxxx > [3] https://lore.kernel.org/r/cover.1747686021.git.lorenzo.stoakes@xxxxxxxxxx > [4] https://lkml.kernel.org/r/85778a76-7dc8-4ea8-8827-acb45f74ee05@lucifer.local > [5] https://lkml.kernel.org/r/20250608073516.22415-1-laoar.shao@xxxxxxxxx > [6] https://lore.kernel.org/r/CAG48ez3-7EnBVEjpdoW7z5K0hX41nLQN5Wb65Vg-1p8DdXRnjg@xxxxxxxxxxxxxx > > Signed-off-by: David Hildenbrand <david@xxxxxxxxxx> > --- > Documentation/filesystems/proc.rst | 5 ++- > fs/proc/array.c | 2 +- > include/linux/huge_mm.h | 20 +++++++--- > include/linux/mm_types.h | 13 +++---- > include/uapi/linux/prctl.h | 10 +++++ > kernel/sys.c | 59 ++++++++++++++++++++++++------ > mm/khugepaged.c | 2 +- > 7 files changed, 82 insertions(+), 29 deletions(-) > The changes look good to me. Acked-by: Zi Yan <ziy@xxxxxxxxxx> Best Regards, Yan, Zi