On 21/07/2025 10:09, David Hildenbrand wrote: > People want to make use of more THPs, for example, moving from > THP=never to THP=madvise, or from THP=madvise to THP=never. > > While this is great news for every THP desperately waiting to get > allocated out there, apparently there are some workloads that require a > bit of care during that transition: once problems are detected, these > workloads should be started with the old behavior, without making all > other workloads on the system go back to the old behavior as well. > > In essence, the following scenarios are imaginable: > > (1) Switch from THP=none to THP=madvise or THP=always, but keep the old > behavior (no THP) for selected workloads. > > (2) Stay at THP=none, but have "madvise" or "always" behavior for > selected workloads. > > (3) Switch from THP=madvise to THP=always, but keep the old behavior > (THP only when advised) for selected workloads. > > (4) Stay at THP=madvise, but have "always" behavior for selected > workloads. > > In essence, (2) can be emulated through (1), by setting THP!=none while > disabling THPs for all processes that don't want THPs. It requires > configuring all workloads, but that is a user-space problem to sort out. > > (4) can be emulated through (3) in a similar way. > > Back when (1) was relevant in the past, as people started enabling THPs, > we added PR_SET_THP_DISABLE, so relevant workloads that were not ready > yet (i.e., used by Redis) were able to just disable THPs completely. Redis > still implements the option to use this interface to disable THPs > completely. > > With PR_SET_THP_DISABLE, we added a way to force-disable THPs for a > workload -- a process, including fork+exec'ed process hierarchy. > That essentially made us support (1): simply disable THPs for all workloads > that are not ready for THPs yet, while still enabling THPs system-wide. > > The quest for handling (3) and (4) started, but current approaches > (completely new prctl, options to set other policies per processm, > alternatives to prctl -- mctrl, cgroup handling) don't look particularly > promising. Likely, the future will use bpf or something similar to > implement better policies, in particular to also make better decisions > about THP sizes to use, but this will certainly take a while as that work > just started. > > Long story short: a simple enable/disable is not really suitable for the > future, so we're not willing to add completely new toggles. > > While we could emulate (3)+(4) through (1)+(2) by simply disabling THPs > completely for these processes, this scares many THPs in our system > because they could no longer get allocated where they used to be allocated > for: regions flagged as VM_HUGEPAGE. Apparently, that imposes a > problem for relevant workloads, because "not THPs" is certainly worse > than "THPs only when advised". > > Could we simply relax PR_SET_THP_DISABLE, to "disable THPs unless not > explicitly advised by the app through MAD_HUGEPAGE"? *maybe*, but this > would change the documented semantics quite a bit, and the versatility > to use it for debugging purposes, so I am not 100% sure that is what we > want -- although it would certainly be much easier. > > So instead, as an easy way forward for (3) and (4), an option to > make PR_SET_THP_DISABLE disable *less* THPs for a process. > > In essence, this patch: > > (A) Adds PR_THP_DISABLE_EXCEPT_ADVISED, to be used as a flag in arg3 > of prctl(PR_SET_THP_DISABLE) when disabling THPs (arg2 != 0). > > For now, arg3 was not allowed to be set (-EINVAL). Now it holds > flags. > > (B) Makes prctl(PR_GET_THP_DISABLE) return 3 if > PR_THP_DISABLE_EXCEPT_ADVISED was set while disabling. > > For now, it would return 1 if THPs were disabled completely. Now > it essentially returns the set flags as well. > No strong opinion, but maybe we have it return 2 (i.e. bit 1 set)? I know that you are returning bit 1 set to indicate the flag, and I know that everyone dislikes prctl so its likely no more flags will be added :), but in the off chance there are extra flags, than it can make the return value weird? If instead we return a value with only a single bit set, might be better? Again, no strong opinion here. > (C) Renames MMF_DISABLE_THP to MMF_DISABLE_THP_COMPLETELY, to express > the semantics clearly. > > Fortunately, there are only two instances outside of prctl() code. > > (D) Adds MMF_DISABLE_THP_EXCEPT_ADVISED to express "no THP except for VMAs > with VM_HUGEPAGE" -- essentially "thp=madvise" behavior > > Fortunately, we only have to extend vma_thp_disabled(). > > (E) Indicates "THP_enabled: 0" in /proc/pid/status only if THPs are not > disabled completely > > Only indicating that THPs are disabled when they are really disabled > completely, not only partially. > > The documented semantics in the man page for PR_SET_THP_DISABLE > "is inherited by a child created via fork(2) and is preserved across > execve(2)" is maintained. This behavior, for example, allows for > disabling THPs for a workload through the launching process (e.g., > systemd where we fork() a helper process to then exec()). > > There is currently not way to prevent that a process will not issue > PR_SET_THP_DISABLE itself to re-enable THP. We could add a "seal" option > to PR_SET_THP_DISABLE through another flag if ever required. The known > users (such as redis) really use PR_SET_THP_DISABLE to disable THPs, so > that is not added for now. > > Cc: Jonathan Corbet <corbet@xxxxxxx> > Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> > Cc: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> > Cc: Zi Yan <ziy@xxxxxxxxxx> > Cc: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx> > Cc: "Liam R. Howlett" <Liam.Howlett@xxxxxxxxxx> > Cc: Nico Pache <npache@xxxxxxxxxx> > Cc: Ryan Roberts <ryan.roberts@xxxxxxx> > Cc: Dev Jain <dev.jain@xxxxxxx> > Cc: Barry Song <baohua@xxxxxxxxxx> > Cc: Vlastimil Babka <vbabka@xxxxxxx> > Cc: Mike Rapoport <rppt@xxxxxxxxxx> > Cc: Suren Baghdasaryan <surenb@xxxxxxxxxx> > Cc: Michal Hocko <mhocko@xxxxxxxx> > Cc: Usama Arif <usamaarif642@xxxxxxxxx> > Cc: SeongJae Park <sj@xxxxxxxxxx> > Cc: Jann Horn <jannh@xxxxxxxxxx> > Cc: Liam R. Howlett <Liam.Howlett@xxxxxxxxxx> > Cc: Yafang Shao <laoar.shao@xxxxxxxxx> > Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx> > Signed-off-by: David Hildenbrand <david@xxxxxxxxxx> > > --- > > At first, I thought of "why not simply relax PR_SET_THP_DISABLE", but I > think there might be real use cases where we want to disable any THPs -- > in particular also around debugging THP-related problems, and > "THP=never" not meaning ... "never" anymore. PR_SET_THP_DISABLE will > also block MADV_COLLAPSE, which can be very helpful. Of course, I thought > of having a system-wide config to change PR_SET_THP_DISABLE behavior, but > I just don't like the semantics. > > "prctl: allow overriding system THP policy to always"[1] proposed > "overriding policies to always", which is just the wrong way around: we > should not add mechanisms to "enable more" when we already have an > interface/mechanism to "disable" them (PR_SET_THP_DISABLE). It all gets > weird otherwise. > > "[PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY"[2] proposed > setting the default of the VM_HUGEPAGE, which is similarly the wrong way > around I think now. > > The proposals by Lorenzo to extend process_madvise()[3] and mctrl()[4] > similarly were around the "default for VM_HUGEPAGE" idea, but after the > discussion, I think we should better leave VM_HUGEPAGE untouched. > > Happy to hear naming suggestions for "PR_THP_DISABLE_EXCEPT_ADVISED" where > we essentially want to say "leave advised regions alone" -- "keep THP > enabled for advised regions", > > The only thing I really dislike about this is using another MMF_* flag, > but well, no way around it -- and seems like we could easily support > more than 32 if we want to, or storing this thp information elsewhere. > > I think this here (modifying an existing toggle) is the only prctl() > extension that we might be willing to accept. In general, I agree like > most others, that prctl() is a very bad interface for that -- but > PR_SET_THP_DISABLE is already there and is getting used. > > Long-term, I think the answer will be something based on bpf[5]. Maybe > in that context, I there could still be value in easily disabling THPs for > selected workloads (esp. debugging purposes). > > Jann raised valid concerns[6] about new flags that are persistent across > exec[6]. As this here is a relaxation to existing PR_SET_THP_DISABLE I > consider it having a similar security risk as our existing > PR_SET_THP_DISABLE, but devil is in the detail. > > This is *completely* untested and might be utterly broken. It merely > serves as a PoC of what I think could be done. If this ever goes upstream, > we need some kselftests for it, and extensive tests. > > [1] https://lore.kernel.org/r/20250507141132.2773275-1-usamaarif642@xxxxxxxxx > [2] https://lkml.kernel.org/r/20250515133519.2779639-2-usamaarif642@xxxxxxxxx > [3] https://lore.kernel.org/r/cover.1747686021.git.lorenzo.stoakes@xxxxxxxxxx > [4] https://lkml.kernel.org/r/85778a76-7dc8-4ea8-8827-acb45f74ee05@lucifer.local > [5] https://lkml.kernel.org/r/20250608073516.22415-1-laoar.shao@xxxxxxxxx > [6] https://lore.kernel.org/r/CAG48ez3-7EnBVEjpdoW7z5K0hX41nLQN5Wb65Vg-1p8DdXRnjg@xxxxxxxxxxxxxx > > --- > Documentation/filesystems/proc.rst | 5 +-- > fs/proc/array.c | 2 +- > include/linux/huge_mm.h | 20 ++++++++--- > include/linux/mm_types.h | 13 +++---- > include/uapi/linux/prctl.h | 7 ++++ > kernel/sys.c | 58 +++++++++++++++++++++++------- > mm/khugepaged.c | 2 +- > 7 files changed, 78 insertions(+), 29 deletions(-) > > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst > index 2971551b72353..915a3e44bc120 100644 > --- a/Documentation/filesystems/proc.rst > +++ b/Documentation/filesystems/proc.rst > @@ -291,8 +291,9 @@ It's slow but very precise. > HugetlbPages size of hugetlb memory portions > CoreDumping process's memory is currently being dumped > (killing the process may lead to a corrupted core) > - THP_enabled process is allowed to use THP (returns 0 when > - PR_SET_THP_DISABLE is set on the process > + THP_enabled process is allowed to use THP (returns 0 when > + PR_SET_THP_DISABLE is set on the process to disable > + THP completely, not just partially) > Threads number of threads > SigQ number of signals queued/max. number for queue > SigPnd bitmap of pending signals for the thread > diff --git a/fs/proc/array.c b/fs/proc/array.c > index d6a0369caa931..c4f91a784104f 100644 > --- a/fs/proc/array.c > +++ b/fs/proc/array.c > @@ -422,7 +422,7 @@ static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm) > bool thp_enabled = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE); > > if (thp_enabled) > - thp_enabled = !test_bit(MMF_DISABLE_THP, &mm->flags); > + thp_enabled = !test_bit(MMF_DISABLE_THP_COMPLETELY, &mm->flags); > seq_printf(m, "THP_enabled:\t%d\n", thp_enabled); > } > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > index e0a27f80f390d..c4127104d9bc3 100644 > --- a/include/linux/huge_mm.h > +++ b/include/linux/huge_mm.h > @@ -323,16 +323,26 @@ struct thpsize { > (transparent_hugepage_flags & \ > (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG)) > > +/* > + * Check whether THPs are explicitly disabled through madvise or prctl, or some > + * architectures may disable THP for some mappings, for example, s390 kvm. > + */ > static inline bool vma_thp_disabled(struct vm_area_struct *vma, > vm_flags_t vm_flags) > { > + /* Are THPs disabled for this VMA? */ > + if (vm_flags & VM_NOHUGEPAGE) > + return true; > + /* Are THPs disabled for all VMAs in the whole process? */ > + if (test_bit(MMF_DISABLE_THP_COMPLETELY, &vma->vm_mm->flags)) > + return true; > /* > - * Explicitly disabled through madvise or prctl, or some > - * architectures may disable THP for some mappings, for > - * example, s390 kvm. > + * Are THPs disabled only for VMAs where we didn't get an explicit > + * advise to use them? > */ > - return (vm_flags & VM_NOHUGEPAGE) || > - test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags); > + if (vm_flags & VM_HUGEPAGE) > + return false; > + return test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, &vma->vm_mm->flags); > } > > static inline bool thp_disabled_by_hw(void) > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index 1ec273b066915..a999f2d352648 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -1743,19 +1743,16 @@ enum { > #define MMF_VM_MERGEABLE 16 /* KSM may merge identical pages */ > #define MMF_VM_HUGEPAGE 17 /* set when mm is available for khugepaged */ > > -/* > - * This one-shot flag is dropped due to necessity of changing exe once again > - * on NFS restore > - */ > -//#define MMF_EXE_FILE_CHANGED 18 /* see prctl_set_mm_exe_file() */ > +#define MMF_HUGE_ZERO_PAGE 18 /* mm has ever used the global huge zero page */ > > #define MMF_HAS_UPROBES 19 /* has uprobes */ > #define MMF_RECALC_UPROBES 20 /* MMF_HAS_UPROBES can be wrong */ > #define MMF_OOM_SKIP 21 /* mm is of no interest for the OOM killer */ > #define MMF_UNSTABLE 22 /* mm is unstable for copy_from_user */ > -#define MMF_HUGE_ZERO_PAGE 23 /* mm has ever used the global huge zero page */ > -#define MMF_DISABLE_THP 24 /* disable THP for all VMAs */ > -#define MMF_DISABLE_THP_MASK (1 << MMF_DISABLE_THP) > +#define MMF_DISABLE_THP_EXCEPT_ADVISED 23 /* no THP except for VMAs with VM_HUGEPAGE */ > +#define MMF_DISABLE_THP_COMPLETELY 24 /* no THP for all VMAs */ > +#define MMF_DISABLE_THP_MASK ((1 << MMF_DISABLE_THP_COMPLETELY) |\ > + (1 << MMF_DISABLE_THP_EXCEPT_ADVISED)) > #define MMF_OOM_REAP_QUEUED 25 /* mm was queued for oom_reaper */ > #define MMF_MULTIPROCESS 26 /* mm is shared between processes */ > /* > diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h > index 43dec6eed559a..1949bb9270d48 100644 > --- a/include/uapi/linux/prctl.h > +++ b/include/uapi/linux/prctl.h > @@ -177,7 +177,14 @@ struct prctl_mm_map { > > #define PR_GET_TID_ADDRESS 40 > > +/* > + * Flags for PR_SET_THP_DISABLE are only applicable when disabling. Bit 0 > + * is reserved, so PR_GET_THP_DISABLE can return 1 when no other flags were > + * specified for PR_SET_THP_DISABLE. > + */ > #define PR_SET_THP_DISABLE 41 > +/* Don't disable THPs when explicitly advised (MADV_HUGEPAGE / VM_HUGEPAGE). */ > +# define PR_THP_DISABLE_EXCEPT_ADVISED (1 << 1) > #define PR_GET_THP_DISABLE 42 > > /* > diff --git a/kernel/sys.c b/kernel/sys.c > index b153fb345ada2..2a34b2f708900 100644 > --- a/kernel/sys.c > +++ b/kernel/sys.c > @@ -2423,6 +2423,50 @@ static int prctl_get_auxv(void __user *addr, unsigned long len) > return sizeof(mm->saved_auxv); > } > > +static int prctl_get_thp_disable(unsigned long arg2, unsigned long arg3, > + unsigned long arg4, unsigned long arg5) > +{ > + unsigned long *mm_flags = ¤t->mm->flags; > + > + if (arg2 || arg3 || arg4 || arg5) > + return -EINVAL; > + > + if (test_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags)) > + return 1; > + else if (test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags)) > + return 1 | PR_THP_DISABLE_EXCEPT_ADVISED; > + return 0; > +} > + > +static int prctl_set_thp_disable(unsigned long thp_disable, unsigned long flags, > + unsigned long arg4, unsigned long arg5) > +{ > + unsigned long *mm_flags = ¤t->mm->flags; > + > + if (arg4 || arg5) > + return -EINVAL; > + > + /* Flags are only allowed when disabling. */ > + if (!thp_disable || (flags & ~PR_THP_DISABLE_EXCEPT_ADVISED)) I think you meant over here? if (!thp_disable && (flags & PR_THP_DISABLE_EXCEPT_ADVISED)) > + return -EINVAL; > + if (mmap_write_lock_killable(current->mm)) > + return -EINTR; > + if (thp_disable) { > + if (flags & PR_THP_DISABLE_EXCEPT_ADVISED) { > + clear_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags); > + set_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags); > + } else { > + set_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags); > + clear_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags); > + } > + } else { > + clear_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags); > + clear_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags); > + } > + mmap_write_unlock(current->mm); > + return 0; > +} > + > SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, > unsigned long, arg4, unsigned long, arg5) > { > @@ -2596,20 +2640,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, > return -EINVAL; > return task_no_new_privs(current) ? 1 : 0; > case PR_GET_THP_DISABLE: > - if (arg2 || arg3 || arg4 || arg5) > - return -EINVAL; > - error = !!test_bit(MMF_DISABLE_THP, &me->mm->flags); > + error = prctl_get_thp_disable(arg2, arg3, arg4, arg5); > break; > case PR_SET_THP_DISABLE: > - if (arg3 || arg4 || arg5) > - return -EINVAL; > - if (mmap_write_lock_killable(me->mm)) > - return -EINTR; > - if (arg2) > - set_bit(MMF_DISABLE_THP, &me->mm->flags); > - else > - clear_bit(MMF_DISABLE_THP, &me->mm->flags); > - mmap_write_unlock(me->mm); > + error = prctl_set_thp_disable(arg2, arg3, arg4, arg5); > break; > case PR_MPX_ENABLE_MANAGEMENT: > case PR_MPX_DISABLE_MANAGEMENT: > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > index 8a5873d0a23a7..a685077644b4e 100644 > --- a/mm/khugepaged.c > +++ b/mm/khugepaged.c > @@ -427,7 +427,7 @@ static inline int collapse_test_exit(struct mm_struct *mm) > static inline int collapse_test_exit_or_disable(struct mm_struct *mm) > { > return collapse_test_exit(mm) || > - test_bit(MMF_DISABLE_THP, &mm->flags); > + test_bit(MMF_DISABLE_THP_COMPLETELY, &mm->flags); > } > > static bool hugepage_enabled(void) > > base-commit: 760b462b3921c5dc8bfa151d2d27a944e4e96081