On Mon, Jul 21, 2025 at 01:45:18PM +0200, David Hildenbrand wrote: > > > > > > (2) Stay at THP=none, but have "madvise" or "always" behavior for > > > selected workloads. > > > > > > (3) Switch from THP=madvise to THP=always, but keep the old behavior > > > (THP only when advised) for selected workloads. > > > > > > (4) Stay at THP=madvise, but have "always" behavior for selected > > > workloads. > > > > > > In essence, (2) can be emulated through (1), by setting THP!=none while > > > disabling THPs for all processes that don't want THPs. It requires > > > configuring all workloads, but that is a user-space problem to sort out. > > > > NIT: Delete 'In essence' here. > > I wanted "something" there to not make it look like the list keeps going on > in a weird way ;) I mean it's not a big deal :P just have memories of English teachers telling me off for repetition of such phrases... > > > While we could emulate (3)+(4) through (1)+(2) by simply disabling THPs > > > completely for these processes, this scares many THPs in our system > > > because they could no longer get allocated where they used to be allocated > > > for: regions flagged as VM_HUGEPAGE. Apparently, that imposes a > > > problem for relevant workloads, because "not THPs" is certainly worse > > > than "THPs only when advised". > > > > I don't know what you mean by 'scares' many THPs? :P > > They are very afraid of not getting allocated :) Haha they sound cute! > > > > > of having a system-wide config to change PR_SET_THP_DISABLE behavior, but > > > I just don't like the semantics. > > > > What do you mean? > > Kconfig option to change the behavior etc. In summary, I don't want to go > down that path, it all gets weird. Yeah please don't! :>) > > > Happy to hear naming suggestions for "PR_THP_DISABLE_EXCEPT_ADVISED" where > > > we essentially want to say "leave advised regions alone" -- "keep THP > > > enabled for advised regions", > > > > Seems OK to me. I guess the one point of confusion could be people being > > confused between the THP toggle 'madvise' vs. actually having MADV_HUGEPAGE > > set, but that's moot, because 'madvise' mode only enables THP if the region > > has had MADV_HUGEPAGE set. > > Right, whatever ends up setting VM_HUGEPAGE. Yeah this naming is fine iMO. > > > > > > > > > The only thing I really dislike about this is using another MMF_* flag, > > > but well, no way around it -- and seems like we could easily support > > > more than 32 if we want to, or storing this thp information elsewhere. > > > > Yes my exact thoughts. But I will be adding a series to change this for VMA > > flags soon enough, and can potentially do mm flags at the same time... > > > > So this shouldn't in the end be as much of a problem. > > > > Maybe it's worth holding off on this until I've done that? But at any rate > > I intend to do those changes next cycle, and this will be a next cycle > > thing at the earliest anyway. > > I don't think this series must be blocked by that. Using a bitmap instead of > a single "unsigned long" should be fairly easy later -- I did not identify > any big blockers. Yeah that's fine. And I don't know when I will get the bitmap changes done, so let's not block this with that! > > > This is *completely* untested and might be utterly broken. It merely > > > serves as a PoC of what I think could be done. If this ever goes upstream, > > > we need some kselftests for it, and extensive tests. > > > > Well :) I mean we should definitely try this out in anger and it _MUST_ > > have self tests and put under some pressure. > > > > Usama, can you attack this and see? > > Yes, that's what I am hoping for. Cool. And of course Usama is best placed to experiment with this approach, as he can experiment with workloads relevant to this requirement. > > > > > > > > > [1] https://lore.kernel.org/r/20250507141132.2773275-1-usamaarif642@xxxxxxxxx > > > [2] https://lkml.kernel.org/r/20250515133519.2779639-2-usamaarif642@xxxxxxxxx > > > [3] https://lore.kernel.org/r/cover.1747686021.git.lorenzo.stoakes@xxxxxxxxxx > > > [4] https://lkml.kernel.org/r/85778a76-7dc8-4ea8-8827-acb45f74ee05@lucifer.local > > > [5] https://lkml.kernel.org/r/20250608073516.22415-1-laoar.shao@xxxxxxxxx > > > [6] https://lore.kernel.org/r/CAG48ez3-7EnBVEjpdoW7z5K0hX41nLQN5Wb65Vg-1p8DdXRnjg@xxxxxxxxxxxxxx > > > > > > --- > > > Documentation/filesystems/proc.rst | 5 +-- > > > fs/proc/array.c | 2 +- > > > include/linux/huge_mm.h | 20 ++++++++--- > > > include/linux/mm_types.h | 13 +++---- > > > include/uapi/linux/prctl.h | 7 ++++ > > > kernel/sys.c | 58 +++++++++++++++++++++++------- > > > mm/khugepaged.c | 2 +- > > > 7 files changed, 78 insertions(+), 29 deletions(-) > > > > > > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst > > > index 2971551b72353..915a3e44bc120 100644 > > > --- a/Documentation/filesystems/proc.rst > > > +++ b/Documentation/filesystems/proc.rst > > > @@ -291,8 +291,9 @@ It's slow but very precise. > > > HugetlbPages size of hugetlb memory portions > > > CoreDumping process's memory is currently being dumped > > > (killing the process may lead to a corrupted core) > > > - THP_enabled process is allowed to use THP (returns 0 when > > > - PR_SET_THP_DISABLE is set on the process > > > + THP_enabled process is allowed to use THP (returns 0 when > > > + PR_SET_THP_DISABLE is set on the process to disable > > > + THP completely, not just partially) > > > > Hmm but this means we have no way of knowing if it's set for partial > > Yes. I briefly thought about indicating another member, but then I thought > (a) it's ugly and (b) "who cares". > > I also thought about just printing "partial" instead of "1", but not sure if > that would break any parser. Hm and >1 could break a user who expects this to be 0, 1. We can always add a new entry if needed. > > > { > > > + /* Are THPs disabled for this VMA? */ > > > + if (vm_flags & VM_NOHUGEPAGE) > > > + return true; > > > + /* Are THPs disabled for all VMAs in the whole process? */ > > > + if (test_bit(MMF_DISABLE_THP_COMPLETELY, &vma->vm_mm->flags)) > > > + return true; > > > /* > > > - * Explicitly disabled through madvise or prctl, or some > > > - * architectures may disable THP for some mappings, for > > > - * example, s390 kvm. > > > + * Are THPs disabled only for VMAs where we didn't get an explicit > > > + * advise to use them? > > > > Probably fine to drop the rather pernickety reference to s390 kvm here, I > > mean we don't need to spell out massively specific details in a general > > handler. > > No strong opinion. I mean what I'm saying is this is fine :P Got no problem wtih removing this bit of the comment. > > > > > > */ > > > - return (vm_flags & VM_NOHUGEPAGE) || > > > - test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags); > > > + if (vm_flags & VM_HUGEPAGE) > > > + return false; > > > + return test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, &vma->vm_mm->flags); > > > } > > > > > > static inline bool thp_disabled_by_hw(void) > > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > > > index 1ec273b066915..a999f2d352648 100644 > > > --- a/include/linux/mm_types.h > > > +++ b/include/linux/mm_types.h > > > @@ -1743,19 +1743,16 @@ enum { > > > #define MMF_VM_MERGEABLE 16 /* KSM may merge identical pages */ > > > #define MMF_VM_HUGEPAGE 17 /* set when mm is available for khugepaged */ > > > > > > -/* > > > - * This one-shot flag is dropped due to necessity of changing exe once again > > > - * on NFS restore > > > - */ > > > -//#define MMF_EXE_FILE_CHANGED 18 /* see prctl_set_mm_exe_file() */ > > > +#define MMF_HUGE_ZERO_PAGE 18 /* mm has ever used the global huge zero page */ > > > > > > #define MMF_HAS_UPROBES 19 /* has uprobes */ > > > #define MMF_RECALC_UPROBES 20 /* MMF_HAS_UPROBES can be wrong */ > > > #define MMF_OOM_SKIP 21 /* mm is of no interest for the OOM killer */ > > > #define MMF_UNSTABLE 22 /* mm is unstable for copy_from_user */ > > > -#define MMF_HUGE_ZERO_PAGE 23 /* mm has ever used the global huge zero page */ > > > -#define MMF_DISABLE_THP 24 /* disable THP for all VMAs */ > > > -#define MMF_DISABLE_THP_MASK (1 << MMF_DISABLE_THP) > > > +#define MMF_DISABLE_THP_EXCEPT_ADVISED 23 /* no THP except for VMAs with VM_HUGEPAGE */ > > > +#define MMF_DISABLE_THP_COMPLETELY 24 /* no THP for all VMAs */ > > > +#define MMF_DISABLE_THP_MASK ((1 << MMF_DISABLE_THP_COMPLETELY) |\ > > > + (1 << MMF_DISABLE_THP_EXCEPT_ADVISED)) > > > > It feels a bit sigh to have to use up low-supply mm flags for this. But > > again, I should be attacking this shortage soon enough. > > > > Are we not going ahead with Barry's series that was going to use one of > > these in the end? > > Whoever gets acked first ;) ;) > > > > > > #define MMF_OOM_REAP_QUEUED 25 /* mm was queued for oom_reaper */ > > > #define MMF_MULTIPROCESS 26 /* mm is shared between processes */ > > > /* > > > diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h > > > index 43dec6eed559a..1949bb9270d48 100644 > > > --- a/include/uapi/linux/prctl.h > > > +++ b/include/uapi/linux/prctl.h > > > @@ -177,7 +177,14 @@ struct prctl_mm_map { > > > > > > #define PR_GET_TID_ADDRESS 40 > > > > > > +/* > > > + * Flags for PR_SET_THP_DISABLE are only applicable when disabling. Bit 0 > > > + * is reserved, so PR_GET_THP_DISABLE can return 1 when no other flags were > > > + * specified for PR_SET_THP_DISABLE. > > > + */ > > > > Probably worth specifying that you're just returning the flags here. > > Yes. > > Thanks! Cheers! > > -- > Cheers, > > David / dhildenb > Cheers, Lorenzo