(cc'ing linux-api)

On Mon, May 19, 2025 at 11:29:52PM +0100, Usama Arif wrote:
> This series allows changing the THP policy of a process according to the
> value set in arg2. Each policy is inherited across fork+exec:
> - PR_DEFAULT_MADV_HUGEPAGE: This will set VM_HUGEPAGE and clear VM_NOHUGEPAGE
>   for the default VMA flags. It will also iterate through every VMA in the
>   process and call hugepage_madvise on it, with MADV_HUGEPAGE policy.
>   This effectively allows setting MADV_HUGEPAGE on the entire process.
>   In an environment where different types of workloads are run on the
>   same machine, this will allow workloads that benefit from always having
>   hugepages to do so, without regressing those that don't.
> - PR_DEFAULT_MADV_NOHUGEPAGE: This will set VM_NOHUGEPAGE and clear VM_HUGEPAGE
>   for the default VMA flags. It will also iterate through every VMA in the
>   process and call hugepage_madvise on it, with MADV_NOHUGEPAGE policy.
>   This effectively allows setting MADV_NOHUGEPAGE on the entire process.
>   In an environment where different types of workloads are run on the
>   same machine, this will allow workloads that benefit from having
>   hugepages on an madvise basis only to do so, without regressing those
>   that benefit from having hugepages always.
> - PR_THP_POLICY_SYSTEM: This will reset (clear) both VM_HUGEPAGE and
>   VM_NOHUGEPAGE for the process default flags.
>
> In hyperscalers, we have a single THP policy for the entire fleet.
> We have different types of workloads (e.g. AI/compute/databases/etc)
> running on a single server.
> Some of these workloads will benefit from always getting THP at fault
> (or collapsed by khugepaged), some of them will benefit by only getting
> them at madvise.
>
> This series is useful for 2 usecases:
> 1) global system policy = madvise, while we want some workloads to get THPs
>    at fault and by khugepaged :- some processes (e.g. AI workloads) benefit
>    from getting THPs at fault (and collapsed by khugepaged). Other workloads
>    like databases will incur a regression (either a performance regression or
>    they are completely memory bound and even a very slight increase in memory
>    will cause them to OOM). So what these patches will do is allow setting
>    prctl(PR_DEFAULT_MADV_HUGEPAGE) on the AI workloads. (This is how
>    workloads are deployed in our (Meta's/Facebook) fleet at this moment.)
>
> 2) global system policy = always, while we want some workloads to get THPs
>    only on madvise basis :- Same reason as 1). What these patches
>    will do is allow setting prctl(PR_DEFAULT_MADV_NOHUGEPAGE) on the database
>    workloads. (We hope this is us (Meta) in the near future: if a majority of
>    workloads show that they benefit from always, we flip the default host
>    setting to "always" across the fleet and workloads that regress can opt out
>    and be "madvise". New services developed will then be tested with always by
>    default. "always" is also the default defconfig option upstream, so I would
>    imagine this is faced by others as well.)
>
> v2->v3: (Thanks Lorenzo for all the below feedback!)
> v2: https://lore.kernel.org/all/20250515133519.2779639-1-usamaarif642@xxxxxxxxx/
> - no more flags2.
> - no more MMF2_...
> - renamed policy to PR_DEFAULT_MADV_(NO)HUGEPAGE
> - mmap_write_lock_killable acquired in PR_GET_THP_POLICY
> - mmap_write lock fixed in PR_SET_THP_POLICY
> - mmap assert check in process_default_madv_hugepage
> - check if hugepage_global_enabled is enabled in the call and account for s390
> - set mm->def_flags VM_HUGEPAGE and VM_NOHUGEPAGE according to the policy in
>   the way done by madvise(). I believe VM merge will not be broken in
>   this way.
> - process_default_madv_hugepage function that does for_each_vma and calls
>   hugepage_madvise.
>
> v1->v2:
> - change from modifying the THP decision making for the process, to modifying
>   VMA flags only. This prevents further complicating the logic used to
>   determine THP order (Thanks David!)
> - change from using a prctl per policy change to just using PR_SET_THP_POLICY
>   and arg2 to set the policy. (Zi Yan)
> - Introduce PR_THP_POLICY_DEFAULT_NOHUGE and PR_THP_POLICY_DEFAULT_SYSTEM
> - Add selftests and documentation.
>
> Usama Arif (7):
>   mm: khugepaged: extract vm flag setting outside of hugepage_madvise
>   prctl: introduce PR_DEFAULT_MADV_HUGEPAGE for the process
>   prctl: introduce PR_DEFAULT_MADV_NOHUGEPAGE for the process
>   prctl: introduce PR_THP_POLICY_SYSTEM for the process
>   selftests: prctl: introduce tests for PR_DEFAULT_MADV_NOHUGEPAGE
>   selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_HUGE
>   docs: transhuge: document process level THP controls
>
>  Documentation/admin-guide/mm/transhuge.rst    |  42 +++
>  include/linux/huge_mm.h                       |   2 +
>  include/linux/mm.h                            |   2 +-
>  include/linux/mm_types.h                      |   4 +-
>  include/uapi/linux/prctl.h                    |   6 +
>  kernel/sys.c                                  |  53 ++++
>  mm/huge_memory.c                              |  13 +
>  mm/khugepaged.c                               |  26 +-
>  tools/include/uapi/linux/prctl.h              |   6 +
>  .../trace/beauty/include/uapi/linux/prctl.h   |   6 +
>  tools/testing/selftests/prctl/Makefile        |   2 +-
>  tools/testing/selftests/prctl/thp_policy.c    | 286 ++++++++++++++++++
>  12 files changed, 436 insertions(+), 12 deletions(-)
>  create mode 100644 tools/testing/selftests/prctl/thp_policy.c
>
> --
> 2.47.1
>
>

--
Sincerely yours,
Mike.