This series allows changing the THP policy of a process according to the
value set in arg2 of prctl(PR_SET_THP_POLICY). Each policy is inherited
across fork+exec:

- PR_DEFAULT_MADV_HUGEPAGE: This will set VM_HUGEPAGE and clear
  VM_NOHUGEPAGE for the default VMA flags. It will also iterate through
  every VMA in the process and call hugepage_madvise on it, with
  MADV_HUGEPAGE policy. This effectively allows setting MADV_HUGEPAGE on
  the entire process. In an environment where different types of
  workloads are run on the same machine, this lets workloads that benefit
  from always having hugepages get them, without regressing those that
  don't.

- PR_DEFAULT_MADV_NOHUGEPAGE: This will set VM_NOHUGEPAGE and clear
  VM_HUGEPAGE for the default VMA flags. It will also iterate through
  every VMA in the process and call hugepage_madvise on it, with
  MADV_NOHUGEPAGE policy. This effectively allows setting MADV_NOHUGEPAGE
  on the entire process. In an environment where different types of
  workloads are run on the same machine, this lets workloads that benefit
  from having hugepages on an madvise basis only do so, without
  regressing those that benefit from always having hugepages.

- PR_THP_POLICY_SYSTEM: This will reset (clear) both VM_HUGEPAGE and
  VM_NOHUGEPAGE in the default VMA flags of the process.

At hyperscalers, we have a single THP policy for the entire fleet. We
have different types of workloads (e.g. AI/compute/databases/etc) running
on a single server. Some of these workloads will benefit from always
getting THPs at fault (or collapsed by khugepaged), and some of them will
benefit from getting them only on madvise.

This series is useful for 2 usecases:

1) global system policy = madvise, while we want some workloads to get
THPs at fault and by khugepaged :- some processes (e.g. AI workloads)
benefit from getting THPs at fault (and collapsed by khugepaged).
Other workloads, such as databases, will regress (either in performance,
or because they are completely memory bound and even a very slight
increase in memory use will cause them to OOM). So what these patches
will do is allow setting prctl(PR_DEFAULT_MADV_HUGEPAGE) on the AI
workloads. (This is how workloads are deployed in our (Meta's/Facebook)
fleet at this moment.)

2) global system policy = always, while we want some workloads to get
THPs only on madvise basis :- Same reason as 1). What these patches will
do is allow setting prctl(PR_DEFAULT_MADV_NOHUGEPAGE) on the database
workloads. (We hope this is us (Meta) in the near future: if a majority
of workloads show that they benefit from always, we flip the default host
setting to "always" across the fleet, and workloads that regress can
opt out and be "madvise". New services developed will then be tested with
always by default. "always" is also the default defconfig option
upstream, so I would imagine this is faced by others as well.)

v2->v3: (Thanks Lorenzo for all the below feedback!)
v2: https://lore.kernel.org/all/20250515133519.2779639-1-usamaarif642@xxxxxxxxx/
- no more flags2.
- no more MMF2_...
- renamed policy to PR_DEFAULT_MADV_(NO)HUGEPAGE
- mmap_write_lock_killable acquired in PR_GET_THP_POLICY
- mmap_write lock fixed in PR_SET_THP_POLICY
- mmap assert check in process_default_madv_hugepage
- check if hugepage_global_enabled is enabled in the call and account
  for s390
- set mm->def_flags VM_HUGEPAGE and VM_NOHUGEPAGE according to the
  policy in the way done by madvise(). I believe VMA merging will not be
  broken in this way.
- process_default_madv_hugepage function that does for_each_vma and
  calls hugepage_madvise.

v1->v2:
- change from modifying the THP decision making for the process, to
  modifying VMA flags only. This prevents further complicating the logic
  used to determine THP order (Thanks David!)
- change from using a prctl per policy change to just using
  PR_SET_THP_POLICY and arg2 to set the policy.
  (Zi Yan)
- Introduce PR_THP_POLICY_DEFAULT_NOHUGE and PR_THP_POLICY_DEFAULT_SYSTEM
- Add selftests and documentation.

Usama Arif (7):
  mm: khugepaged: extract vm flag setting outside of hugepage_madvise
  prctl: introduce PR_DEFAULT_MADV_HUGEPAGE for the process
  prctl: introduce PR_DEFAULT_MADV_NOHUGEPAGE for the process
  prctl: introduce PR_THP_POLICY_SYSTEM for the process
  selftests: prctl: introduce tests for PR_DEFAULT_MADV_NOHUGEPAGE
  selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_HUGE
  docs: transhuge: document process level THP controls

 Documentation/admin-guide/mm/transhuge.rst  |  42 +++
 include/linux/huge_mm.h                     |   2 +
 include/linux/mm.h                          |   2 +-
 include/linux/mm_types.h                    |   4 +-
 include/uapi/linux/prctl.h                  |   6 +
 kernel/sys.c                                |  53 ++++
 mm/huge_memory.c                            |  13 +
 mm/khugepaged.c                             |  26 +-
 tools/include/uapi/linux/prctl.h            |   6 +
 .../trace/beauty/include/uapi/linux/prctl.h |   6 +
 tools/testing/selftests/prctl/Makefile      |   2 +-
 tools/testing/selftests/prctl/thp_policy.c  | 286 ++++++++++++++++++
 12 files changed, 436 insertions(+), 12 deletions(-)
 create mode 100644 tools/testing/selftests/prctl/thp_policy.c

-- 
2.47.1
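
For illustration only, here is a rough userspace sketch of how a launcher
might select one of the policies described in this cover letter before
exec'ing a workload. The numeric values below are placeholders I made up
so the snippet compiles against unpatched headers; the real values are
whatever the patched include/uapi/linux/prctl.h defines. The helper names
(thp_policy_from_name, set_thp_policy) and the policy name strings are
also hypothetical and not part of the series.

```c
#include <errno.h>
#include <string.h>
#include <sys/prctl.h>

/*
 * ASSUMPTION: placeholder values, used only when the installed headers do
 * not carry this series. PR_SET_THP_POLICY is deliberately set to a number
 * that no mainline prctl option uses, so the call should simply fail
 * (typically with EINVAL) on a kernel without these patches.
 */
#ifndef PR_SET_THP_POLICY
#define PR_SET_THP_POLICY		0x7f000001
#endif
#ifndef PR_DEFAULT_MADV_HUGEPAGE
#define PR_DEFAULT_MADV_HUGEPAGE	0
#endif
#ifndef PR_DEFAULT_MADV_NOHUGEPAGE
#define PR_DEFAULT_MADV_NOHUGEPAGE	1
#endif
#ifndef PR_THP_POLICY_SYSTEM
#define PR_THP_POLICY_SYSTEM		2
#endif

/* Hypothetical helper: map a policy name to its arg2 value, or -1. */
static int thp_policy_from_name(const char *name)
{
	if (strcmp(name, "hugepage") == 0)
		return PR_DEFAULT_MADV_HUGEPAGE;
	if (strcmp(name, "nohugepage") == 0)
		return PR_DEFAULT_MADV_NOHUGEPAGE;
	if (strcmp(name, "system") == 0)
		return PR_THP_POLICY_SYSTEM;
	return -1;
}

/*
 * Hypothetical helper: pass the policy as arg2 of PR_SET_THP_POLICY.
 * Returns 0 on success or a negative errno value on failure.
 */
static int set_thp_policy(int policy)
{
	if (prctl(PR_SET_THP_POLICY, (unsigned long)policy, 0UL, 0UL, 0UL) == -1)
		return -errno;
	return 0;
}
```

A launcher could call set_thp_policy(thp_policy_from_name(argv[1])) and
then execvp() the workload, relying on the policy being inherited across
fork+exec as described above.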