On 06/06/2025 18:37, David Hildenbrand wrote:
> On 06.06.25 16:37, Usama Arif wrote:
>> On arm64 machines with 64K PAGE_SIZE, min_free_kbytes, and hence the
>> watermarks, evaluate to extremely high values. For example, on a server
>> with 480G of memory, with only the 2M mTHP hugepage size set to madvise
>> and the rest of the sizes set to never, the min, low and high watermarks
>> evaluate to 11.2G, 14G and 16.8G respectively.
>> In contrast, with 4K PAGE_SIZE on the same machine, with only the 2M THP
>> hugepage size set to madvise, the min, low and high watermarks evaluate
>> to 86M, 566M and 1G respectively.
>> This is because set_recommended_min_free_kbytes is designed for PMD
>> hugepages (pageblock_order = min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)).
>> Such high watermark values can cause performance and latency issues in
>> memory bound applications on arm servers that use 64K PAGE_SIZE, even
>> though most of them would never actually use a 512M PMD THP.
>>
>> Instead of using HPAGE_PMD_ORDER for pageblock_order, use the highest
>> large folio order enabled in set_recommended_min_free_kbytes.
>> With this patch, when only the 2M THP hugepage size is set to madvise for
>> the same machine with 64K page size, with the rest of the sizes set to
>> never, the min, low and high watermarks evaluate to 2.08G, 2.6G and 3.1G
>> respectively. When the 512M THP hugepage size is set to madvise for the
>> same machine with 64K page size, the min, low and high watermarks evaluate
>> to 11.2G, 14G and 16.8G respectively, the same as without this patch.
>>
>> An alternative solution would be to change PAGE_BLOCK_ORDER by changing
>> ARCH_FORCE_MAX_ORDER to a lower value for ARM64_64K_PAGES. However, this
>> is not dynamic with hugepage size, will need different kernel builds for
>> different hugepage sizes, and most users won't know that this needs to be
>> done, as it can be difficult to determine that the performance and latency
>> issues are coming from the high watermark values.
>>
>> All watermark numbers are for zones of nodes that had the highest number
>> of pages, i.e. the value for min size for 4K is obtained using:
>> cat /proc/zoneinfo | grep -i min | awk '{print $2}' | sort -n | tail -n 1 | awk '{print $1 * 4096 / 1024 / 1024}';
>> and for 64K using:
>> cat /proc/zoneinfo | grep -i min | awk '{print $2}' | sort -n | tail -n 1 | awk '{print $1 * 65536 / 1024 / 1024}';
>>
>> An arbitrary min of 128 pages is used when no hugepage sizes are
>> enabled.
>>
>> Signed-off-by: Usama Arif <usamaarif642@xxxxxxxxx>
>> ---
>>  include/linux/huge_mm.h | 25 +++++++++++++++++++++++++
>>  mm/khugepaged.c         | 32 ++++++++++++++++++++++++++++----
>>  mm/shmem.c              | 29 +++++------------------------
>>  3 files changed, 58 insertions(+), 28 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 2f190c90192d..fb4e51ef0acb 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -170,6 +170,25 @@ static inline void count_mthp_stat(int order, enum mthp_stat_item item)
>>  }
>>  #endif
>> +/*
>> + * Definitions for "huge tmpfs": tmpfs mounted with the huge= option
>> + *
>> + * SHMEM_HUGE_NEVER:
>> + *	disables huge pages for the mount;
>> + * SHMEM_HUGE_ALWAYS:
>> + *	enables huge pages for the mount;
>> + * SHMEM_HUGE_WITHIN_SIZE:
>> + *	only allocate huge pages if the page will be fully within i_size,
>> + *	also respect madvise() hints;
>> + * SHMEM_HUGE_ADVISE:
>> + *	only allocate huge pages if requested with madvise();
>> + */
>> +
>> + #define SHMEM_HUGE_NEVER	0
>> + #define SHMEM_HUGE_ALWAYS	1
>> + #define SHMEM_HUGE_WITHIN_SIZE	2
>> + #define SHMEM_HUGE_ADVISE	3
>> +
>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>  extern unsigned long transparent_hugepage_flags;
>> @@ -177,6 +196,12 @@ extern unsigned long huge_anon_orders_always;
>>  extern unsigned long huge_anon_orders_madvise;
>>  extern unsigned long huge_anon_orders_inherit;
>> +extern int shmem_huge __read_mostly;
>> +extern unsigned long huge_shmem_orders_always;
>> +extern unsigned long huge_shmem_orders_madvise;
>> +extern unsigned long huge_shmem_orders_inherit;
>> +extern unsigned long huge_shmem_orders_within_size;
>
> Do really all of these have to be exported?
>

Hi David,

Thanks for the review!

For the RFC, I just did it similarly to the anon ones when I got the build
error trying to use these, but yeah, a much better approach would be to just
have a function in shmem that returns the largest allowable shmem THP order.

>> +
>>  static inline bool hugepage_global_enabled(void)
>>  {
>>  	return transparent_hugepage_flags &
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 15203ea7d007..e64cba74eb2a 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -2607,6 +2607,26 @@ static int khugepaged(void *none)
>>  	return 0;
>>  }
>> +static int thp_highest_allowable_order(void)
>
> Did you mean "largest" ?

Yes

>
>> +{
>> +	unsigned long orders = READ_ONCE(huge_anon_orders_always)
>> +			       | READ_ONCE(huge_anon_orders_madvise)
>> +			       | READ_ONCE(huge_shmem_orders_always)
>> +			       | READ_ONCE(huge_shmem_orders_madvise)
>> +			       | READ_ONCE(huge_shmem_orders_within_size);
>> +	if (hugepage_global_enabled())
>> +		orders |= READ_ONCE(huge_anon_orders_inherit);
>> +	if (shmem_huge != SHMEM_HUGE_NEVER)
>> +		orders |= READ_ONCE(huge_shmem_orders_inherit);
>> +
>> +	return orders == 0 ? 0 : fls(orders) - 1;
>> +}
>
> But how does this interact with large folios / THPs in the page cache?
>

Yes, this will be a problem.
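
To make the "largest enabled order" idea above concrete, here is a minimal user-space sketch, not kernel code. It assumes the per-order enable masks are plain bitmasks where bit N means order N is enabled, as in the quoted helper. The names fls_ul(), largest_enabled_order() and order_to_bytes() are hypothetical and exist only for this illustration. Unlike the RFC helper, the sketch returns -1 when nothing is enabled, so that "no order enabled" is not confused with order 0; that is an assumption of the sketch, not something the patch does. It also does not account for large folios in the page cache, which is the open problem raised above.

```c
/*
 * User-space stand-in for the kernel's fls(): returns the 1-based index
 * of the highest set bit, or 0 when no bit is set.
 */
static int fls_ul(unsigned long x)
{
	int pos = 0;

	while (x) {
		pos++;
		x >>= 1;
	}
	return pos;
}

/*
 * Hypothetical helper: largest order set in either mask, or -1 when no
 * order is enabled at all (the RFC helper returns 0 in that case).
 */
static int largest_enabled_order(unsigned long anon_orders,
				 unsigned long shmem_orders)
{
	unsigned long orders = anon_orders | shmem_orders;

	return orders ? fls_ul(orders) - 1 : -1;
}

/* Size in bytes of a folio of the given order for a given base page size. */
static unsigned long long order_to_bytes(int order,
					 unsigned long long page_size)
{
	return page_size << order;
}
```

With a 64K base page, 2M corresponds to order 5 and 512M to order 13, while with a 4K base page 2M is order 9. So largest_enabled_order(1UL << 5, 0) models the "only 2M set to madvise" case on 64K pages, and order_to_bytes(5, 65536) gives back 2M.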