On 9 Jun 2025, at 7:34, Usama Arif wrote:

> On 06/06/2025 18:37, David Hildenbrand wrote:
>> On 06.06.25 16:37, Usama Arif wrote:
>>> On arm64 machines with 64K PAGE_SIZE, the min_free_kbytes and hence the
>>> watermarks are evaluated to extremely high values. For example, on a
>>> server with 480G of memory, with only the 2M mTHP hugepage size set to
>>> madvise and the rest of the sizes set to never, the min, low and high
>>> watermarks evaluate to 11.2G, 14G and 16.8G respectively.
>>> In contrast, for 4K PAGE_SIZE on the same machine, with only the 2M THP
>>> hugepage size set to madvise, the min, low and high watermarks evaluate
>>> to 86M, 566M and 1G respectively.
>>> This is because set_recommended_min_free_kbytes is designed for PMD
>>> hugepages (pageblock_order = min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)).
>>> Such high watermark values can cause performance and latency issues in
>>> memory-bound applications on arm servers that use 64K PAGE_SIZE, even
>>> though most of them would never actually use a 512M PMD THP.
>>>
>>> Instead of using HPAGE_PMD_ORDER for pageblock_order, use the highest
>>> large folio order enabled in set_recommended_min_free_kbytes.
>>> With this patch, when only the 2M THP hugepage size is set to madvise on
>>> the same machine with 64K page size, with the rest of the sizes set to
>>> never, the min, low and high watermarks evaluate to 2.08G, 2.6G and 3.1G
>>> respectively. When the 512M THP hugepage size is set to madvise on the
>>> same machine with 64K page size, the min, low and high watermarks
>>> evaluate to 11.2G, 14G and 16.8G respectively, the same as without this
>>> patch.
>>>
>>> An alternative solution would be to change PAGE_BLOCK_ORDER by changing
>>> ARCH_FORCE_MAX_ORDER to a lower value for ARM64_64K_PAGES.
However, this
>>> is not dynamic with hugepage size, will need different kernel builds for
>>> different hugepage sizes, and most users won't know that this needs to be
>>> done, as it can be difficult to determine that the performance and
>>> latency issues are coming from the high watermark values.
>>>
>>> All watermark numbers are for zones of nodes that had the highest number
>>> of pages, i.e. the value for min size for 4K is obtained using:
>>> cat /proc/zoneinfo | grep -i min | awk '{print $2}' | sort -n | tail -n 1 | awk '{print $1 * 4096 / 1024 / 1024}';
>>> and for 64K using:
>>> cat /proc/zoneinfo | grep -i min | awk '{print $2}' | sort -n | tail -n 1 | awk '{print $1 * 65536 / 1024 / 1024}';
>>>
>>> An arbitrary min of 128 pages is used when no hugepage sizes are
>>> enabled.
>>>
>>> Signed-off-by: Usama Arif <usamaarif642@xxxxxxxxx>
>>> ---
>>>  include/linux/huge_mm.h | 25 +++++++++++++++++++++++++
>>>  mm/khugepaged.c         | 32 ++++++++++++++++++++++++++++----
>>>  mm/shmem.c              | 29 +++++------------------------
>>>  3 files changed, 58 insertions(+), 28 deletions(-)
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index 2f190c90192d..fb4e51ef0acb 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -170,6 +170,25 @@ static inline void count_mthp_stat(int order, enum mthp_stat_item item)
>>>   }
>>>   #endif
>>> +/*
>>> + * Definitions for "huge tmpfs": tmpfs mounted with the huge= option
>>> + *
>>> + * SHMEM_HUGE_NEVER:
>>> + *	disables huge pages for the mount;
>>> + * SHMEM_HUGE_ALWAYS:
>>> + *	enables huge pages for the mount;
>>> + * SHMEM_HUGE_WITHIN_SIZE:
>>> + *	only allocate huge pages if the page will be fully within i_size,
>>> + *	also respect madvise() hints;
>>> + * SHMEM_HUGE_ADVISE:
>>> + *	only allocate huge pages if requested with madvise();
>>> + */
>>> +
>>> +#define SHMEM_HUGE_NEVER	0
>>> +#define SHMEM_HUGE_ALWAYS	1
>>> +#define SHMEM_HUGE_WITHIN_SIZE	2
>>> +#define SHMEM_HUGE_ADVISE	3
>>> +
>>>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>   extern unsigned long transparent_hugepage_flags;
>>> @@ -177,6 +196,12 @@ extern unsigned long huge_anon_orders_always;
>>>   extern unsigned long huge_anon_orders_madvise;
>>>   extern unsigned long huge_anon_orders_inherit;
>>> +extern int shmem_huge __read_mostly;
>>> +extern unsigned long huge_shmem_orders_always;
>>> +extern unsigned long huge_shmem_orders_madvise;
>>> +extern unsigned long huge_shmem_orders_inherit;
>>> +extern unsigned long huge_shmem_orders_within_size;
>>
>> Do really all of these have to be exported?
>>
>
> Hi David,
>
> Thanks for the review!
>
> For the RFC, I just did it similar to the anon ones when I got the build
> error trying to use these, but yeah, a much better approach would be to
> just have a function in shmem that would return the largest allowable
> shmem THP order.
>
>>> +
>>>   static inline bool hugepage_global_enabled(void)
>>>   {
>>>   	return transparent_hugepage_flags &
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index 15203ea7d007..e64cba74eb2a 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -2607,6 +2607,26 @@ static int khugepaged(void *none)
>>>   	return 0;
>>>   }
>>> +static int thp_highest_allowable_order(void)
>>
>> Did you mean "largest"?
>
> Yes
>
>>
>>> +{
>>> +	unsigned long orders = READ_ONCE(huge_anon_orders_always)
>>> +		| READ_ONCE(huge_anon_orders_madvise)
>>> +		| READ_ONCE(huge_shmem_orders_always)
>>> +		| READ_ONCE(huge_shmem_orders_madvise)
>>> +		| READ_ONCE(huge_shmem_orders_within_size);
>>> +	if (hugepage_global_enabled())
>>> +		orders |= READ_ONCE(huge_anon_orders_inherit);
>>> +	if (shmem_huge != SHMEM_HUGE_NEVER)
>>> +		orders |= READ_ONCE(huge_shmem_orders_inherit);
>>> +
>>> +	return orders == 0 ? 0 : fls(orders) - 1;
>>> +}
>>
>> But how does this interact with large folios / THPs in the page cache?
>>
>
> Yes this will be a problem.
>
> From what I see, there doesn't seem to be a max order for pagecache, only
> mapping_set_folio_min_order for the min.

Actually, there is one [1]. But it is limited by xas_split_alloc() and can
be lifted once xas_split_alloc() is gone (implying READ_ONLY_THP_FOR_FS
needs to go).

[1] https://elixir.bootlin.com/linux/v6.15.1/source/include/linux/pagemap.h#L377

> Does this mean that pagecache can fault in 128M, 256M, 512M large folios?
>
> I think this could increase the OOM rate significantly when ARM64 servers
> are used with filesystems that support large folios.
>
> Should there be an upper limit for pagecache? If so, it would either be a
> new sysfs entry (which I don't like :( ) or just try and reuse the
> existing entries with something like thp_highest_allowable_order?

MAX_PAGECACHE_ORDER limits the max folio size at the moment in theory, and
the readahead code only reads PMD-level folios at max, IIRC.

--
Best Regards,
Yan, Zi