Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes

On 06/06/2025 18:37, David Hildenbrand wrote:
> On 06.06.25 16:37, Usama Arif wrote:
>> On arm64 machines with 64K PAGE_SIZE, min_free_kbytes and hence the
>> watermarks evaluate to extremely high values. For example, on a server
>> with 480G of memory, with only the 2M mTHP size set to madvise and the
>> rest of the sizes set to never, the min, low and high watermarks
>> evaluate to 11.2G, 14G and 16.8G respectively.
>> In contrast, with 4K PAGE_SIZE on the same machine and only the 2M THP
>> size set to madvise, the min, low and high watermarks evaluate to 86M,
>> 566M and 1G respectively.
>> This is because set_recommended_min_free_kbytes is designed for PMD
>> hugepages (pageblock_order = min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)).
>> Such high watermark values can cause performance and latency issues in
>> memory-bound applications on arm servers that use 64K PAGE_SIZE, even
>> though most of them would never actually use a 512M PMD THP.
>>
>> Instead of using HPAGE_PMD_ORDER for pageblock_order, use the largest
>> large folio order enabled in set_recommended_min_free_kbytes.
>> With this patch, when only 2M THP hugepage size is set to madvise for the
>> same machine with 64K page size, with the rest of the sizes set to never,
>> the min, low and high watermarks evaluate to 2.08G, 2.6G and 3.1G
>> respectively. When 512M THP hugepage size is set to madvise for the same
>> machine with 64K page size, the min, low and high watermarks evaluate to
>> 11.2G, 14G and 16.8G respectively, the same as without this patch.
>>
>> An alternative solution would be to change PAGE_BLOCK_ORDER by changing
>> ARCH_FORCE_MAX_ORDER to a lower value for ARM64_64K_PAGES. However, this
>> is not dynamic with hugepage size, would need different kernel builds for
>> different hugepage sizes, and most users won't know that this needs to be
>> done, as it can be difficult to determine that the performance and
>> latency issues are coming from the high watermark values.
>>
>> All watermark numbers are for zones of nodes that had the highest number
>> of pages, i.e. the value for min size for 4K is obtained using:
>> cat /proc/zoneinfo  | grep -i min | awk '{print $2}' | sort -n  | tail -n 1 | awk '{print $1 * 4096 / 1024 / 1024}';
>> and for 64K using:
>> cat /proc/zoneinfo  | grep -i min | awk '{print $2}' | sort -n  | tail -n 1 | awk '{print $1 * 65536 / 1024 / 1024}';
>>
>> An arbitrary min of 128 pages is used when no hugepage sizes are
>> enabled.
>>
>> Signed-off-by: Usama Arif <usamaarif642@xxxxxxxxx>
>> ---
>>   include/linux/huge_mm.h | 25 +++++++++++++++++++++++++
>>   mm/khugepaged.c         | 32 ++++++++++++++++++++++++++++----
>>   mm/shmem.c              | 29 +++++------------------------
>>   3 files changed, 58 insertions(+), 28 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 2f190c90192d..fb4e51ef0acb 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -170,6 +170,25 @@ static inline void count_mthp_stat(int order, enum mthp_stat_item item)
>>   }
>>   #endif
>>   +/*
>> + * Definitions for "huge tmpfs": tmpfs mounted with the huge= option
>> + *
>> + * SHMEM_HUGE_NEVER:
>> + *    disables huge pages for the mount;
>> + * SHMEM_HUGE_ALWAYS:
>> + *    enables huge pages for the mount;
>> + * SHMEM_HUGE_WITHIN_SIZE:
>> + *    only allocate huge pages if the page will be fully within i_size,
>> + *    also respect madvise() hints;
>> + * SHMEM_HUGE_ADVISE:
>> + *    only allocate huge pages if requested with madvise();
>> + */
>> +
>> + #define SHMEM_HUGE_NEVER    0
>> + #define SHMEM_HUGE_ALWAYS    1
>> + #define SHMEM_HUGE_WITHIN_SIZE    2
>> + #define SHMEM_HUGE_ADVISE    3
>> +
>>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>     extern unsigned long transparent_hugepage_flags;
>> @@ -177,6 +196,12 @@ extern unsigned long huge_anon_orders_always;
>>   extern unsigned long huge_anon_orders_madvise;
>>   extern unsigned long huge_anon_orders_inherit;
>>   +extern int shmem_huge __read_mostly;
>> +extern unsigned long huge_shmem_orders_always;
>> +extern unsigned long huge_shmem_orders_madvise;
>> +extern unsigned long huge_shmem_orders_inherit;
>> +extern unsigned long huge_shmem_orders_within_size;
> 
> Do really all of these have to be exported?
> 

Hi David,

Thanks for the review!

For the RFC, I just did it similarly to the anon ones when I got the build
error trying to use these, but yeah, a much better approach would be to have
a function in shmem that returns the largest allowable shmem THP order.

>> +
>>   static inline bool hugepage_global_enabled(void)
>>   {
>>       return transparent_hugepage_flags &
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 15203ea7d007..e64cba74eb2a 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -2607,6 +2607,26 @@ static int khugepaged(void *none)
>>       return 0;
>>   }
>>   +static int thp_highest_allowable_order(void)
> 
> Did you mean "largest" ?

Yes

> 
>> +{
>> +    unsigned long orders = READ_ONCE(huge_anon_orders_always)
>> +                   | READ_ONCE(huge_anon_orders_madvise)
>> +                   | READ_ONCE(huge_shmem_orders_always)
>> +                   | READ_ONCE(huge_shmem_orders_madvise)
>> +                   | READ_ONCE(huge_shmem_orders_within_size);
>> +    if (hugepage_global_enabled())
>> +        orders |= READ_ONCE(huge_anon_orders_inherit);
>> +    if (shmem_huge != SHMEM_HUGE_NEVER)
>> +        orders |= READ_ONCE(huge_shmem_orders_inherit);
>> +
>> +    return orders == 0 ? 0 : fls(orders) - 1;
>> +}
> 
> But how does this interact with large folios / THPs in the page cache?
> 

Yes, this will be a problem.

