On 9 Jun 2025, at 7:34, Usama Arif wrote:

> On 06/06/2025 18:37, David Hildenbrand wrote:
>> On 06.06.25 16:37, Usama Arif wrote:
>>> On arm64 machines with 64K PAGE_SIZE, the min_free_kbytes and hence the
>>> watermarks are evaluated to extremely high values. For example, on a
>>> server with 480G of memory, with only the 2M mTHP hugepage size set to
>>> madvise and the rest of the sizes set to never, the min, low and high
>>> watermarks evaluate to 11.2G, 14G and 16.8G respectively.
>>> In contrast, for 4K PAGE_SIZE on the same machine, with only the 2M THP
>>> hugepage size set to madvise, the min, low and high watermarks evaluate
>>> to 86M, 566M and 1G respectively.
>>> This is because set_recommended_min_free_kbytes is designed for PMD
>>> hugepages (pageblock_order = min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)).
>>> Such high watermark values can cause performance and latency issues in
>>> memory-bound applications on arm servers that use 64K PAGE_SIZE, even
>>> though most of them would never actually use a 512M PMD THP.
>>>
>>> Instead of using HPAGE_PMD_ORDER for pageblock_order, use the highest
>>> large folio order enabled in set_recommended_min_free_kbytes.
>>> With this patch, when only the 2M THP hugepage size is set to madvise on
>>> the same machine with 64K page size, with the rest of the sizes set to
>>> never, the min, low and high watermarks evaluate to 2.08G, 2.6G and 3.1G
>>> respectively. When the 512M THP hugepage size is set to madvise on the
>>> same machine with 64K page size, the min, low and high watermarks
>>> evaluate to 11.2G, 14G and 16.8G respectively, the same as without this
>>> patch.
>>>
>>> An alternative solution would be to change PAGE_BLOCK_ORDER by changing
>>> ARCH_FORCE_MAX_ORDER to a lower value for ARM64_64K_PAGES.
However, this
>>> is not dynamic with hugepage size, will need different kernel builds for
>>> different hugepage sizes, and most users won't know that this needs to be
>>> done, as it can be difficult to determine that the performance and
>>> latency issues are coming from the high watermark values.
>>>
>>> All watermark numbers are for zones of nodes that had the highest number
>>> of pages, i.e. the value for min size for 4K is obtained using:
>>> cat /proc/zoneinfo | grep -i min | awk '{print $2}' | sort -n | tail -n 1 | awk '{print $1 * 4096 / 1024 / 1024}';
>>> and for 64K using:
>>> cat /proc/zoneinfo | grep -i min | awk '{print $2}' | sort -n | tail -n 1 | awk '{print $1 * 65536 / 1024 / 1024}';
>>>
>>> An arbitrary min of 128 pages is used when no hugepage sizes are
>>> enabled.
>>>
>>> Signed-off-by: Usama Arif <usamaarif642@xxxxxxxxx>
>>> ---
>>>  include/linux/huge_mm.h | 25 +++++++++++++++++++++++++
>>>  mm/khugepaged.c         | 32 ++++++++++++++++++++++++++++----
>>>  mm/shmem.c              | 29 +++++------------------------
>>>  3 files changed, 58 insertions(+), 28 deletions(-)
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index 2f190c90192d..fb4e51ef0acb 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -170,6 +170,25 @@ static inline void count_mthp_stat(int order, enum mthp_stat_item item)
>>>   }
>>>   #endif
>>> +/*
>>> + * Definitions for "huge tmpfs": tmpfs mounted with the huge= option
>>> + *
>>> + * SHMEM_HUGE_NEVER:
>>> + *	disables huge pages for the mount;
>>> + * SHMEM_HUGE_ALWAYS:
>>> + *	enables huge pages for the mount;
>>> + * SHMEM_HUGE_WITHIN_SIZE:
>>> + *	only allocate huge pages if the page will be fully within i_size,
>>> + *	also respect madvise() hints;
>>> + * SHMEM_HUGE_ADVISE:
>>> + *	only allocate huge pages if requested with madvise();
>>> + */
>>> +
>>> +#define SHMEM_HUGE_NEVER	0
>>> +#define SHMEM_HUGE_ALWAYS	1
>>> +#define SHMEM_HUGE_WITHIN_SIZE	2
>>> +#define SHMEM_HUGE_ADVISE	3
>>> +
>>>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>   extern unsigned long transparent_hugepage_flags;
>>> @@ -177,6 +196,12 @@ extern unsigned long huge_anon_orders_always;
>>>   extern unsigned long huge_anon_orders_madvise;
>>>   extern unsigned long huge_anon_orders_inherit;
>>> +extern int shmem_huge __read_mostly;
>>> +extern unsigned long huge_shmem_orders_always;
>>> +extern unsigned long huge_shmem_orders_madvise;
>>> +extern unsigned long huge_shmem_orders_inherit;
>>> +extern unsigned long huge_shmem_orders_within_size;
>>
>> Do really all of these have to be exported?
>>
>
> Hi David,
>
> Thanks for the review!
>
> For the RFC, I just did it similar to the anon ones when I got the build
> error trying to use these, but yeah, a much better approach would be to
> just have a function in shmem that would return the largest allowable
> shmem THP order.
>
>>> +
>>>   static inline bool hugepage_global_enabled(void)
>>>   {
>>>   	return transparent_hugepage_flags &
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index 15203ea7d007..e64cba74eb2a 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -2607,6 +2607,26 @@ static int khugepaged(void *none)
>>>   	return 0;
>>>   }
>>> +static int thp_highest_allowable_order(void)
>>
>> Did you mean "largest"?
>
> Yes
>
>>
>>> +{
>>> +	unsigned long orders = READ_ONCE(huge_anon_orders_always)
>>> +		| READ_ONCE(huge_anon_orders_madvise)
>>> +		| READ_ONCE(huge_shmem_orders_always)
>>> +		| READ_ONCE(huge_shmem_orders_madvise)
>>> +		| READ_ONCE(huge_shmem_orders_within_size);
>>> +	if (hugepage_global_enabled())
>>> +		orders |= READ_ONCE(huge_anon_orders_inherit);
>>> +	if (shmem_huge != SHMEM_HUGE_NEVER)
>>> +		orders |= READ_ONCE(huge_shmem_orders_inherit);
>>> +
>>> +	return orders == 0 ? 0 : fls(orders) - 1;
>>> +}
>>
>> But how does this interact with large folios / THPs in the page cache?
>>
>
> Yes this will be a problem.
>
> From what I see, there doesn't seem to be a max order for pagecache, only
> mapping_set_folio_min_order for the min.

Actually, there is one [1]. But it is limited by xas_split_alloc() and can
be lifted once xas_split_alloc() is gone (implying READ_ONLY_THP_FOR_FS
needs to go).

[1] https://elixir.bootlin.com/linux/v6.15.1/source/include/linux/pagemap.h#L377

> Does this mean that pagecache can fault in 128M, 256M, 512M large folios?
>
> I think this could increase the OOM rate significantly when ARM64 servers
> are used with filesystems that support large folios.
>
> Should there be an upper limit for pagecache? If so, it would either be a
> new sysfs entry (which I don't like :( ) or just try and reuse the
> existing entries with something like thp_highest_allowable_order?

MAX_PAGECACHE_ORDER limits the max folio size at the moment in theory, and
the readahead code only reads PMD-level folios at max, IIRC.

--
Best Regards,
Yan, Zi