On Mon, Jun 09, 2025 at 03:11:27PM +0100, Usama Arif wrote:
>
>
> On 09/06/2025 14:19, Zi Yan wrote:
> > On 9 Jun 2025, at 7:13, Usama Arif wrote:
> >
> >> On 06/06/2025 17:10, Zi Yan wrote:
> >>> On 6 Jun 2025, at 11:38, Usama Arif wrote:
> >>>
> >>>> On 06/06/2025 16:18, Zi Yan wrote:
> >>>>> On 6 Jun 2025, at 10:37, Usama Arif wrote:
> >>>>>
> >>>>>> On arm64 machines with 64K PAGE_SIZE, the min_free_kbytes and hence the
> >>>>>> watermarks are evaluated to extremely high values. E.g. on a server with
> >>>>>> 480G of memory, with only the 2M mTHP hugepage size set to madvise and
> >>>>>> the rest of the sizes set to never, the min, low and high watermarks
> >>>>>> evaluate to 11.2G, 14G and 16.8G respectively.
> >>>>>> In contrast, for 4K PAGE_SIZE on the same machine, with only the 2M THP
> >>>>>> hugepage size set to madvise, the min, low and high watermarks evaluate
> >>>>>> to 86M, 566M and 1G respectively.
> >>>>>> This is because set_recommended_min_free_kbytes() is designed for PMD
> >>>>>> hugepages (pageblock_order = min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)).
> >>>>>> Such high watermark values can cause performance and latency issues in
> >>>>>> memory-bound applications on arm servers that use 64K PAGE_SIZE, even
> >>>>>> though most of them would never actually use a 512M PMD THP.
> >>>>>>
> >>>>>> Instead of using HPAGE_PMD_ORDER for pageblock_order, use the highest
> >>>>>> large folio order enabled in set_recommended_min_free_kbytes().
> >>>>>> With this patch, when only the 2M THP hugepage size is set to madvise
> >>>>>> for the same machine with 64K page size, with the rest of the sizes set
> >>>>>> to never, the min, low and high watermarks evaluate to 2.08G, 2.6G and
> >>>>>> 3.1G respectively. When the 512M THP hugepage size is set to madvise for
> >>>>>> the same machine with 64K page size, the min, low and high watermarks
> >>>>>> evaluate to 11.2G, 14G and 16.8G respectively, the same as without this
> >>>>>> patch.
> >>>>>
> >>>>> Getting pageblock_order involved here might be confusing. I think you just
> >>>>> want to adjust min, low and high watermarks to reasonable values.
> >>>>> Is it OK to rename min_thp_pageblock_nr_pages to min_nr_free_pages_per_zone
> >>>>> and move MIGRATE_PCPTYPES * MIGRATE_PCPTYPES inside? Otherwise, the changes
> >>>>> look reasonable to me.
> >>>>
> >>>> Hi Zi,
> >>>>
> >>>> Thanks for the review!
> >>>>
> >>>> I forgot to change it in another place, sorry about that! So we can't move
> >>>> MIGRATE_PCPTYPES * MIGRATE_PCPTYPES into the combined function.
> >>>> I have added the additional place where min_thp_pageblock_nr_pages() is
> >>>> called as a fixlet here:
> >>>> https://lore.kernel.org/all/a179fd65-dc3f-4769-9916-3033497188ba@xxxxxxxxx/
> >>>>
> >>>> I think at least in this context the original name pageblock_nr_pages isn't
> >>>> correct, as it's min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER).
> >>>> The new name min_thp_pageblock_nr_pages is also not really good, so I am
> >>>> happy to change it to something appropriate.
> >>>
> >>> Got it. The pageblock is the defragmentation granularity. If the user only
> >>> wants 2MB mTHP, maybe pageblock order should be adjusted. Otherwise, the
> >>> kernel will defragment at 512MB granularity, which might not be efficient.
> >>> Maybe make pageblock_order a boot-time parameter?
> >>>
> >>> In addition, we are mixing two things together:
> >>> 1. min, low, and high watermarks: they affect when memory reclaim and
> >>> compaction will be triggered;
> >>> 2. pageblock order: it is the granularity of defragmentation for creating
> >>> mTHP/THP.
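Interjecting with a quick back-of-the-envelope on those numbers, since it
frames the rest of the discussion: if I'm reading set_recommended_min_free_kbytes()
right, the recommendation khugepaged makes scales linearly with
pageblock_nr_pages. The below is only a rough userspace sketch of that
arithmetic, not the kernel code verbatim, and it assumes two populated zones
and MIGRATE_PCPTYPES == 3:

#include <stdio.h>

/* Rough sketch of the sizing done in set_recommended_min_free_kbytes(). */
static unsigned long recommended_min_kbytes(unsigned int page_shift,
					    unsigned int pageblock_order,
					    unsigned int nr_zones)
{
	const unsigned long migrate_pcptypes = 3; /* UNMOVABLE, MOVABLE, RECLAIMABLE */
	unsigned long pageblock_nr_pages = 1UL << pageblock_order;
	unsigned long pages;

	/* two free pageblocks per zone for fragmentation avoidance ... */
	pages = pageblock_nr_pages * nr_zones * 2;
	/* ... plus MIGRATE_PCPTYPES^2 nearly-free pageblocks per zone */
	pages += pageblock_nr_pages * nr_zones * migrate_pcptypes * migrate_pcptypes;

	return pages << (page_shift - 10); /* pages -> kbytes */
}

int main(void)
{
	/* 4K pages, 2M pageblocks (order 9) vs 64K pages, 512M pageblocks (order 13) */
	printf("4K/2M:    %lu kB\n", recommended_min_kbytes(12, 9, 2));
	printf("64K/512M: %lu kB\n", recommended_min_kbytes(16, 13, 2));
	return 0;
}

With 512M pageblocks that works out to roughly 11G, the same ballpark as the
11.2G reported above, while with 4K pages and 2M pageblocks it is only about
44M, below the sqrt-based default from init_per_zone_wmark_min(), hence the
86M figure. The 5%-of-lowmem cap in that function should not kick in on a
480G box.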
> >>>
> >>> In your use case, you want to lower the watermarks, right? Considering what
> >>> you said below, I wonder if we want a way of enforcing vm.min_free_kbytes,
> >>> like a new sysctl knob, vm.force_min_free_kbytes (yeah, the suggestion is
> >>> lame, sorry).
> >>>
> >>> I think for 2, we might want to decouple pageblock order from
> >>> defragmentation granularity.
> >>>
> >>
> >> This is a good point. I only did it for the watermarks in the RFC, but there
> >> is no reason the defrag granularity has to be 512M chunks, and it is probably
> >> very inefficient to do so?
> >>
> >> Instead of replacing pageblock_nr_pages just for set_recommended_min_free_kbytes(),
> >> maybe we just need to change the definition of pageblock_order in [1] to take
> >> into account the highest large folio order enabled instead of HPAGE_PMD_ORDER?
> >
> > Ideally, yes. But pageblock migratetypes are stored in a fixed-size array
> > determined by pageblock_order at boot time (see usemap_size() in mm/mm_init.c).
> > Changing pageblock_order at runtime means we would need to resize the pageblock
> > migratetypes array, which is a little unrealistic. In a system with GBs or TBs
> > of memory, reducing pageblock_order by 1 means doubling the pageblock
> > migratetypes array and replicating one pageblock's migratetype to two;
> > increasing pageblock_order by 1 means halving the array and splitting a
> > pageblock into two. The former, if memory is enough, might be easy, but the
> > latter is a little involved, since for a pageblock with both movable and
> > unmovable pages, you will need to check all pages to decide the migratetypes
> > of the after-split pageblocks to make sure each pageblock migratetype matches
> > the pages inside that pageblock.
> >
>
> Thanks for explaining this so well, and for the code pointer!
> Yeah, it doesn't seem reasonable to change the size of pageblock_flags at runtime.
>
> >>
> >> [1] https://elixir.bootlin.com/linux/v6.15.1/source/include/linux/pageblock-flags.h#L50
> >>
> >> I really want to avoid coming up with a solution that requires changing a
> >> Kconfig or needs the kernel command line to change. It would mean a reboot
> >> whenever a different workload runs on a server that works optimally with a
> >> different THP size, and that would make workload orchestration a nightmare.
> >>
> >
> > As I said above, changing pageblock order at runtime might not be easy. But
> > changing the defragmentation granularity should be fine, since it just changes
> > the range of memory compaction. That is the reason for my proposal of
> > decoupling pageblock order from defragmentation granularity. We probably need
> > to do some experiments to see the impact of the decoupling, as I imagine
> > defragmenting a range smaller than pageblock order is fine, but defragmenting
> > a range larger than pageblock order might cause issues if there is any
> > unmovable pageblock within that range, since it is very likely that unmovable
> > pages reside in an unmovable pageblock and lead to a defragmentation failure.
> >
> >
>
> I saw you mentioned a proposal to decouple pageblock order from defrag
> granularity in one of the other replies as well; I just wanted to check if
> there was anything you had sent on lore in terms of a proposal or RFC that I
> could look at.
>
> So I guess the question is what should be the next step? The following has
> been discussed:
>
> - Changing pageblock_order at runtime: This seems unreasonable after Zi's
> explanation above and might have unintended consequences if done at runtime,
> so a no-go?
> - Decouple only watermark calculation and defrag granularity from pageblock
> order (also from Zi). The decoupling can be done separately. Watermark
> calculation can be decoupled using the approach taken in this RFC, although
> the max order used by pagecache needs to be addressed.
>

I need to catch up with the thread (workload crazy atm), but why isn't it
feasible to simply statically adjust the pageblock size?

The whole point of 'defragmentation' is to _heuristically_ make it less likely
there'll be fragmentation when requesting page blocks. And the watermark code
is explicitly about providing reserves at a _pageblock granularity_.

Why would we want to 'defragment' to 512MB physically contiguous chunks that
we rarely use?

Since it's all heuristic, it seems reasonable to me to cap it at a sensible
level, no?

>
> > --
> > Best Regards,
> > Yan, Zi
>
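For reference, on the watermark half of the list above, the "highest enabled
large folio order" idea could be as small as the sketch below. This is not the
actual RFC patch: it assumes the mTHP sysfs state is tracked in the
huge_anon_orders_* bitmaps as in mm/huge_memory.c, and it reuses the
min_thp_pageblock_nr_pages() name only because that is what this thread has
been calling it.

/* Sketch only: granularity to size the khugepaged watermark bump by. */
static unsigned long min_thp_pageblock_nr_pages(void)
{
	unsigned long enabled = READ_ONCE(huge_anon_orders_always) |
				READ_ONCE(huge_anon_orders_madvise);
	unsigned int order;

	/* Simplification: count "inherit" sizes only if the global toggle is on. */
	if (hugepage_global_enabled())
		enabled |= READ_ONCE(huge_anon_orders_inherit);

	if (!enabled)
		return pageblock_nr_pages;

	/* Highest enabled large folio order, never above the pageblock order. */
	order = min_t(unsigned int, fls_long(enabled) - 1, pageblock_order);
	return 1UL << order;
}

set_recommended_min_free_kbytes() would then scale its recommendation by this
instead of pageblock_nr_pages (the actual patch may well plumb it differently).
Either way, the question above still stands: if nobody asks for 512M THPs,
both the reserve sizing and the defragmentation granularity look like
candidates for capping at a more sensible level.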