On Fri, Sep 12, 2025 at 11:53 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 12.09.25 17:51, Lorenzo Stoakes wrote:
> > On Fri, Sep 12, 2025 at 05:45:26PM +0200, David Hildenbrand wrote:
> >> On 12.09.25 17:41, Kiryl Shutsemau wrote:
> >>> On Fri, Sep 12, 2025 at 04:56:47PM +0200, David Hildenbrand wrote:
> >>>> On 12.09.25 16:35, Kiryl Shutsemau wrote:
> >>>>> On Fri, Sep 12, 2025 at 04:28:09PM +0200, David Hildenbrand wrote:
> >>>>>> On 12.09.25 15:47, David Hildenbrand wrote:
> >>>>>>> On 12.09.25 14:19, Kiryl Shutsemau wrote:
> >>>>>>>> On Thu, Sep 11, 2025 at 09:27:55PM -0600, Nico Pache wrote:
> >>>>>>>>> The following series provides khugepaged with the capability to collapse
> >>>>>>>>> anonymous memory regions to mTHPs.
> >>>>>>>>>
> >>>>>>>>> To achieve this we generalize the khugepaged functions to no longer depend
> >>>>>>>>> on PMD_ORDER. Then during the PMD scan, we use a bitmap to track individual
> >>>>>>>>> pages that are occupied (!none/zero). After the PMD scan is done, we do
> >>>>>>>>> binary recursion on the bitmap to find the optimal mTHP sizes for the PMD
> >>>>>>>>> range. The restriction on max_ptes_none is removed during the scan, to make
> >>>>>>>>> sure we account for the whole PMD range. When no mTHP size is enabled, the
> >>>>>>>>> legacy behavior of khugepaged is maintained. max_ptes_none will be scaled
> >>>>>>>>> by the attempted collapse order to determine how full a mTHP must be to be
> >>>>>>>>> eligible for the collapse to occur. If a mTHP collapse is attempted, but
> >>>>>>>>> contains swapped out, or shared pages, we don't perform the collapse. It is
> >>>>>>>>> now also possible to collapse to mTHPs without requiring the PMD THP size
> >>>>>>>>> to be enabled.
> >>>>>>>>>
> >>>>>>>>> When enabling (m)THP sizes, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on
> >>>>>>>>> 4K page size), it will be automatically capped to HPAGE_PMD_NR/2 - 1 for
> >>>>>>>>> mTHP collapses to prevent collapse "creep" behavior. This prevents
> >>>>>>>>> constantly promoting mTHPs to the next available size, which would occur
> >>>>>>>>> because a collapse introduces more non-zero pages that would satisfy the
> >>>>>>>>> promotion condition on subsequent scans.
> >>>>>>>> Hm. Maybe instead of capping at HPAGE_PMD_NR/2 - 1 we can count
> >>>>>>>> all-zeros 4k as none_or_zero? It mirrors the logic of shrinker.
> >>>>>>> BTW, I thought further about this and I agree: if we count zero-filled
> >>>>>>> pages towards none_or_zero one we can avoid the "creep" problem.
> >>>>>>>
> >>>>>>> The scanning-for-zero part is rather nasty, though.
> >>>>>> Aaand, thinking again from the other direction, this would mean that just
> >>>>>> because pages became zero after some time that we would no longer collapse
> >>>>>> because none_or_zero would then be higher. Hm ....
> >>>>>>
> >>>>>> How I hate all of this so very very much :)
> >>>>> This is not new. Shrinker has the same problem: it cannot distinguish
> >>>>> between hot 4k that happened to be zero from the 4k that is there just
> >>>>> because of we faulted in 2M a time.
> >>>> Right. And so far that problem is isolated to the shrinker.
> >>>>
> >>>> To me so far "none_or_zero" really meant "will I consume more memory when
> >>>> collapsing". That's not true for zero-filled pages, obviously.
> >>> Well, KSM can reclaim these zero-filled memory until we collapse it.
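
To make the "scanning-for-zero" concern above concrete: counting
zero-filled pages towards none_or_zero means every present 4K page in
the scanned range has to be read in full, much like the shrinker logic
Kiryl mentions. Below is a minimal, self-contained userspace sketch of
that per-page check; it is purely illustrative and not the khugepaged
code (the kernel side would use something like memchr_inv() on the
mapped page).

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define PAGE_SIZE 4096

/* Would this 4K page count as "zero" for none_or_zero purposes? */
static bool page_is_zero_filled(const unsigned char *page)
{
	size_t i;

	/* One full pass over the page: this is the cost being debated. */
	for (i = 0; i < PAGE_SIZE; i++) {
		if (page[i])
			return false;
	}
	return true;
}

int main(void)
{
	static unsigned char page[PAGE_SIZE];	/* zero-initialized */

	printf("all zeroes: %d\n", page_is_zero_filled(page));		/* prints 1 */
	page[123] = 1;
	printf("one dirty byte: %d\n", page_is_zero_filled(page));	/* prints 0 */
	return 0;
}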
> >>
> >> KSM is used so rarely (for good reasons) that I would never ever build an
> >> argument based on its existence :P
> >>
> >> But yes: during the very first shrinker discussion I raised that KSM can do
> >> the same thing. Obviously that was not good enough.
> >>
> >> --
> >> Cheers
> >>
> >> David / dhildenb
> >>
> >
> > With all this stuff said, do we have an actual plan for what we intend to do
> > _now_?
>
> Oh no, no I have to use my brain and it's Friday evening.
>
> >
> > As Nico has implemented a basic solution here that we all seem to agree is not
> > what we want.
> >
> > Without needing special new hardware or major reworks, what would this parameter
> > look like?
> >
> > What would the heuristics be? What about the eagerness scales?
> >
> > I'm but a simple kernel developer,
>
> :)
> > and interested in simple pragmatic stuff :)
> > do you have a plan right now David?
>
> Ehm, if you ask me that way ...
>
> >
> > Maybe we can start with something simple like a rough percentage per eagerness
> > entry that then gets scaled based on utilisation?
>
> ... I think we should probably:
>
> 1) Start with something very simple for mTHP that doesn't lock us into any particular direction.
>
> 2) Add an "eagerness" parameter with fixed scale and use that for mTHP as well

I think the best design is to map eagerness levels 0-5 to different
max_ptes_none values: 0, 32, 64, 128, 255, 511.

> 3) Improve that "eagerness" algorithm using a dynamic scale or #whatever
>
> 4) Solve world peace and world hunger
>
> 5) Connect it all to memory pressure / reclaim / shrinker / heuristics / hw hotness / #whatever
>
>
> I maintain my initial position that just using
>
> max_ptes_none == 511 -> collapse mTHP always
> max_ptes_none != 511 -> collapse mTHP only if we all PTEs are non-none/zero

I think we should implement the eagerness toggle and map it to
different max_ptes_none values like I described above. This fits
nicely into the current collapse_max_ptes_none() function. If we go
with just 0/511 without the eagerness changes, we would be removing
configurability only to reintroduce it later, when we could keep that
configurability from the start.

>
> As a starting point is probably simple and best, and likely leaves room for any
> changes later.
>
> Of course, we could do what Nico is proposing here, as 1) and change it all later.

I don't think this is much different from the eagerness approach; it
just compresses the max_ptes_none range from 0-511 down to 0-5 (or
0-10). I will wait for your RFC for the next version. Does your
implementation/thinking align with what I describe above?

>
> It's just when it comes to documenting all that stuff in patch #15 that I feel like
> "alright, we shouldn't be doing it longterm like that, so let's not make anybody
> depend on any weird behavior here by over-domenting it".
>
> I mean
>
> "
> +To prevent "creeping" behavior where collapses continuously promote to larger
> +orders, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on 4K page size), it is
> +capped to HPAGE_PMD_NR/2 - 1 for mTHP collapses. This is due to the fact
> +that introducing more than half of the pages to be non-zero it will always
> +satisfy the eligibility check on the next scan and the region will be collapse.
> "
>
> Is just way, way to detailed.
>
> I would just say "The kernel might decide to use a more conservative approach
> when collapsing smaller THPs" etc.

Sounds good, I can make it more ambiguous!

Cheers,
-- Nico

>
>
> Thoughts?
>
> --
> Cheers
>
> David / dhildenb
>
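
To make the eagerness idea concrete, below is a rough, self-contained
userspace sketch of how eagerness levels could map to max_ptes_none and
then be scaled per collapse order. The 0-5 -> 0,32,64,128,255,511 table
is the one proposed above; the proportional right-shift scaling (and
every name in the sketch) is an illustrative assumption, not the actual
khugepaged implementation or a committed design.

#include <stdio.h>

#define HPAGE_PMD_ORDER	9	/* 512 PTEs per PMD range with 4K pages */

/* eagerness 0..5 -> max_ptes_none, as proposed above */
static const int eagerness_to_max_ptes_none[] = { 0, 32, 64, 128, 255, 511 };

/* Assumed proportional scaling of the PMD-level value to a smaller order. */
static int scaled_max_ptes_none(int max_ptes_none, int order)
{
	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
}

int main(void)
{
	int eagerness, order;

	for (eagerness = 0; eagerness <= 5; eagerness++) {
		int max_none = eagerness_to_max_ptes_none[eagerness];

		printf("eagerness %d (max_ptes_none %d):\n", eagerness, max_none);
		for (order = 2; order <= HPAGE_PMD_ORDER; order++)
			printf("  order %d: up to %d of %d PTEs may be none/zero\n",
			       order, scaled_max_ptes_none(max_none, order),
			       1 << order);
	}
	return 0;
}

With a mapping like this, eagerness 0 degenerates to "collapse only fully
populated ranges" and eagerness 5 to "collapse (almost) always", which
brackets the two max_ptes_none == 511 / != 511 behaviors discussed above.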