On Fri, Sep 12, 2025 at 12:22 PM Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> wrote:
>
> On Fri, Sep 12, 2025 at 07:53:22PM +0200, David Hildenbrand wrote:
> > On 12.09.25 17:51, Lorenzo Stoakes wrote:
> > > With all this stuff said, do we have an actual plan for what we intend to do
> > > _now_?
> >
> > Oh no, no I have to use my brain and it's Friday evening.
>
> I apologise :)
>
> >
> > >
> > > As Nico has implemented a basic solution here that we all seem to agree is not
> > > what we want.
> > >
> > > Without needing special new hardware or major reworks, what would this parameter
> > > look like?
> > >
> > > What would the heuristics be? What about the eagerness scales?
> > >
> > > I'm but a simple kernel developer,
> >
> > :)
> >
> > and interested in simple pragmatic stuff :)
> > > do you have a plan right now David?
> >
> > Ehm, if you ask me that way ...
> >
> > >
> > > Maybe we can start with something simple like a rough percentage per eagerness
> > > entry that then gets scaled based on utilisation?
> >
> > ... I think we should probably:
> >
> > 1) Start with something very simple for mTHP that doesn't lock us into any particular direction.
>
> Yes.
>
> >
> > 2) Add an "eagerness" parameter with fixed scale and use that for mTHP as well
>
> Yes I think we're all pretty onboard with that it seems!
>
> >
> > 3) Improve that "eagerness" algorithm using a dynamic scale or #whatever
>
> Right, I feel like we could start with some very simple linear thing here and
> later maybe refine it?

I agree, something like 0,32,64,128,255,511 seems to map well, and is not too
different from what I'm doing with the scaling by (HPAGE_PMD_ORDER - order).

>
> >
> > 4) Solve world peace and world hunger
>
> Yes!

That would be pretty great ;) This should probably be a larger priority.

>
> >
> > 5) Connect it all to memory pressure / reclaim / shrinker / heuristics / hw hotness / #whatever
>
> I think these are TODOs :)
>
> >
> >
> > I maintain my initial position that just using
> >
> > max_ptes_none == 511 -> collapse mTHP always
> > max_ptes_none != 511 -> collapse mTHP only if we all PTEs are non-none/zero
> >
> > As a starting point is probably simple and best, and likely leaves room for any
> > changes later.
>
> Yes.
>
> >
> >
> > Of course, we could do what Nico is proposing here, as 1) and change it all later.
>
> Right.
>
> But that does mean for mTHP we're limited to 256 (or 255 was it?) but I guess
> given the 'creep' issue that's sensible.

I don't think that's much different from what David is trying to propose, given
eagerness=9 would be 50%. At 10 or 511, no matter what, you will only ever
collapse to the largest enabled order. The difference in my approach is that
technically, with PMD disabled and 511, you would still need 50% utilization to
collapse, which is not ideal if you always want to collapse to some mTHP size
even with 1 page occupied. With David's solution this is solved by never
allowing anything in between 255 and 511.

>
> >
> > It's just when it comes to documenting all that stuff in patch #15 that I feel like
> > "alright, we shouldn't be doing it longterm like that, so let's not make anybody
> > depend on any weird behavior here by over-domenting it".
> >
> > I mean
> >
> > "
> > +To prevent "creeping" behavior where collapses continuously promote to larger
> > +orders, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on 4K page size), it is
> > +capped to HPAGE_PMD_NR/2 - 1 for mTHP collapses. This is due to the fact
> > +that introducing more than half of the pages to be non-zero it will always
> > +satisfy the eligibility check on the next scan and the region will be collapse.
> > "
> >
> > Is just way, way to detailed.
> >
> > I would just say "The kernel might decide to use a more conservative approach
> > when collapsing smaller THPs" etc.
> >
> >
> > Thoughts?
>
> Well I've sort of reviewed oppositely there :) well at least that it needs to be
> a hell of a lot clearer (I find that comment really compressed and I just don't
> really understand it).

I think your review is still valid for improving the internal code comment. I
think David is suggesting that we not be so specific in the actual admin-guide
docs as we move towards a more opaque tunable.

>
> I guess I didn't think about people reading that and relying on it, so maybe we
> could alternatively make that succinct.
>
> But I think it'd be better to say something like "mTHP collapse cannot currently
> correctly function with half or more of the PTE entries empty, so we cap at just
> below this level" in this case.

Some middle ground might be the best answer: not too specific, but also
alluding to the inner workings a little.

Cheers,
-- Nico

>
> >
> > --
> > Cheers
> >
> > David / dhildenb
> >
>
> Cheers, Lorenzo
>
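
To make the two starting points in this thread easier to compare, here is a rough userspace model in Python. It is only a sketch, not kernel code. It assumes 4K base pages (HPAGE_PMD_ORDER = 9, so 512 PTEs per PMD), it reads "collapse mTHP always" as "allow all but one PTE in the region to be none", and the helper names (capped_scaled_limit, all_or_nothing_limit, creeps_without_cap) are hypothetical and do not come from the patch series.

```python
HPAGE_PMD_ORDER = 9                     # assumption: 4K base pages, 2M PMD
HPAGE_PMD_NR = 1 << HPAGE_PMD_ORDER     # 512 PTEs per PMD-sized region


def capped_scaled_limit(max_ptes_none: int, order: int) -> int:
    """Model of the cap-then-scale approach quoted from the patch docs.

    For mTHP orders, max_ptes_none >= HPAGE_PMD_NR/2 is capped to
    HPAGE_PMD_NR/2 - 1, then scaled down by (HPAGE_PMD_ORDER - order).
    """
    if order < HPAGE_PMD_ORDER and max_ptes_none >= HPAGE_PMD_NR // 2:
        max_ptes_none = HPAGE_PMD_NR // 2 - 1   # 255 with 512 PTEs
    return max_ptes_none >> (HPAGE_PMD_ORDER - order)


def all_or_nothing_limit(max_ptes_none: int, order: int) -> int:
    """Model of the simpler rule: 511 means always, anything else means
    collapse an mTHP only when no PTE in the region is none/zero."""
    if order == HPAGE_PMD_ORDER:
        return max_ptes_none                    # PMD collapse is unchanged
    if max_ptes_none == HPAGE_PMD_NR - 1:
        return (1 << order) - 1                 # assumed meaning of "always"
    return 0


def creeps_without_cap(max_ptes_none: int, order: int) -> bool:
    """After an order-`order` collapse fills one half of the enclosing
    order+1 region, is that parent eligible even if its other half is
    completely empty? Uses the uncapped, scaled limit."""
    none_in_parent = 1 << order                 # the untouched half
    parent_limit = max_ptes_none >> (HPAGE_PMD_ORDER - (order + 1))
    return none_in_parent <= parent_limit
```

Under these assumptions, with max_ptes_none = 511 and a 16-page (order-4) region, the capped and scaled model allows at most 7 empty PTEs while the all-or-nothing model allows 15, which is the difference in behaviour with PMD disabled that is described above. The last helper shows why the cap sits just below half: an uncapped value of 256 lets the parent region qualify immediately after a collapse, while 255 does not.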