On Tue, Sep 2, 2025 at 2:23 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>
>
>
> On 02/09/2025 12:03, David Hildenbrand wrote:
> > On 02.09.25 12:34, Usama Arif wrote:
> >>
> >>
> >> On 02/09/2025 10:03, David Hildenbrand wrote:
> >>> On 02.09.25 04:28, Baolin Wang wrote:
> >>>>
> >>>>
> >>>> On 2025/9/2 00:46, David Hildenbrand wrote:
> >>>>> On 29.08.25 03:55, Baolin Wang wrote:
> >>>>>>
> >>>>>>
> >>>>>> On 2025/8/28 18:48, Dev Jain wrote:
> >>>>>>>
> >>>>>>> On 28/08/25 3:16 pm, Baolin Wang wrote:
> >>>>>>>> (Sorry for chiming in late)
> >>>>>>>>
> >>>>>>>> On 2025/8/22 22:10, David Hildenbrand wrote:
> >>>>>>>>>>> One could also easily support the value 255 (HPAGE_PMD_NR / 2 - 1),
> >>>>>>>>>>> but not sure if we have to add that for now.
> >>>>>>>>>>
> >>>>>>>>>> Yeah, not so sure about this, this is a 'just have to know' too,
> >>>>>>>>>> and yes you might add it to the docs, but people are going to be
> >>>>>>>>>> mightily confused, esp if it's a calculated value.
> >>>>>>>>>>
> >>>>>>>>>> I don't see any other way around having a separate tunable if we
> >>>>>>>>>> don't just have something VERY simple like on/off.
> >>>>>>>>>
> >>>>>>>>> Yeah, not advocating that we add support for other values than
> >>>>>>>>> 0/511, really.
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Also the mentioned issue sounds like something that needs to be
> >>>>>>>>>> fixed elsewhere, honestly, in the algorithm used to figure out
> >>>>>>>>>> mTHP ranges (I may be wrong - and happy to stand corrected if this
> >>>>>>>>>> is somehow inherent, but it really feels that way).
> >>>>>>>>>
> >>>>>>>>> I think the creep is unavoidable for certain values.
> >>>>>>>>>
> >>>>>>>>> If you have the first two pages of a PMD area populated, and you
> >>>>>>>>> allow at least half of the #PTEs to be none/zero, you'd first
> >>>>>>>>> collapse an order-2 folio, then an order-3 ... until you reach
> >>>>>>>>> PMD order.
> >>>>>>>>>
> >>>>>>>>> So for now we really should just support 0 / 511 to say "don't
> >>>>>>>>> collapse if there are holes" vs. "always collapse if there is at
> >>>>>>>>> least one pte used".
> >>>>>>>>
> >>>>>>>> If we only allow setting 0 or 511, as Nico mentioned before, "At
> >>>>>>>> 511, no mTHP collapses would ever occur anyway, unless you have 2MB
> >>>>>>>> disabled and other mTHP sizes enabled. Technically, at 511, only the
> >>>>>>>> highest enabled order would ever be collapsed."
> >>>>>>>
> >>>>>>> I didn't understand this statement. At 511, mTHP collapses will occur
> >>>>>>> if khugepaged cannot get a PMD folio. Our goal is to collapse to the
> >>>>>>> highest order folio.
> >>>>>>
> >>>>>> Yes, I'm not saying that it's incorrect behavior when set to 511. What
> >>>>>> I mean is, as in the example I gave below, users may only want to
> >>>>>> allow a large order collapse when the number of present PTEs reaches
> >>>>>> half of the large folio, in order to avoid RSS bloat.
> >>>>>
> >>>>> How do these users control allocation at fault time, where this
> >>>>> parameter is completely ignored?
> >>>>
> >>>> Sorry, I did not get your point. Why does the 'max_pte_none' need to
> >>>> control allocation at fault time? Could you be more specific? Thanks.
> >>>
> >>> The comment over khugepaged_max_ptes_none gives a hint:
> >>>
> >>> /*
> >>>  * default collapse hugepages if there is at least one pte mapped like
> >>>  * it would have happened if the vma was large enough during page
> >>>  * fault.
> >>>  *
> >>>  * Note that these are only respected if collapse was initiated by
> >>>  * khugepaged.
> >>>  */
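To make the "creep" David describes above concrete, here is a quick
userspace sketch. It assumes the per-order scaling discussed in this
thread, i.e. that an order-N collapse tolerates
max_ptes_none >> (PMD_ORDER - N) none PTEs (an assumption from the mTHP
collapse series, not current mainline behavior). With max_ptes_none set
to half the PMD ("at least half may be none"), two populated base pages
are enough to walk all the way up to PMD order:

/*
 * Illustration only, not kernel code: the collapse "creep" when half
 * of the PTEs may be none at every order. PMD_ORDER 9 assumes x86-64
 * (512 PTEs per PMD table).
 */
#include <stdio.h>

#define PMD_ORDER 9

int main(void)
{
	unsigned int max_ptes_none = 256;   /* half of 512 may be none */
	unsigned int populated = 2;         /* two faulted-in base pages */
	unsigned int order;

	for (order = 2; order <= PMD_ORDER; order++) {
		unsigned int nr_ptes = 1u << order;
		unsigned int limit = max_ptes_none >> (PMD_ORDER - order);
		unsigned int none = nr_ptes - populated;

		if (none > limit)
			break;
		printf("order-%u collapse: %u of %u PTEs none -> allowed\n",
		       order, none, nr_ptes);
		populated = nr_ptes;	/* the new folio maps every PTE */
	}
	return 0;
}

Each successful collapse fully populates the folio, which then satisfies
the threshold at the next order, so the region grows step by step to a
full PMD.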
> >>>
> >>> In the common case (for anything that really cares about RSS bloat)
> >>> you will just get a THP during page fault and consequently RSS bloat.
> >>>
> >>> As raised in my other reply, the only documented reason to set
> >>> max_ptes_none=0 seems to be when an application later (after once
> >>> possibly getting a THP already during page faults) did some
> >>> MADV_DONTNEED and wants to control the usage of THPs itself using
> >>> MADV_COLLAPSE.
> >>>
> >>> It's a questionable use case that has already become more problematic
> >>> with mTHP and page table reclaim.
> >>>
> >>> Let me explain:
> >>>
> >>> Before mTHP, if someone would MADV_DONTNEED (resulting in a page table
> >>> with at least one pte_none entry), there would have been no way we
> >>> would get memory over-allocated afterwards with max_ptes_none=0.
> >>>
> >>> (1) Page faults would spot "there is a page table" and just fall back
> >>>     to order-0 pages.
> >>> (2) khugepaged was told not to collapse through max_ptes_none=0.
> >>>
> >>> But now:
> >>>
> >>> (A) With mTHP during page faults, we can just end up over-allocating
> >>>     memory in such an area again: page faults will simply spot a bunch
> >>>     of pte_nones around the fault area and install an mTHP.
> >>>
> >>> (B) With page table reclaim (when zapping all PTEs in a table at
> >>>     once), we will reclaim the page table. The next page fault will
> >>>     just try installing a PMD THP again, because there is no PTE table
> >>>     anymore.
> >>>
> >>> So I question the utility of max_ptes_none. If you can't tame page
> >>> faults, then there is only limited sense in taming khugepaged. I think
> >>> there is value in setting max_ptes_none=0 for some corner cases, but I
> >>> have yet to learn why max_ptes_none=123 would make any sense.
> >>
> >> For PMD-mapped THPs with the THP shrinker, this has changed. You can
> >> basically tame page faults: when you encounter memory pressure, the
> >> shrinker kicks in if the value is less than HPAGE_PMD_NR - 1 (i.e. 511
> >> for x86), and will break down those hugepages and free up zero-filled
> >> memory.
> >
> > You are not really taming page faults, though, you are undoing what
> > page faults might have messed up :)
> >
> >> I have seen in our prod workloads that the memory usage and THP usage
> >> can spike (usually when the workload starts), but with memory pressure
> >> the memory usage is lower compared to with max_ptes_none = 511, while
> >> still keeping the benefits of THPs like lower TLB misses.
> >
> > Thanks for raising that: I think the current behavior is in place such
> > that you don't bounce back-and-forth between khugepaged collapse and
> > shrinker-split.

Yes, both collapse and shrinker split hinge on max_ptes_none to prevent
one of these things thrashing the effect of the other. I believe with
mTHP support in khugepaged, the max_ptes_none value in the shrinker must
also leverage the 'order' scaling to properly prevent thrashing. I've
been testing a patch for this that I might include in the V11.

> > There are likely other ways to achieve that, when we have in mind that
> > the thp shrinker will install zero pages and max_ptes_none includes
> > zero pages.
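To spell out the anti-thrashing invariant in code: a minimal sketch,
with hypothetical helper names (scaled_none_limit, may_collapse,
should_split) and the same per-order scaling assumption as above. The
point is that both sides derive their threshold from one value, so a
folio the shrinker splits is, by construction, one khugepaged may not
immediately re-collapse:

/*
 * Sketch only, not kernel code: khugepaged collapses when the number
 * of none PTEs is at or below the scaled limit; the shrinker splits
 * only when the zero-filled pages it would free exceed that same
 * limit. PMD_ORDER 9 assumes x86-64.
 */
#include <stdbool.h>
#include <stdio.h>

#define PMD_ORDER 9

static unsigned int scaled_none_limit(unsigned int max_ptes_none,
				      unsigned int order)
{
	return max_ptes_none >> (PMD_ORDER - order);
}

/* khugepaged side: may an order-N range with 'none' empty PTEs collapse? */
static bool may_collapse(unsigned int max_ptes_none, unsigned int order,
			 unsigned int none)
{
	return none <= scaled_none_limit(max_ptes_none, order);
}

/* shrinker side: split only if khugepaged could not undo it */
static bool should_split(unsigned int max_ptes_none, unsigned int order,
			 unsigned int zero_filled)
{
	return zero_filled > scaled_none_limit(max_ptes_none, order);
}

int main(void)
{
	unsigned int max_ptes_none = 51, order = 4, zero_filled = 5;

	if (should_split(max_ptes_none, order, zero_filled))
		printf("split; re-collapse possible? %s\n",
		       may_collapse(max_ptes_none, order, zero_filled) ?
		       "yes" : "no");	/* prints "no": no thrashing */
	return 0;
}

If the shrinker kept using the unscaled PMD-level threshold while
collapse used the scaled one, the two conditions could both hold for
small orders, and the folio would bounce between split and collapse.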
> >> I do agree that the value of max_ptes_none is magical and different
> >> workloads can react very differently to it. The relationship is
> >> definitely not linear, i.e. if I use max_ptes_none = 256, it does not
> >> mean that the memory regression of using THP=always vs THP=madvise is
> >> halved.
> >
> > To which value would you set it? Just 510? 0?
>
> There are some very large workloads in the Meta fleet that I
> experimented with and found that having a small value works out. I
> experimented with 0, 51 (10%) and 256 (50%). 51 was found to be a good
> compromise in terms of application metrics improving, having an
> acceptable amount of memory regression, and improved system-level
> metrics (lower TLB misses, fewer page faults). I am sure there was a
> better value out there for these workloads, but it was not possible to
> experiment with every value.
>
> In terms of wider rollout across the fleet, we are going to target 0
> (or a very, very small value) when moving from THP=madvise to always.
> Mainly because it is the least likely to cause a memory regression, as
> the THP shrinker will deal with page faults faulting in mostly
> zero-filled pages, and khugepaged won't collapse pages that are
> dominated by 4K zero-filled chunks.
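For reference, pinning the tunable to 0 as described in the rollout plan
is a one-line write to the standard transparent_hugepage sysfs knob; a
minimal C sketch (error handling kept to a minimum, needs root):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* standard sysfs location for khugepaged's max_ptes_none */
	const char *path =
		"/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none";
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, "0", 1) != 1) {
		perror("write");
		close(fd);
		return 1;
	}
	close(fd);
	return 0;
}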