On 9 Jun 2025, at 15:40, Lorenzo Stoakes wrote:

> On Mon, Jun 09, 2025 at 11:20:04AM -0400, Zi Yan wrote:
>> On 9 Jun 2025, at 10:50, Lorenzo Stoakes wrote:
>>
>>> On Mon, Jun 09, 2025 at 10:37:26AM -0400, Zi Yan wrote:
>>>> On 9 Jun 2025, at 10:16, Lorenzo Stoakes wrote:
>>>>
>>>>> On Mon, Jun 09, 2025 at 03:11:27PM +0100, Usama Arif wrote:
>>>
>>> [snip]
>>>
>>>>>> So I guess the question is what should be the next step? The following has been discussed:
>>>>>>
>>>>>> - Changing pageblock_order at runtime: This seems unreasonable after Zi's explanation above
>>>>>> and might have unintended consequences if done at runtime, so a no go?
>>>>>> - Decouple only watermark calculation and defrag granularity from pageblock order (also from Zi).
>>>>>> The decoupling can be done separately. Watermark calculation can be decoupled using the
>>>>>> approach taken in this RFC, although the max order used by the pagecache needs to be addressed.
>>>>>>
>>>>>
>>>>> I need to catch up with the thread (workload crazy atm), but why isn't it
>>>>> feasible to simply statically adjust the pageblock size?
>>>>>
>>>>> The whole point of 'defragmentation' is to _heuristically_ make it less
>>>>> likely there'll be fragmentation when requesting page blocks.
>>>>>
>>>>> And the watermark code is explicitly about providing reserves at a
>>>>> _pageblock granularity_.
>>>>>
>>>>> Why would we want to 'defragment' to 512MB physically contiguous chunks
>>>>> that we rarely use?
>>>>>
>>>>> Since it's all heuristic, it seems reasonable to me to cap it at a sensible
>>>>> level, no?
>>>>
>>>> What is a sensible level? 2MB is a good starting point. If we cap the pageblock
>>>> at 2MB, everyone should be happy at the moment. But if one user wants to
>>>> allocate a 4MB mTHP, they will most likely fail miserably, because the pageblock
>>>> is 2MB and the kernel is OK with having a 2MB MIGRATE_MOVABLE pageblock next to
>>>> a 2MB MIGRATE_UNMOVABLE one, making defragmenting 4MB an impossible job.
>>>>
>>>> Defragmentation has two components: 1) the pageblock, which has migratetypes
>>>> to prevent mixing movable and unmovable pages, as a single unmovable page
>>>> blocks large free pages from being created; 2) the memory compaction granularity,
>>>> which is the actual work of moving pages around to form large free pages.
>>>> Currently, the kernel assumes pageblock size = defragmentation granularity,
>>>> but in reality, as long as pageblock size >= defragmentation granularity,
>>>> memory compaction will still work, but not the other way around. So we
>>>> need to choose the pageblock size carefully so as not to break memory compaction.
>>>
>>> OK I get it - the issue is that compaction itself operates at a pageblock
>>> granularity, and once you get so fragmented that compaction is critical to
>>> defragmentation, you are stuck if the pageblock is not big enough.
>>
>> Right.
>>
>>>
>>> Thing is, a 512MB pageblock size for compaction seems insanely inefficient in
>>> itself, and if we're complaining about issues with unavailable reserved
>>> memory due to the crazy PMD size, surely one will encounter the compaction
>>> process simply failing to succeed/taking forever/causing issues with
>>> reclaim/higher order folio allocation.
>>
>> Yep. Initially, we probably never thought PMD THP would be as large as
>> 512MB.
>
> Of course, such is the 'organic' nature of kernel development :)
>
>>
>>>
>>> I mean, I don't really know the compaction code _at all_ (ran out of time
>>> to cover in book ;), but is it all-or-nothing? Does it grab a pageblock or
>>> give up?
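As an aside on the 512MB figure above: it falls straight out of the page table geometry. With 8-byte page table entries, one PMD-level table maps PAGE_SIZE / 8 base pages, so the PMD THP size grows quadratically with the base page size. The following is plain userspace arithmetic, not kernel code:

    #include <stdio.h>

    int main(void)
    {
            /* Base page sizes: 4KB, 16KB, 64KB. */
            unsigned long page_sizes[] = { 4096, 16384, 65536 };

            for (int i = 0; i < 3; i++) {
                    unsigned long page = page_sizes[i];
                    unsigned long ptes_per_pmd = page / 8;          /* 8-byte PTEs */
                    unsigned long pmd_bytes = ptes_per_pmd * page;  /* PMD THP size */

                    printf("%3luK base pages -> PMD THP = %4lu MB\n",
                           page >> 10, pmd_bytes >> 20);
            }
            return 0;
    }

This prints 2 MB, 32 MB and 512 MB respectively, which is why the pageblock-sized watermark reserves become so noticeable on 64KB base page systems.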
>>
>> compaction works on one pageblock at a time, trying to migrate in-use pages
>> within the pageblock away to create a free page for THP allocation.
>> It assumes the PMD THP size is equal to the pageblock size. It will keep working
>> until a PMD-THP-sized free page is created. This is a very high level
>> description, omitting a lot of details like how to avoid excessive compaction
>> work and how to reduce compaction latency.
>
> Yeah this matches my assumptions.
>
>>
>>>
>>> Because it strikes me that a crazy pageblock size would cause really
>>> serious system issues on that basis alone if that's the case.
>>>
>>> And again this leads me back to thinking it should just be the page block
>>> size _as a whole_ that should be adjusted.
>>>
>>> Keep in mind a user can literally reduce the page block size already via
>>> CONFIG_PAGE_BLOCK_MAX_ORDER.
>>>
>>> To me it seems that we should cap it at the highest _reasonable_ mTHP size
>>> you can get on a 64KB (i.e. maximum right? RIGHT? :P) base page size
>>> system.
>>>
>>> That way, people _can still get_ super huge PMD-sized huge folios up to the
>>> point of fragmentation.
>>>
>>> If we do reduce things this way we should give a config option to allow
>>> users who truly want colossal PMD sizes with associated
>>> watermarks/compaction to still be able to have them.
>>>
>>> CONFIG_PAGE_BLOCK_HARD_LIMIT_MB or something?
>>
>> I agree with capping pageblock size at the highest reasonable mTHP size.
>> In case there is some user relying on this huge PMD THP size, making
>> the pageblock size a boot time variable might be a little better, since
>> they would not need to recompile the kernel for their needs, assuming
>> distros will pick something like 2MB as the default pageblock size.
>
> Right, this seems sensible, as long as we set a _default_ that limits it to
> whatever it should be, 2MB or such.
>
> I don't think it's unreasonable to make that change since this 512 MB thing
> is so entirely unexpected and unusual.
>
> I think Usama said it would be a pain if it worked this way and had to be
> explicitly set as a boot time variable without defaulting like this.
>
>>
>>>
>>> I also question this de-coupling in general (I may be missing something
>>> however!) - the watermark code _very explicitly_ refers to providing
>>> _pageblocks_ in order to ensure _defragmentation_ right?
>>
>> Yes. Since without enough free memory (bigger than a PMD THP),
>> memory compaction will just do useless work.
>
> Yeah right, so this is a key thing and why we need to rework the current
> state of the patch.
>
>>
>>>
>>> We would need to absolutely justify why it's suddenly ok to not provide
>>> page blocks here.
>>>
>>> This is very very delicate code we have to be SO careful about.
>>>
>>> This is why I am being cautious here :)
>>
>> Understood. In theory, we could also associate watermarks with the allowed THP
>> orders the other way around, meaning that if the user lowers vm.min_free_kbytes,
>> all THP/mTHP sizes bigger than the watermark threshold are disabled
>> automatically. This could fix the memory compaction issues, but
>> that might also drive users crazy as they cannot use the THP sizes
>> they want.
>
> Yeah that's interesting but I think that's just far too subtle and people will
> have no idea what's going on.
>
> I really think a hard cap, expressed in KB/MB, on pageblock size is the way to
> go (but overrideable for people crazy enough to truly want 512 MB pages - and
> who cannot then complain about watermarks).

I agree.
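For illustration, here is roughly how a cap expressed in MB would translate into a pageblock order on a 64KB base page system. This is plain userspace arithmetic rather than kernel code; the helper name and the 2MB cap are made up for the example (in-kernel this would just be ilog2()/min()):

    #include <stdio.h>

    #define PAGE_SHIFT      16      /* 64KB base pages */

    /* Largest order whose block size still fits under a cap given in MB. */
    static unsigned int order_for_cap_mb(unsigned int cap_mb)
    {
            unsigned long pages = ((unsigned long)cap_mb << 20) >> PAGE_SHIFT;
            unsigned int order = 0;

            while ((1UL << (order + 1)) <= pages)
                    order++;
            return order;
    }

    int main(void)
    {
            unsigned int pmd_order = 13;                /* 512MB PMD THP with 64KB pages */
            unsigned int capped = order_for_cap_mb(2);  /* 2MB cap */

            if (capped > pmd_order)
                    capped = pmd_order;

            printf("pageblock order clamped from %u to %u (%lu KB pageblocks)\n",
                   pmd_order, capped, (1UL << (capped + PAGE_SHIFT)) >> 10);
            return 0;
    }

With a 2MB cap this clamps the pageblock order from 13 (512MB) down to 5 (2MB) on a 64KB page size kernel, while a 4KB page size kernel is unaffected since its PMD order is already 9 (2MB).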
Basically, I am thinking:

1) use something like 2MB as the default pageblock size for all arches (the
   value can be set differently if some arch wants a different pageblock size
   due to other reasons); this can be done by modifying PAGE_BLOCK_MAX_ORDER’s
   default value;
2) make pageblock_order a boot time parameter, so that users who want 512MB
   pages can still get them by changing the pageblock order at boot time (a
   rough sketch of this follows at the end of this mail).

WDYT?

>
>>
>> Often, users just ask for an impossible combination: they
>> want to use all free memory, because they paid for it, and they
>> want THPs, because they want max performance. When PMD THP is
>> small, like 2MB, the “unusable” free memory is not that noticeable,
>> but when PMD THP is as large as 512MB, users just cannot unsee it. :)
>
> Well, users asking for crazy things then being surprised when they get them
> is nothing new :P
>
>>
>>
>> Best Regards,
>> Yan, Zi
>
> Thanks for your input!
>
> Cheers, Lorenzo

Best Regards,
Yan, Zi
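Item 2) of the proposal above could look roughly like the sketch below. This is not a real patch, just a minimal illustration: it assumes pageblock_order is made a writable variable on all configs (today it is only a variable under CONFIG_HUGETLB_PAGE_SIZE_VARIABLE), that the handler runs before the first consumer of pageblock_order, and the parameter name is only a placeholder:

    /* Sketch for somewhere like mm/page_alloc.c: parse "pageblock_order=N"
     * from the kernel command line and clamp it to the maximum buddy order. */
    static int __init early_pageblock_order(char *buf)
    {
            unsigned int order;

            if (kstrtouint(buf, 10, &order))
                    return -EINVAL;

            if (order > MAX_PAGE_ORDER) {
                    pr_warn("pageblock_order %u too large, clamping to %d\n",
                            order, MAX_PAGE_ORDER);
                    order = MAX_PAGE_ORDER;
            }

            pageblock_order = order;
            return 0;
    }
    early_param("pageblock_order", early_pageblock_order);

Combined with item 1), distros would get a 2MB default pageblock via PAGE_BLOCK_MAX_ORDER, while someone who really wants 512MB pageblocks on a 64KB page size kernel could boot with pageblock_order=13 and accept the watermark cost.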