On 9 Jun 2025, at 16:03, Usama Arif wrote:
> On 09/06/2025 20:49, Zi Yan wrote:
>> On 9 Jun 2025, at 15:40, Lorenzo Stoakes wrote:
>>
>>> On Mon, Jun 09, 2025 at 11:20:04AM -0400, Zi Yan wrote:
>>>> On 9 Jun 2025, at 10:50, Lorenzo Stoakes wrote:
>>>>
>>>>> On Mon, Jun 09, 2025 at 10:37:26AM -0400, Zi Yan wrote:
>>>>>> On 9 Jun 2025, at 10:16, Lorenzo Stoakes wrote:
>>>>>>
>>>>>>> On Mon, Jun 09, 2025 at 03:11:27PM +0100, Usama Arif wrote:
>>>>>
>>>>> [snip]
>>>>>
>>>>>>>> So I guess the question is what should be the next step? The following has been discussed:
>>>>>>>>
>>>>>>>> - Changing pageblock_order at runtime: This seems unreasonable after Zi's explanation above
>>>>>>>> and might have unintended consequences if done at runtime, so a no go?
>>>>>>>> - Decouple only watermark calculation and defrag granularity from pageblock order (also from Zi).
>>>>>>>> The decoupling can be done separately. Watermark calculation can be decoupled using the
>>>>>>>> approach taken in this RFC, although the max order used by the pagecache needs to be addressed.
>>>>>>>>
>>>>>>>
>>>>>>> I need to catch up with the thread (workload crazy atm), but why isn't it
>>>>>>> feasible to simply statically adjust the pageblock size?
>>>>>>>
>>>>>>> The whole point of 'defragmentation' is to _heuristically_ make it less
>>>>>>> likely there'll be fragmentation when requesting page blocks.
>>>>>>>
>>>>>>> And the watermark code is explicitly about providing reserves at a
>>>>>>> _pageblock granularity_.
>>>>>>>
>>>>>>> Why would we want to 'defragment' to 512MB physically contiguous chunks
>>>>>>> that we rarely use?
>>>>>>>
>>>>>>> Since it's all heuristic, it seems reasonable to me to cap it at a sensible
>>>>>>> level no?
>>>>>>
>>>>>> What is a sensible level? 2MB is a good starting point. If we cap pageblock
>>>>>> at 2MB, everyone should be happy at the moment. But if one user wants to
>>>>>> allocate 4MB mTHP, they will most likely fail miserably, because the pageblock
>>>>>> is 2MB and the kernel is OK with having a 2MB MIGRATE_MOVABLE pageblock next
>>>>>> to a 2MB MIGRATE_UNMOVABLE one, making defragmenting 4MB an impossible job.
>>>>>>
>>>>>> Defragmentation has two components: 1) pageblocks, which have migratetypes
>>>>>> to prevent mixing movable and unmovable pages, as a single unmovable page
>>>>>> blocks large free pages from being created; 2) memory compaction granularity,
>>>>>> which is the actual work of moving pages around to form large free pages.
>>>>>> Currently, the kernel assumes pageblock size = defragmentation granularity,
>>>>>> but in reality, as long as pageblock size >= defragmentation granularity,
>>>>>> memory compaction still works, though not the other way around. So we
>>>>>> need to choose the pageblock size carefully to not break memory compaction.
>>>>>
>>>>> OK I get it - the issue is that compaction itself operates at a pageblock
>>>>> granularity, and once you get so fragmented that compaction is critical to
>>>>> defragmentation, you are stuck if the pageblock is not big enough.
>>>>
>>>> Right.
>>>>
>>>>>
>>>>> Thing is, a 512MB pageblock size for compaction seems insanely inefficient in
>>>>> itself, and if we're complaining about issues with unavailable reserved
>>>>> memory due to the crazy PMD size, surely one will encounter the compaction
>>>>> process simply failing to succeed/taking forever/causing issues with
>>>>> reclaim/higher order folio allocation.
>>>>
>>>> Yep. Initially, we probably never thought PMD THP would be as large as
>>>> 512MB.
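For anyone skimming the thread, the 512MB figure falls straight out of the
page table geometry. Here is a tiny userspace sketch of the arithmetic,
assuming 64KB base pages and 8-byte page table entries; illustration only,
not kernel code:

/* Illustration only, not kernel code: why 64KB base pages imply 512MB PMDs. */
#include <stdio.h>

int main(void)
{
	unsigned long page_shift = 16;                            /* 64KB base pages */
	unsigned long ptes_per_table = (1UL << page_shift) / 8;   /* 8-byte PTEs -> 8192 */
	unsigned long pmd_size = ptes_per_table << page_shift;    /* one PMD entry maps this much */
	unsigned long pageblock_order = 13;                       /* follows the PMD order here */
	unsigned long pageblock_size = (1UL << pageblock_order) << page_shift;

	printf("PMD THP size:   %lu MB\n", pmd_size >> 20);        /* 512 */
	printf("pageblock size: %lu MB\n", pageblock_size >> 20);  /* 512 */
	return 0;
}

So on an arm64 kernel with 64KB pages a single pageblock is already 512MB,
which is why the watermark and compaction numbers look so extreme.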
>>>
>>> Of course, such is the 'organic' nature of kernel development :)
>>>
>>>>
>>>>>
>>>>> I mean, I don't really know the compaction code _at all_ (ran out of time
>>>>> to cover it in the book ;), but is it all or nothing? Does it grab a pageblock or
>>>>> give up?
>>>>
>>>> Compaction works on one pageblock at a time, trying to migrate in-use pages
>>>> within the pageblock away to create a free page for THP allocation.
>>>> It assumes PMD THP size is equal to pageblock size. It will keep working
>>>> until a PMD-THP-sized free page is created. This is a very high level
>>>> description, omitting a lot of details like how to avoid excessive compaction
>>>> work and how to reduce compaction latency.
>>>
>>> Yeah this matches my assumptions.
>>>
>>>>
>>>>>
>>>>> Because it strikes me that a crazy pageblock size would cause really
>>>>> serious system issues on that basis alone if that's the case.
>>>>>
>>>>> And again this leads me back to thinking it should just be the page block
>>>>> size _as a whole_ that should be adjusted.
>>>>>
>>>>> Keep in mind a user can literally reduce the page block size already via
>>>>> CONFIG_PAGE_BLOCK_MAX_ORDER.
>>>>>
>>>>> To me it seems that we should cap it at the highest _reasonable_ mTHP size
>>>>> you can get on a 64KB (i.e. maximum right? RIGHT? :P) base page size
>>>>> system.
>>>>>
>>>>> That way, people _can still get_ super huge PMD sized huge folios up to the
>>>>> point of fragmentation.
>>>>>
>>>>> If we do reduce things this way we should give a config option to allow
>>>>> users who truly want colossal PMD sizes with associated
>>>>> watermarks/compaction to be able to still have it.
>>>>>
>>>>> CONFIG_PAGE_BLOCK_HARD_LIMIT_MB or something?
>>>>
>>>> I agree with capping pageblock size at the highest reasonable mTHP size.
>>>> In case there is some user relying on this huge PMD THP, making
>>>> pageblock a boot time variable might be a little better, since
>>>> they would not need to recompile the kernel for it, assuming
>>>> distros will pick something like 2MB as the default pageblock size.
>>>
>>> Right, this seems sensible, as long as we set a _default_ that limits to
>>> whatever it would be, 2MB or such.
>>>
>>> I don't think it's unreasonable to make that change since this 512 MB thing
>>> is so entirely unexpected and unusual.
>>>
>>> I think Usama said it would be a pain if it had to be
>>> explicitly set as a boot time variable without defaulting like this.
>>>
>>>>
>>>>>
>>>>> I also question this de-coupling in general (I may be missing something
>>>>> however!) - the watermark code _very explicitly_ refers to providing
>>>>> _pageblocks_ in order to ensure _defragmentation_ right?
>>>>
>>>> Yes. Without enough free memory (bigger than a PMD THP),
>>>> memory compaction will just do useless work.
>>>
>>> Yeah right, so this is a key thing and why we need to rework the current
>>> state of the patch.
>>>
>>>>
>>>>>
>>>>> We would need to absolutely justify why it's suddenly ok to not provide
>>>>> page blocks here.
>>>>>
>>>>> This is very very delicate code we have to be SO careful about.
>>>>>
>>>>> This is why I am being cautious here :)
>>>>
>>>> Understood. In theory, we could associate watermarks with the allowed THP
>>>> orders the other way around too, meaning if a user lowers vm.min_free_kbytes,
>>>> all THP/mTHP sizes bigger than the watermark threshold are disabled
>>>> automatically. This could fix the memory compaction issues, but
>>>> it might also drive users crazy, as they cannot use the THP sizes
>>>> they want.
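To put rough numbers on why the watermark side is the sore point: below is a
tiny userspace sketch of how a reserve that scales linearly with pageblock
size behaves when the pageblock grows from 2MB to 512MB. The per-zone
pageblock count and the zone count are made-up factors purely for
illustration; the kernel's real formula differs in its details.

/* Illustration only, not kernel code: a reserve proportional to pageblock size. */
#include <stdio.h>

int main(void)
{
	const unsigned long pageblocks_per_zone = 11;   /* assumed scaling factor */
	const unsigned long nr_zones = 4;               /* assumed zone count */
	const unsigned long pageblock_kb[] = { 2UL << 10, 512UL << 10 };

	for (int i = 0; i < 2; i++) {
		unsigned long reserve_kb =
			pageblock_kb[i] * pageblocks_per_zone * nr_zones;
		printf("pageblock %3lu MB -> reserve ~%lu MB\n",
		       pageblock_kb[i] >> 10, reserve_kb >> 10);
	}
	return 0;
}

Whatever the exact constants, a pageblock-proportional reserve grows 256x
going from 2MB to 512MB pageblocks, which is the blow-up the RFC is trying
to avoid.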
>>>
>>> Yeah that's interesting but I think that's just far too subtle and people will
>>> have no idea what's going on.
>>>
>>> I really think a hard cap, expressed in KB/MB, on pageblock size is the way to
>>> go (but overrideable for people crazy enough to truly want 512 MB pages - and
>>> who cannot then complain about watermarks).
>>
>> I agree. Basically, I am thinking:
>> 1) use something like 2MB as the default pageblock size for all arches (the value
>> can be set differently if some arch wants a different pageblock size due to other
>> reasons); this can be done by modifying PAGE_BLOCK_MAX_ORDER’s default value;
>>
>> 2) make pageblock_order a boot time parameter, so that users who want
>> 512MB pages can still get them by changing the pageblock order at boot time.
>>
>> WDYT?
>>
>
> I was really hoping we would come up with a dynamic way of doing this,
> especially one that doesn't require any more input from the user apart
> from just setting the mTHP size via sysfs..

Then we will need to get rid of pageblock size in both the watermark calculation
and memory compaction, and think about a new anti-fragmentation mechanism to
handle unmovable pages, as the current pageblock-based mechanism no longer fits
the need.

What you are expecting is:
1) watermarks should change as the largest enabled THP/mTHP size changes;
2) memory compaction targets the largest enabled THP/mTHP size (a next step
would improve memory compaction to optimize for all enabled sizes);
3) partitions of movable and unmovable pages can change dynamically based on
the largest enabled THP/mTHP size;
4) pageblock size becomes irrelevant.

>
> 1) in a way is already done. We can set it to 2M by setting
> ARCH_FORCE_MAX_ORDER to 5:
>
> In arch/arm64/Kconfig we already have:
>
> config ARCH_FORCE_MAX_ORDER
>         int
>         default "13" if ARM64_64K_PAGES
>         default "11" if ARM64_16K_PAGES
>         default "10"

Nah, that means users can no longer allocate pages larger than 2MB,
because the cap is in the buddy allocator.

>
> Doing 2) would require a reboot, and doing this just for changing the mTHP size
> will probably be a nightmare for workload orchestration.

No, that is not what I mean. A pageblock_order set at boot time only limits
the largest mTHP size. By default, users can get up to 2MB THP/mTHP, but
if they want 512MB THP, they can reboot with a larger pageblock order and
still use 2MB mTHP. The downside is that with the larger pageblock order,
they cannot get the optimal THP/mTHP performance the kernel is designed to
achieve.

Best Regards,
Yan, Zi
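P.S. Purely for illustration, here is roughly what the boot-time knob from 2)
could look like if pageblock_order were turned into a writable variable. The
parameter name, the helper, and the clamping are assumptions for discussion,
not existing kernel code:

/*
 * Hypothetical sketch: a boot-time override, assuming pageblock_order has
 * been made a writable variable (it is not, for most configs today).
 */
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/mmzone.h>

static int __init early_pageblock_order(char *buf)
{
	unsigned int order;

	if (kstrtouint(buf, 0, &order))
		return -EINVAL;

	/* Never exceed what the buddy allocator can hand out in one piece. */
	pageblock_order = min_t(unsigned int, order, MAX_PAGE_ORDER);
	return 0;
}
early_param("pageblock_order", early_pageblock_order);

With a distro default of 2MB, someone who really wants 512MB pageblocks on a
64KB-page kernel could then boot with something like pageblock_order=13
instead of recompiling.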