On 09/06/2025 20:49, Zi Yan wrote:
> On 9 Jun 2025, at 15:40, Lorenzo Stoakes wrote:
>
>> On Mon, Jun 09, 2025 at 11:20:04AM -0400, Zi Yan wrote:
>>> On 9 Jun 2025, at 10:50, Lorenzo Stoakes wrote:
>>>
>>>> On Mon, Jun 09, 2025 at 10:37:26AM -0400, Zi Yan wrote:
>>>>> On 9 Jun 2025, at 10:16, Lorenzo Stoakes wrote:
>>>>>
>>>>>> On Mon, Jun 09, 2025 at 03:11:27PM +0100, Usama Arif wrote:
>>>>
>>>> [snip]
>>>>
>>>>>>> So I guess the question is what should be the next step? The following has been discussed:
>>>>>>>
>>>>>>> - Changing pageblock_order at runtime: this seems unreasonable after Zi's explanation above
>>>>>>>   and might have unintended consequences if done at runtime, so a no go?
>>>>>>> - Decouple only watermark calculation and defrag granularity from pageblock order (also from Zi).
>>>>>>>   The decoupling can be done separately. Watermark calculation can be decoupled using the
>>>>>>>   approach taken in this RFC, although the max order used by the page cache needs to be addressed.
>>>>>>>
>>>>>> I need to catch up with the thread (workload crazy atm), but why isn't it
>>>>>> feasible to simply statically adjust the pageblock size?
>>>>>>
>>>>>> The whole point of 'defragmentation' is to _heuristically_ make it less
>>>>>> likely there'll be fragmentation when requesting page blocks.
>>>>>>
>>>>>> And the watermark code is explicitly about providing reserves at a
>>>>>> _pageblock granularity_.
>>>>>>
>>>>>> Why would we want to 'defragment' to 512MB physically contiguous chunks
>>>>>> that we rarely use?
>>>>>>
>>>>>> Since it's all heuristic, it seems reasonable to me to cap it at a sensible
>>>>>> level, no?
>>>>>
>>>>> What is a sensible level? 2MB is a good starting point. If we cap the pageblock
>>>>> at 2MB, everyone should be happy at the moment. But if a user wants to
>>>>> allocate a 4MB mTHP, they will most likely fail miserably, because with a 2MB
>>>>> pageblock the kernel is OK with having a 2MB MIGRATE_MOVABLE pageblock next to
>>>>> a 2MB MIGRATE_UNMOVABLE one, making defragmenting 4MB an impossible job.
>>>>>
>>>>> Defragmentation has two components: 1) pageblocks, which have migratetypes
>>>>> to prevent mixing movable and unmovable pages, as a single unmovable page
>>>>> blocks large free pages from being created; 2) memory compaction granularity,
>>>>> which is the actual work of moving pages around to form large free pages.
>>>>> Currently, the kernel assumes pageblock size = defragmentation granularity,
>>>>> but in reality, as long as pageblock size >= defragmentation granularity,
>>>>> memory compaction will still work, but not the other way around. So we
>>>>> need to choose the pageblock size carefully to not break memory compaction.
>>>>
>>>> OK I get it - the issue is that compaction itself operates at a pageblock
>>>> granularity, and once you get so fragmented that compaction is critical to
>>>> defragmentation, you are stuck if the pageblock is not big enough.
>>>
>>> Right.
>>>
>>>> Thing is, a 512MB pageblock size for compaction seems insanely inefficient in
>>>> itself, and if we're complaining about issues with unavailable reserved
>>>> memory due to a crazy PMD size, surely one will encounter the compaction
>>>> process simply failing to succeed/taking forever/causing issues with
>>>> reclaim/higher order folio allocation.
>>>
>>> Yep. Initially, we probably never thought PMD THP would be as large as
>>> 512MB.
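
As a side note for anyone skimming the thread: the watermark cost being
discussed is easiest to see in khugepaged's set_recommended_min_free_kbytes().
The sketch below is simplified from mm/khugepaged.c and written from memory,
so treat it as illustrative rather than the exact upstream code. The key point
is that the reserve scales linearly with pageblock_nr_pages, so going from a
2MB to a 512MB pageblock multiplies the request by 256x, at which point the 5%
lowmem cap is usually what ends up setting min_free_kbytes:

	static void set_recommended_min_free_kbytes(void)
	{
		struct zone *zone;
		unsigned long recommended_min;
		int nr_zones = 0;

		for_each_populated_zone(zone)
			nr_zones++;

		/* Ensure 2 pageblocks are free to assist fragmentation avoidance. */
		recommended_min = pageblock_nr_pages * nr_zones * 2;

		/*
		 * On average keep two pageblocks nearly free of each other
		 * migratetype: one for a migratetype to fall back to and a
		 * second to avoid subsequent fallbacks of other types.
		 */
		recommended_min += pageblock_nr_pages * nr_zones *
				   MIGRATE_PCPTYPES * MIGRATE_PCPTYPES;

		/* Don't ever allow reserving more than 5% of the lowmem. */
		recommended_min = min(recommended_min,
				      (unsigned long)nr_free_buffer_pages() / 20);
		recommended_min <<= (PAGE_SHIFT - 10);	/* pages -> kbytes */

		if (recommended_min > min_free_kbytes)
			min_free_kbytes = recommended_min;
	}
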
>>
>> Of course, such is the 'organic' nature of kernel development :)
>>
>>>
>>>> I mean, I don't really know the compaction code _at all_ (ran out of time
>>>> to cover it in the book ;), but is it all or nothing? Does it grab a pageblock or
>>>> give up?
>>>
>>> Compaction works on one pageblock at a time, trying to migrate in-use pages
>>> within the pageblock away to create a free page for THP allocation.
>>> It assumes PMD THP size is equal to pageblock size. It will keep working
>>> until a PMD-THP-sized free page is created. This is a very high level
>>> description, omitting a lot of details like how to avoid excessive compaction
>>> work and how to reduce compaction latency.
>>
>> Yeah, this matches my assumptions.
>>
>>>
>>>>
>>>> Because it strikes me that a crazy pageblock size would cause really
>>>> serious system issues on that basis alone if that's the case.
>>>>
>>>> And again this leads me back to thinking it should just be the page block
>>>> size _as a whole_ that should be adjusted.
>>>>
>>>> Keep in mind a user can literally reduce the page block size already via
>>>> CONFIG_PAGE_BLOCK_MAX_ORDER.
>>>>
>>>> To me it seems that we should cap it at the highest _reasonable_ mTHP size
>>>> you can get on a 64KB (i.e. maximum right? RIGHT? :P) base page size
>>>> system.
>>>>
>>>> That way, people _can still get_ super huge PMD-sized huge folios up to the
>>>> point of fragmentation.
>>>>
>>>> If we do reduce things this way we should give a config option to allow
>>>> users who truly want colossal PMD sizes with the associated
>>>> watermarks/compaction to be able to still have them.
>>>>
>>>> CONFIG_PAGE_BLOCK_HARD_LIMIT_MB or something?
>>>
>>> I agree with capping pageblock size at the highest reasonable mTHP size.
>>> In case there is some user relying on these huge PMD THPs, making
>>> pageblock size a boot time variable might be a little better, since
>>> they would not need to recompile the kernel for their needs, assuming
>>> distros will pick something like 2MB as the default pageblock size.
>>
>> Right, this seems sensible, as long as we set a _default_ that limits it to
>> whatever it would be, 2MB or such.
>>
>> I don't think it's unreasonable to make that change since this 512 MB thing
>> is so entirely unexpected and unusual.
>>
>> I think Usama said it would be a pain working this way if it had to be
>> explicitly set as a boot time variable without defaulting like this.
>>
>>>
>>>>
>>>> I also question this de-coupling in general (I may be missing something
>>>> however!) - the watermark code _very explicitly_ refers to providing
>>>> _pageblocks_ in order to ensure _defragmentation_, right?
>>>
>>> Yes. Without enough free memory (bigger than a PMD THP),
>>> memory compaction will just do useless work.
>>
>> Yeah right, so this is a key thing and why we need to rework the current
>> state of the patch.
>>
>>>
>>>>
>>>> We would need to absolutely justify why it's suddenly ok to not provide
>>>> page blocks here.
>>>>
>>>> This is very, very delicate code we have to be SO careful about.
>>>>
>>>> This is why I am being cautious here :)
>>>
>>> Understood. In theory, we could associate watermarks with THP allowed orders
>>> the other way around too, meaning that if a user lowers vm.min_free_kbytes,
>>> all THP/mTHP sizes bigger than the watermark threshold are disabled
>>> automatically. This could fix the memory compaction issues, but
>>> it might also drive users crazy as they cannot use the THP sizes
>>> they want.
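
Another aside, to make Zi's "one pageblock at a time" description above
concrete: a crude mental model of compaction is the sketch below. This is
emphatically not the real mm/compaction.c code (which has separate migrate and
free scanners, sync/async modes, skip bits, deferral, and so on), and every
helper name here is made up purely for illustration:

	/* Crude sketch; all helpers below are hypothetical. */
	static bool compact_zone_sketch(struct zone *zone, unsigned int order)
	{
		unsigned long pfn;

		/* Walk the zone one pageblock at a time. */
		for (pfn = zone->zone_start_pfn; pfn < zone_end_pfn(zone);
		     pfn += pageblock_nr_pages) {

			/* Skip pageblocks that should not be disturbed. */
			if (!block_worth_compacting(pfn))	/* hypothetical */
				continue;

			/* Migrate in-use movable pages out of this block. */
			migrate_block_away(pfn);		/* hypothetical */

			/* Stop once a free page of the target order exists. */
			if (zone_has_free_page(zone, order))	/* hypothetical */
				return true;
		}
		return false;
	}

Since the target order is effectively pageblock_order today, a 512MB pageblock
means compaction keeps churning until it has manufactured a 512MB free chunk,
which is exactly the inefficiency Lorenzo is pointing at.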
>>
>> Yeah, that's interesting, but I think that's just far too subtle and people will
>> have no idea what's going on.
>>
>> I really think a hard cap, expressed in KB/MB, on pageblock size is the way to
>> go (but overridable for people crazy enough to truly want 512 MB pages - and
>> who cannot then complain about watermarks).
>
> I agree. Basically, I am thinking:
> 1) use something like 2MB as the default pageblock size for all arches (the value
> can be set differently if some arch wants a different pageblock size for other
> reasons); this can be done by modifying PAGE_BLOCK_MAX_ORDER’s default value;
>
> 2) make pageblock_order a boot time parameter, so that a user who wants
> 512MB pages can still get them by changing the pageblock order at boot time.
>
> WDYT?
>

I was really hoping we would come up with a dynamic way of doing this,
especially one that doesn't require any more input from the user apart from
just setting the mTHP size via sysfs.

1) is, in a way, already done. We can set it to 2M by setting
ARCH_FORCE_MAX_ORDER to 5; in arch/arm64/Kconfig we already have:

config ARCH_FORCE_MAX_ORDER
	int
	default "13" if ARM64_64K_PAGES
	default "11" if ARM64_16K_PAGES
	default "10"

Doing 2) would require a reboot, and rebooting just to change the mTHP size
will probably be a nightmare for workload orchestration.
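
For completeness, if we did go with 2), I imagine it would look something like
the sketch below. This is entirely hypothetical: the parameter name and the
clamping are my assumptions, nothing like this exists upstream, and it assumes
pageblock_order were made a real variable on all configs (today it is only a
variable with HUGETLB_PAGE_SIZE_VARIABLE):

	/* Hypothetical sketch only; no such boot parameter exists today. */
	static int __init early_pageblock_order(char *buf)
	{
		unsigned int order;

		if (!buf || kstrtouint(buf, 10, &order))
			return -EINVAL;

		/* Keep pageblocks within what the buddy allocator manages. */
		pageblock_order = min(order, (unsigned int)MAX_PAGE_ORDER);
		return 0;
	}
	early_param("pageblock_order", early_pageblock_order);

That would at least let anyone who really wants 512MB pageblocks opt back in
without a recompile, but it still leaves the reboot-per-policy-change problem
above.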