On Mon, Jun 09, 2025 at 11:20:04AM -0400, Zi Yan wrote:
> On 9 Jun 2025, at 10:50, Lorenzo Stoakes wrote:
>
> > On Mon, Jun 09, 2025 at 10:37:26AM -0400, Zi Yan wrote:
> >> On 9 Jun 2025, at 10:16, Lorenzo Stoakes wrote:
> >>
> >>> On Mon, Jun 09, 2025 at 03:11:27PM +0100, Usama Arif wrote:
> >
> > [snip]
> >
> >>>> So I guess the question is what should be the next step? The following has been discussed:
> >>>>
> >>>> - Changing pageblock_order at runtime: This seems unreasonable after Zi's explanation above
> >>>> and might have unintended consequences if done at runtime, so a no go?
> >>>> - Decouple only watermark calculation and defrag granularity from pageblock order (also from Zi).
> >>>> The decoupling can be done separately. Watermark calculation can be decoupled using the
> >>>> approach taken in this RFC. Although max order used by pagecache needs to be addressed.
> >>>>
> >>>
> >>> I need to catch up with the thread (workload crazy atm), but why isn't it
> >>> feasible to simply statically adjust the pageblock size?
> >>>
> >>> The whole point of 'defragmentation' is to _heuristically_ make it less
> >>> likely there'll be fragmentation when requesting page blocks.
> >>>
> >>> And the watermark code is explicitly about providing reserves at a
> >>> _pageblock granularity_.
> >>>
> >>> Why would we want to 'defragment' to 512MB physically contiguous chunks
> >>> that we rarely use?
> >>>
> >>> Since it's all heuristic, it seems reasonable to me to cap it at a sensible
> >>> level, no?
> >>
> >> What is a sensible level? 2MB is a good starting point. If we cap pageblock
> >> at 2MB, everyone should be happy at the moment. But if one user wants to
> >> allocate 4MB mTHP, they will most likely fail miserably, because pageblock
> >> is 2MB, and the kernel is OK to have a 2MB MIGRATE_MOVABLE pageblock next to
> >> a 2MB MIGRATE_UNMOVABLE one, making defragmenting 4MB an impossible job.
> >>
> >> Defragmentation has two components: 1) pageblock, which has migratetypes
> >> to prevent mixing movable and unmovable pages, as a single unmovable page
> >> blocks large free pages from being created; 2) memory compaction granularity,
> >> which is the actual work to move pages around and form large free pages.
> >> Currently, the kernel assumes pageblock size = defragmentation granularity,
> >> but in reality, as long as pageblock size >= defragmentation granularity,
> >> memory compaction would still work, but not the other way around. So we
> >> need to choose pageblock size carefully to not break memory compaction.
> >
> > OK I get it - the issue is that compaction itself operates at a pageblock
> > granularity, and once you get so fragmented that compaction is critical to
> > defragmentation, you are stuck if the pageblock is not big enough.
>
> Right.
>
> >
> > Thing is, 512MB pageblock size for compaction seems insanely inefficient in
> > itself, and if we're complaining about issues with unavailable reserved
> > memory due to crazy PMD size, surely one will encounter the compaction
> > process simply failing to succeed/taking forever/causing issues with
> > reclaim/higher order folio allocation.
>
> Yep. Initially, we probably never thought PMD THP would be as large as
> 512MB.

Of course, such is the 'organic' nature of kernel development :)

> >
> > I mean, I don't really know the compaction code _at all_ (ran out of time
> > to cover in book ;), but is it all or-nothing? Does it grab a pageblock or
> > give up?
>
> compaction works on one pageblock at a time, trying to migrate in-use pages
> within the pageblock away to create a free page for THP allocation.
> It assumes PMD THP size is equal to pageblock size. It will keep working
> until a PMD THP sized free page is created. This is a very high level
> description, omitting a lot of details like how to avoid excessive compaction
> work and how to reduce compaction latency.

Yeah this matches my assumptions.

> >
> > Because it strikes me that a crazy pageblock size would cause really
> > serious system issues on that basis alone if that's the case.
> >
> > And again this leads me back to thinking it should just be the page block
> > size _as a whole_ that should be adjusted.
> >
> > Keep in mind a user can literally reduce the page block size already via
> > CONFIG_PAGE_BLOCK_MAX_ORDER.
> >
> > To me it seems that we should cap it at the highest _reasonable_ mTHP size
> > you can get on a 64KB (i.e. maximum right? RIGHT? :P) base page size
> > system.
> >
> > That way, people _can still get_ super huge PMD sized huge folios up to the
> > point of fragmentation.
> >
> > If we do reduce things this way we should give a config option to allow
> > users who truly want colossal PMD sizes with associated
> > watermarks/compaction to be able to still have it.
> >
> > CONFIG_PAGE_BLOCK_HARD_LIMIT_MB or something?
>
> I agree with capping pageblock size at the highest reasonable mTHP size.
> In case there is some user relying on this huge PMD THP, making
> pageblock a boot time variable might be a little better, since
> they do not need to recompile the kernel for their need, assuming
> distros will pick something like 2MB as the default pageblock size.

Right, this seems sensible, as long as we set a _default_ that limits it to
whatever it would be, 2MB or such. I don't think it's unreasonable to make
that change since this 512 MB thing is so entirely unexpected and unusual.
I think Usama said it would be a pain working this way if it had to be
explicitly set as a boot time variable without defaulting like this.

> >
> > I also question this de-coupling in general (I may be missing something
> > however!) - the watermark code _very explicitly_ refers to providing
> > _pageblocks_ in order to ensure _defragmentation_ right?
>
> Yes. Since without enough free memory (bigger than a PMD THP),
> memory compaction will just do useless work.

Yeah right, so this is a key thing and why we need to rework the current
state of the patch.

>
> >
> > We would need to absolutely justify why it's suddenly ok to not provide
> > page blocks here.
> >
> > This is very very delicate code we have to be SO careful about.
> >
> > This is why I am being cautious here :)
>
> Understood. In theory, we can associate watermarks with THP allowed orders
> the other way around too, meaning if the user lowers vm.min_free_kbytes,
> all THP/mTHP sizes bigger than the watermark threshold are disabled
> automatically. This could fix the memory compaction issues, but
> that might also drive users crazy as they cannot use the THP sizes
> they want.

Yeah that's interesting but I think that's just far too subtle and people
will have no idea what's going on.

I really think a hard cap, expressed in KB/MB, on pageblock size is the way
to go (but overrideable for people crazy enough to truly want 512 MB pages -
and who cannot then complain about watermarks).
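
Just to make the cap arithmetic concrete, here's a throwaway userspace
sketch - emphatically not kernel code, size_to_order()/report() are invented
for illustration - plugging in the numbers from this thread: a 2MB cap
covers the 2MB PMD THP case but not a 4MB mTHP on 4kB base pages, and an
uncapped 64kB base page system ends up with a 512MB pageblock:

/*
 * Back-of-envelope sketch: for a given base page size and pageblock cap,
 * work out the orders and whether per-pageblock compaction could still
 * service a THP/mTHP of that size (i.e. THP order <= pageblock order).
 */
#include <stdio.h>

/* Allocation order for 'size' bytes given 'base' bytes per page. */
static unsigned int size_to_order(unsigned long size, unsigned long base)
{
        unsigned int order = 0;

        while ((base << order) < size)
                order++;
        return order;
}

static void report(unsigned long base, unsigned long pageblock_bytes,
                   unsigned long thp_bytes)
{
        unsigned int pageblock_order = size_to_order(pageblock_bytes, base);
        unsigned int thp_order = size_to_order(thp_bytes, base);

        printf("base %lukB, pageblock %lukB (order %u), THP %lukB (order %u): %s\n",
               base >> 10, pageblock_bytes >> 10, pageblock_order,
               thp_bytes >> 10, thp_order,
               thp_order <= pageblock_order ?
                        "fits in one pageblock, compaction can help" :
                        "spans pageblocks, defrag can't guarantee it");
}

int main(void)
{
        /* 4kB base pages, pageblock capped at 2MB, 2MB PMD THP */
        report(4096, 2UL << 20, 2UL << 20);
        /* 4kB base pages, pageblock capped at 2MB, 4MB mTHP */
        report(4096, 2UL << 20, 4UL << 20);
        /* 64kB base pages, uncapped pageblock == PMD size == 512MB */
        report(65536, 512UL << 20, 512UL << 20);
        return 0;
}

Obviously the real question is about migratetype granularity and reserves,
not this arithmetic, but it's where the 2MB-vs-4MB line above comes from.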
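
And a toy model of the per-pageblock compaction Zi describes above - nothing
like the real mm/compaction.c, the names and the "migration" are entirely
fake - mostly to show why migratetypes matter: a single unmovable page in
the block means you can never turn it into one big free page:

/*
 * Toy: "compact" one pageblock by migrating its movable pages away so the
 * whole block becomes free. One unmovable page defeats this, which is why
 * pageblocks carry a migratetype to keep unmovable allocations grouped.
 */
#include <stdbool.h>
#include <stdio.h>

enum page_state { FREE, MOVABLE, UNMOVABLE };

#define PAGEBLOCK_NR_PAGES 8    /* tiny pageblock, purely for illustration */

static bool compact_pageblock(enum page_state *block)
{
        for (int i = 0; i < PAGEBLOCK_NR_PAGES; i++) {
                if (block[i] == UNMOVABLE)
                        return false;   /* can never make the block fully free */
                if (block[i] == MOVABLE)
                        block[i] = FREE;        /* pretend migration succeeded */
        }
        return true;    /* block is now one big free page */
}

int main(void)
{
        enum page_state movable[PAGEBLOCK_NR_PAGES] =
                { MOVABLE, FREE, MOVABLE, MOVABLE, FREE, MOVABLE, MOVABLE, FREE };
        enum page_state mixed[PAGEBLOCK_NR_PAGES] =
                { MOVABLE, FREE, UNMOVABLE, MOVABLE, FREE, MOVABLE, MOVABLE, FREE };

        printf("all-movable pageblock: %s\n",
               compact_pageblock(movable) ? "freed" : "stuck");
        printf("pageblock with one unmovable page: %s\n",
               compact_pageblock(mixed) ? "freed" : "stuck");
        return 0;
}

Both sketches compile with plain gcc; they're only there to pin down the
mental model, not to propose an implementation.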
>
> Often, users just ask for an impossible combination: they
> want to use all free memory, because they paid for it, and they
> want THPs, because they want max performance. When PMD THP is
> small like 2MB, the “unusable” free memory is not that noticeable,
> but when PMD THP is as large as 512MB, users just cannot unsee it. :)

Well, users asking for crazy things then being surprised when they get them
is nothing new :P

>
>
> Best Regards,
> Yan, Zi

Thanks for your input!

Cheers, Lorenzo