On Mon, Jun 09, 2025 at 10:37:26AM -0400, Zi Yan wrote:
> On 9 Jun 2025, at 10:16, Lorenzo Stoakes wrote:
>
> > On Mon, Jun 09, 2025 at 03:11:27PM +0100, Usama Arif wrote:
[snip]
> >> So I guess the question is what should be the next step? The following has been discussed:
> >>
> >> - Changing pageblock_order at runtime: This seems unreasonable after Zi's explanation above
> >>   and might have unintended consequences if done at runtime, so a no go?
> >> - Decouple only watermark calculation and defrag granularity from pageblock order (also from Zi).
> >>   The decoupling can be done separately. Watermark calculation can be decoupled using the
> >>   approach taken in this RFC. Although max order used by pagecache needs to be addressed.
> >
> > I need to catch up with the thread (workload crazy atm), but why isn't it
> > feasible to simply statically adjust the pageblock size?
> >
> > The whole point of 'defragmentation' is to _heuristically_ make it less
> > likely there'll be fragmentation when requesting page blocks.
> >
> > And the watermark code is explicitly about providing reserves at a
> > _pageblock granularity_.
> >
> > Why would we want to 'defragment' to 512MB physically contiguous chunks
> > that we rarely use?
> >
> > Since it's all heuristic, it seems reasonable to me to cap it at a sensible
> > level no?
>
> What is a sensible level? 2MB is a good starting point. If we cap pageblock
> at 2MB, everyone should be happy at the moment. But if one user wants to
> allocate 4MB mTHP, they will most likely fail miserably, because pageblock
> is 2MB, kernel is OK to have a 2MB MIGRATE_MOVABLE pageblock next to a 2MB
> MIGRATE_UNMOVABLE one, making defragmenting 4MB an impossible job.
> Defragmentation has two components: 1) pageblock, which has migratetypes
> to prevent mixing movable and unmovable pages, as a single unmovable page
> blocks large free pages from being created; 2) memory compaction granularity,
> which is the actual work to move pages around and form large free pages.
> Currently, kernel assumes pageblock size = defragmentation granularity,
> but in reality, as long as pageblock size >= defragmentation granularity,
> memory compaction would still work, but not the other way around. So we
> need to choose pageblock size carefully to not break memory compaction.

OK I get it - the issue is that compaction itself operates at a pageblock
granularity, and once you get so fragmented that compaction is critical to
defragmentation, you are stuck if the pageblock is not big enough.

Thing is, a 512MB pageblock size for compaction seems insanely inefficient in
itself, and if we're complaining about issues with unavailable reserved memory
due to the crazy PMD size, surely one will encounter the compaction process
simply failing to succeed/taking forever/causing issues with reclaim/higher
order folio allocation.

I mean, I don't really know the compaction code _at all_ (ran out of time to
cover it in the book ;), but is it all-or-nothing? Does it grab a pageblock or
give up? Because it strikes me that a crazy pageblock size would cause really
serious system issues on that basis alone if that's the case.

And again this leads me back to thinking it should just be the pageblock size
_as a whole_ that should be adjusted. Keep in mind a user can literally reduce
the pageblock size already via CONFIG_PAGE_BLOCK_MAX_ORDER.

To me it seems that we should cap it at the highest _reasonable_ mTHP size you
can get on a 64KB (i.e. maximum right? RIGHT? :P) base page size system. That
way, people _can still get_ super huge PMD-sized huge folios up to the point
of fragmentation.
If we do reduce things this way, we should give a config option to allow users
who truly want colossal PMD sizes, with the associated watermarks/compaction,
to still be able to have them. CONFIG_PAGE_BLOCK_HARD_LIMIT_MB or something?

I also question this decoupling in general (I may be missing something
however!) - the watermark code _very explicitly_ refers to providing
_pageblocks_ in order to ensure _defragmentation_ right? We would need to
absolutely justify why it's suddenly OK to not provide pageblocks here.

This is very, very delicate code we have to be SO careful about. This is why I
am being cautious here :)

> Best Regards,
> Yan, Zi

Thanks!