On Mon, Jun 09, 2025 at 03:11:27PM +0100, Usama Arif wrote:
>
>
> On 09/06/2025 14:19, Zi Yan wrote:
> > On 9 Jun 2025, at 7:13, Usama Arif wrote:
> >
> >> On 06/06/2025 17:10, Zi Yan wrote:
> >>> On 6 Jun 2025, at 11:38, Usama Arif wrote:
> >>>
> >>>> On 06/06/2025 16:18, Zi Yan wrote:
> >>>>> On 6 Jun 2025, at 10:37, Usama Arif wrote:
> >>>>>
> >>>>>> On arm64 machines with 64K PAGE_SIZE, the min_free_kbytes and hence the
> >>>>>> watermarks are evaluated to extremely high values. E.g. on a server with
> >>>>>> 480G of memory, with only the 2M mTHP hugepage size set to madvise and
> >>>>>> the rest of the sizes set to never, the min, low and high watermarks
> >>>>>> evaluate to 11.2G, 14G and 16.8G respectively.
> >>>>>> In contrast, for 4K PAGE_SIZE on the same machine, with only the 2M THP
> >>>>>> hugepage size set to madvise, the min, low and high watermarks evaluate
> >>>>>> to 86M, 566M and 1G respectively.
> >>>>>> This is because set_recommended_min_free_kbytes() is designed for PMD
> >>>>>> hugepages (pageblock_order = min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)).
> >>>>>> Such high watermark values can cause performance and latency issues in
> >>>>>> memory-bound applications on arm servers that use 64K PAGE_SIZE, even
> >>>>>> though most of them would never actually use a 512M PMD THP.
> >>>>>>
> >>>>>> Instead of using HPAGE_PMD_ORDER for pageblock_order, use the highest
> >>>>>> large folio order enabled in set_recommended_min_free_kbytes().
> >>>>>> With this patch, when only the 2M THP hugepage size is set to madvise
> >>>>>> for the same machine with 64K page size, with the rest of the sizes set
> >>>>>> to never, the min, low and high watermarks evaluate to 2.08G, 2.6G and
> >>>>>> 3.1G respectively. When the 512M THP hugepage size is set to madvise for
> >>>>>> the same machine with 64K page size, the min, low and high watermarks
> >>>>>> evaluate to 11.2G, 14G and 16.8G respectively, the same as without this
> >>>>>> patch.
> >>>>>
> >>>>> Getting pageblock_order involved here might be confusing. I think you just
> >>>>> want to adjust min, low and high watermarks to reasonable values.
> >>>>> Is it OK to rename min_thp_pageblock_nr_pages to min_nr_free_pages_per_zone
> >>>>> and move MIGRATE_PCPTYPES * MIGRATE_PCPTYPES inside? Otherwise, the changes
> >>>>> look reasonable to me.
> >>>>
> >>>> Hi Zi,
> >>>>
> >>>> Thanks for the review!
> >>>>
> >>>> I forgot to change it in another place, sorry about that! So we can't move
> >>>> MIGRATE_PCPTYPES * MIGRATE_PCPTYPES into the combined function.
> >>>> I have added the additional place where min_thp_pageblock_nr_pages() is
> >>>> called as a fixlet here:
> >>>> https://lore.kernel.org/all/a179fd65-dc3f-4769-9916-3033497188ba@xxxxxxxxx/
> >>>>
> >>>> I think at least in this context the original name pageblock_nr_pages isn't
> >>>> correct, as it's min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER).
> >>>> The new name min_thp_pageblock_nr_pages is also not really good, so I am
> >>>> happy to change it to something appropriate.
> >>>
> >>> Got it. The pageblock is the defragmentation granularity. If the user only
> >>> wants 2MB mTHP, maybe pageblock order should be adjusted. Otherwise, the
> >>> kernel will defragment at 512MB granularity, which might not be efficient.
> >>> Maybe make pageblock_order a boot-time parameter?
> >>>
> >>> In addition, we are mixing two things together:
> >>> 1. min, low, and high watermarks: they affect when memory reclaim and
> >>> compaction will be triggered;
> >>> 2. pageblock order: it is the granularity of defragmentation for creating
> >>> mTHP/THP.
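Interjecting with a quick back-of-the-envelope on those numbers, since it
frames the rest of the discussion: if I'm reading set_recommended_min_free_kbytes()
right, the recommendation khugepaged makes scales linearly with
pageblock_nr_pages. The below is only a rough userspace sketch of that
arithmetic, not the kernel code verbatim, and it assumes two populated zones
and MIGRATE_PCPTYPES == 3:

#include <stdio.h>

/* Rough sketch of the sizing done in set_recommended_min_free_kbytes(). */
static unsigned long recommended_min_kbytes(unsigned int page_shift,
					    unsigned int pageblock_order,
					    unsigned int nr_zones)
{
	const unsigned long migrate_pcptypes = 3; /* UNMOVABLE, MOVABLE, RECLAIMABLE */
	unsigned long pageblock_nr_pages = 1UL << pageblock_order;
	unsigned long pages;

	/* two free pageblocks per zone for fragmentation avoidance ... */
	pages = pageblock_nr_pages * nr_zones * 2;
	/* ... plus MIGRATE_PCPTYPES^2 nearly-free pageblocks per zone */
	pages += pageblock_nr_pages * nr_zones * migrate_pcptypes * migrate_pcptypes;

	return pages << (page_shift - 10); /* pages -> kbytes */
}

int main(void)
{
	/* 4K pages, 2M pageblocks (order 9) vs 64K pages, 512M pageblocks (order 13) */
	printf("4K/2M:    %lu kB\n", recommended_min_kbytes(12, 9, 2));
	printf("64K/512M: %lu kB\n", recommended_min_kbytes(16, 13, 2));
	return 0;
}

With 512M pageblocks that works out to roughly 11G, the same ballpark as the
11.2G reported above, while with 4K pages and 2M pageblocks it is only about
44M, below the sqrt-based default from init_per_zone_wmark_min(), hence the
86M figure. The 5%-of-lowmem cap in that function should not kick in on a
480G box.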
> >>>
> >>> In your use case, you want to lower the watermarks, right? Considering what
> >>> you said below, I wonder if we want a way of enforcing vm.min_free_kbytes,
> >>> like a new sysctl knob, vm.force_min_free_kbytes (yeah, the suggestion is
> >>> lame, sorry).
> >>>
> >>> I think for 2, we might want to decouple pageblock order from
> >>> defragmentation granularity.
> >>>
> >>
> >> This is a good point. I only did it for the watermarks in the RFC, but there
> >> is no reason the defrag granularity has to be 512M chunks, and it is probably
> >> very inefficient to do so?
> >>
> >> Instead of replacing pageblock_nr_pages just for set_recommended_min_free_kbytes(),
> >> maybe we just need to change the definition of pageblock_order in [1] to take
> >> into account the highest large folio order enabled instead of HPAGE_PMD_ORDER?
> >
> > Ideally, yes. But pageblock migratetypes are stored in a fixed-size array
> > determined by pageblock_order at boot time (see usemap_size() in mm/mm_init.c).
> > Changing pageblock_order at runtime means we would need to resize the pageblock
> > migratetypes array, which is a little unrealistic. In a system with GBs or TBs
> > of memory, reducing pageblock_order by 1 means doubling the pageblock
> > migratetypes array and replicating one pageblock's migratetype to two;
> > increasing pageblock_order by 1 means halving the array and splitting a
> > pageblock into two. The former, if memory is enough, might be easy, but the
> > latter is a little involved, since for a pageblock with both movable and
> > unmovable pages, you will need to check all pages to decide the migratetypes
> > of the after-split pageblocks to make sure each pageblock migratetype matches
> > the pages inside that pageblock.
> >
>
> Thanks for explaining this so well, and for the code pointer!
> Yeah, it doesn't seem reasonable to change the size of pageblock_flags at runtime.
>
> >>
> >> [1] https://elixir.bootlin.com/linux/v6.15.1/source/include/linux/pageblock-flags.h#L50
> >>
> >> I really want to avoid coming up with a solution that requires changing a
> >> Kconfig or needs the kernel command line to change. It would mean a reboot
> >> whenever a different workload runs on a server that works optimally with a
> >> different THP size, and that would make workload orchestration a nightmare.
> >>
> >
> > As I said above, changing pageblock order at runtime might not be easy. But
> > changing the defragmentation granularity should be fine, since it just changes
> > the range of memory compaction. That is the reason for my proposal of
> > decoupling pageblock order from defragmentation granularity. We probably need
> > to do some experiments to see the impact of the decoupling, as I imagine
> > defragmenting a range smaller than pageblock order is fine, but defragmenting
> > a range larger than pageblock order might cause issues if there is any
> > unmovable pageblock within that range, since it is very likely that unmovable
> > pages reside in an unmovable pageblock and lead to a defragmentation failure.
> >
> >
>
> I saw you mentioned a proposal to decouple pageblock order from defrag
> granularity in one of the other replies as well; I just wanted to check if
> there was anything you had sent on lore in terms of a proposal or RFC that I
> could look at.
>
> So I guess the question is what should be the next step? The following has
> been discussed:
>
> - Changing pageblock_order at runtime: This seems unreasonable after Zi's
> explanation above and might have unintended consequences if done at runtime,
> so a no-go?
> - Decouple only watermark calculation and defrag granularity from pageblock
> order (also from Zi). The decoupling can be done separately. Watermark
> calculation can be decoupled using the approach taken in this RFC, although
> the max order used by pagecache needs to be addressed.
>

I need to catch up with the thread (workload crazy atm), but why isn't it
feasible to simply statically adjust the pageblock size?

The whole point of 'defragmentation' is to _heuristically_ make it less likely
there'll be fragmentation when requesting page blocks. And the watermark code
is explicitly about providing reserves at a _pageblock granularity_.

Why would we want to 'defragment' to 512MB physically contiguous chunks that
we rarely use?

Since it's all heuristic, it seems reasonable to me to cap it at a sensible
level, no?

>
> > --
> > Best Regards,
> > Yan, Zi
>
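For reference, on the watermark half of the list above, the "highest enabled
large folio order" idea could be as small as the sketch below. This is not the
actual RFC patch: it assumes the mTHP sysfs state is tracked in the
huge_anon_orders_* bitmaps as in mm/huge_memory.c, and it reuses the
min_thp_pageblock_nr_pages() name only because that is what this thread has
been calling it.

/* Sketch only: granularity to size the khugepaged watermark bump by. */
static unsigned long min_thp_pageblock_nr_pages(void)
{
	unsigned long enabled = READ_ONCE(huge_anon_orders_always) |
				READ_ONCE(huge_anon_orders_madvise);
	unsigned int order;

	/* Simplification: count "inherit" sizes only if the global toggle is on. */
	if (hugepage_global_enabled())
		enabled |= READ_ONCE(huge_anon_orders_inherit);

	if (!enabled)
		return pageblock_nr_pages;

	/* Highest enabled large folio order, never above the pageblock order. */
	order = min_t(unsigned int, fls_long(enabled) - 1, pageblock_order);
	return 1UL << order;
}

set_recommended_min_free_kbytes() would then scale its recommendation by this
instead of pageblock_nr_pages (the actual patch may well plumb it differently).
Either way, the question above still stands: if nobody asks for 512M THPs,
both the reserve sizing and the defragmentation granularity look like
candidates for capping at a more sensible level.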