On Fri, Aug 08, 2025 at 02:11:39PM +0200, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@xxxxxxxxxxx>
>
> Many places in the kernel need to zero out larger chunks, but the
> maximum segment that can be zeroed out at a time by ZERO_PAGE is
> limited to PAGE_SIZE.
>
> This is especially annoying in block devices and filesystems where
> multiple ZERO_PAGEs are attached to the bio in different bvecs. With
> multipage bvec support in the block layer, it is much more efficient to
> send out larger zero pages as part of a single bvec.
>
> This concern was raised during the review of adding Large Block Size
> support to XFS [1][2].
>
> Usually the huge_zero_folio is allocated on demand, and it is
> deallocated by the shrinker once there are no users of it left. At the
> moment, the huge_zero_folio infrastructure's refcount is tied to the
> lifetime of the process that created it. This might not work for the
> bio layer, as completions can be async and the process that created the
> huge_zero_folio might no longer be alive. One of the main points that
> came up during discussion is to have something bigger than the zero
> page as a drop-in replacement.
>
> Add a config option PERSISTENT_HUGE_ZERO_FOLIO that allocates the huge
> zero folio during early init and never frees the memory, by disabling
> the shrinker. This makes it possible to use the huge_zero_folio without
> passing any mm struct, and does not tie the lifetime of the zero folio
> to anything, making it a drop-in replacement for ZERO_PAGE.
>
> If the PERSISTENT_HUGE_ZERO_FOLIO config option is enabled,
> mm_get_huge_zero_folio() will simply return the already allocated page
> instead of dynamically allocating a new PMD page.
>
> Use this option carefully in resource constrained systems, as it uses
> one full PMD sized page for zeroing purposes.
>
> [1] https://lore.kernel.org/linux-xfs/20231027051847.GA7885@xxxxxx/
> [2] https://lore.kernel.org/linux-xfs/ZitIK5OnR7ZNY0IG@xxxxxxxxxxxxx/
>
> Co-developed-by: David Hildenbrand <david@xxxxxxxxxx>
> Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
> Signed-off-by: Pankaj Raghav <p.raghav@xxxxxxxxxxx>

This is much nicer and now _super_ simple, I like it.

A few nits below, but generally:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>

> ---
>  include/linux/huge_mm.h | 16 ++++++++++++++++
>  mm/Kconfig              | 16 ++++++++++++++++
>  mm/huge_memory.c        | 40 ++++++++++++++++++++++++++++++----------
>  3 files changed, 62 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 7748489fde1b..bd547857c6c1 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -495,6 +495,17 @@ static inline bool is_huge_zero_pmd(pmd_t pmd)
>  struct folio *mm_get_huge_zero_folio(struct mm_struct *mm);
>  void mm_put_huge_zero_folio(struct mm_struct *mm);
>
> +static inline struct folio *get_persistent_huge_zero_folio(void)
> +{
> +	if (!IS_ENABLED(CONFIG_PERSISTENT_HUGE_ZERO_FOLIO))
> +		return NULL;
> +
> +	if (unlikely(!huge_zero_folio))
> +		return NULL;
> +
> +	return huge_zero_folio;
> +}
> +
>  static inline bool thp_migration_supported(void)
>  {
>  	return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
> @@ -685,6 +696,11 @@ static inline int change_huge_pud(struct mmu_gather *tlb,
>  {
>  	return 0;
>  }
> +
> +static inline struct folio *get_persistent_huge_zero_folio(void)
> +{
> +	return NULL;
> +}
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
>  static inline int split_folio_to_list_to_order(struct folio *folio,
> diff --git a/mm/Kconfig b/mm/Kconfig
> index e443fe8cd6cf..fbe86ef97fd0 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -823,6 +823,22 @@ config ARCH_WANT_GENERAL_HUGETLB
>  config ARCH_WANTS_THP_SWAP
>  	def_bool n
>
> +config PERSISTENT_HUGE_ZERO_FOLIO
> +	bool "Allocate a PMD sized folio for zeroing"
> +	depends on TRANSPARENT_HUGEPAGE

I feel like we really need to sort out what is/isn't predicated on
THP... it seems like THP is sort of shorthand for 'any large folio
stuff', but not always...

But this is a more general point :)

> +	help
> +	  Enable this option to reduce the runtime refcounting overhead
> +	  of the huge zero folio and expand the places in the kernel
> +	  that can use huge zero folios. This can potentially improve
> +	  the performance while performing an I/O.

NIT: I think we can drop 'an', and probably refactor this sentence to
something like 'For instance, block I/O benefits from access to large
folios for zeroing memory'.

> +
> +	  With this option enabled, the huge zero folio is allocated
> +	  once and never freed. One full huge page worth of memory shall
> +	  be used.

NIT: huge page worth -> huge page's worth

> +
> +	  Say Y if your system has lots of memory. Say N if you are
> +	  memory constrained.
> +
>  config MM_ID
>  	def_bool n
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index ff06dee213eb..bedda9640936 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -248,6 +248,9 @@ static void put_huge_zero_folio(void)
>
>  struct folio *mm_get_huge_zero_folio(struct mm_struct *mm)
>  {
> +	if (IS_ENABLED(CONFIG_PERSISTENT_HUGE_ZERO_FOLIO))
> +		return huge_zero_folio;
> +
>  	if (test_bit(MMF_HUGE_ZERO_FOLIO, &mm->flags))
>  		return READ_ONCE(huge_zero_folio);
>
> @@ -262,6 +265,9 @@ struct folio *mm_get_huge_zero_folio(struct mm_struct *mm)
>
>  void mm_put_huge_zero_folio(struct mm_struct *mm)
>  {
> +	if (IS_ENABLED(CONFIG_PERSISTENT_HUGE_ZERO_FOLIO))
> +		return;
> +
>  	if (test_bit(MMF_HUGE_ZERO_FOLIO, &mm->flags))
>  		put_huge_zero_folio();
>  }
> @@ -849,16 +855,34 @@ static inline void hugepage_exit_sysfs(struct kobject *hugepage_kobj)
>
>  static int __init thp_shrinker_init(void)
>  {
> -	huge_zero_folio_shrinker = shrinker_alloc(0, "thp-zero");
> -	if (!huge_zero_folio_shrinker)
> -		return -ENOMEM;
> -
>  	deferred_split_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE |
>  						 SHRINKER_MEMCG_AWARE |
>  						 SHRINKER_NONSLAB,
>  						 "thp-deferred_split");
> -	if (!deferred_split_shrinker) {
> -		shrinker_free(huge_zero_folio_shrinker);
> +	if (!deferred_split_shrinker)
> +		return -ENOMEM;
> +
> +	deferred_split_shrinker->count_objects = deferred_split_count;
> +	deferred_split_shrinker->scan_objects = deferred_split_scan;
> +	shrinker_register(deferred_split_shrinker);
> +
> +	if (IS_ENABLED(CONFIG_PERSISTENT_HUGE_ZERO_FOLIO)) {
> +		/*
> +		 * Bump the reference of the huge_zero_folio and do not
> +		 * initialize the shrinker.
> +		 *
> +		 * huge_zero_folio will always be NULL on failure. We assume
> +		 * that get_huge_zero_folio() will most likely not fail as
> +		 * thp_shrinker_init() is invoked early on during boot.
> +		 */
> +		if (!get_huge_zero_folio())
> +			pr_warn("Allocating static huge zero folio failed\n");
> +		return 0;
> +	}
> +
> +	huge_zero_folio_shrinker = shrinker_alloc(0, "thp-zero");
> +	if (!huge_zero_folio_shrinker) {
> +		shrinker_free(deferred_split_shrinker);
>  		return -ENOMEM;
>  	}
>
> @@ -866,10 +890,6 @@ static int __init thp_shrinker_init(void)
>  	huge_zero_folio_shrinker->scan_objects = shrink_huge_zero_folio_scan;
>  	shrinker_register(huge_zero_folio_shrinker);
>
> -	deferred_split_shrinker->count_objects = deferred_split_count;
> -	deferred_split_shrinker->scan_objects = deferred_split_scan;
> -	shrinker_register(deferred_split_shrinker);
> -
>  	return 0;
>  }
>
> --
> 2.49.0
>