Re: [PATCH 3/5] mm: add static huge zero folio

On 04.08.25 18:46, Lorenzo Stoakes wrote:
On Mon, Aug 04, 2025 at 02:13:54PM +0200, Pankaj Raghav (Samsung) wrote:
From: Pankaj Raghav <p.raghav@xxxxxxxxxxx>

There are many places in the kernel where we need to zero out larger
chunks, but the maximum segment we can zero out at a time with ZERO_PAGE
is limited to PAGE_SIZE.

This is especially annoying in block devices and filesystems, where we
attach multiple ZERO_PAGEs to the bio in different bvecs. With multipage
bvec support in the block layer, it is much more efficient to send out
a larger zero page as part of a single bvec.

This concern was raised during the review of adding LBS support to
XFS[1][2].

Usually the huge_zero_folio is allocated on demand, and it is
deallocated by the shrinker once no users are left. At the moment, the
huge_zero_folio refcount is tied to the lifetime of the process that
created it. This might not work for the bio layer, as completions can be
async and the process that created the huge_zero_folio might no longer
be alive. And one of the main points raised during the discussion was to
have something bigger than ZERO_PAGE as a drop-in replacement.

Add a config option STATIC_HUGE_ZERO_FOLIO that results in allocating
the huge zero folio on first request, if not already allocated, and
making it static so that it can never be freed. This allows the
huge_zero_folio to be used without passing any mm struct and does not
tie the lifetime of the zero folio to anything, making it a drop-in
replacement for ZERO_PAGE.

If the STATIC_HUGE_ZERO_FOLIO config option is enabled,
mm_get_huge_zero_folio() will simply return this folio instead of
dynamically allocating a new PMD-sized page.

This option can waste memory on small systems or on systems with a 64k
base page size. So make it opt-in, and gate it behind a per-architecture
option so that we don't enable this feature on systems with larger base
page sizes. Only x86 is enabled as part of this series; other
architectures will be enabled as a follow-up to this series.

[1] https://lore.kernel.org/linux-xfs/20231027051847.GA7885@xxxxxx/
[2] https://lore.kernel.org/linux-xfs/ZitIK5OnR7ZNY0IG@xxxxxxxxxxxxx/

Co-developed-by: David Hildenbrand <david@xxxxxxxxxx>
Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
Signed-off-by: Pankaj Raghav <p.raghav@xxxxxxxxxxx>
---
  arch/x86/Kconfig        |  1 +
  include/linux/huge_mm.h | 18 ++++++++++++++++
  mm/Kconfig              | 21 +++++++++++++++++++
  mm/huge_memory.c        | 46 ++++++++++++++++++++++++++++++++++++++++-
  4 files changed, 85 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0ce86e14ab5e..8e2aa1887309 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -153,6 +153,7 @@ config X86
  	select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP	if X86_64
  	select ARCH_WANT_HUGETLB_VMEMMAP_PREINIT if X86_64
  	select ARCH_WANTS_THP_SWAP		if X86_64
+	select ARCH_WANTS_STATIC_HUGE_ZERO_FOLIO if X86_64
  	select ARCH_HAS_PARANOID_L1D_FLUSH
  	select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
  	select BUILDTIME_TABLE_SORT
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 7748489fde1b..78ebceb61d0e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -476,6 +476,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);

  extern struct folio *huge_zero_folio;
  extern unsigned long huge_zero_pfn;
+extern atomic_t huge_zero_folio_is_static;

I really don't love having globals like this. Please can we have a helper
function that tells you this rather than extern-ing it?

Also, we're not checking CONFIG_STATIC_HUGE_ZERO_FOLIO but are still
exposing this value, which a helper function would avoid as well.


  static inline bool is_huge_zero_folio(const struct folio *folio)
  {
@@ -494,6 +495,18 @@ static inline bool is_huge_zero_pmd(pmd_t pmd)

  struct folio *mm_get_huge_zero_folio(struct mm_struct *mm);
  void mm_put_huge_zero_folio(struct mm_struct *mm);
+struct folio *__get_static_huge_zero_folio(void);

Why are we declaring a static inline function prototype that we then
implement immediately below?

+
+static inline struct folio *get_static_huge_zero_folio(void)
+{
+	if (!IS_ENABLED(CONFIG_STATIC_HUGE_ZERO_FOLIO))
+		return NULL;
+
+	if (likely(atomic_read(&huge_zero_folio_is_static)))
+		return huge_zero_folio;
+
+	return __get_static_huge_zero_folio();
+}

  static inline bool thp_migration_supported(void)
  {
@@ -685,6 +698,11 @@ static inline int change_huge_pud(struct mmu_gather *tlb,
  {
  	return 0;
  }
+
+static inline struct folio *get_static_huge_zero_folio(void)
+{
+	return NULL;
+}
  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */

  static inline int split_folio_to_list_to_order(struct folio *folio,
diff --git a/mm/Kconfig b/mm/Kconfig
index e443fe8cd6cf..366a6d2d771e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -823,6 +823,27 @@ config ARCH_WANT_GENERAL_HUGETLB
  config ARCH_WANTS_THP_SWAP
  	def_bool n

+config ARCH_WANTS_STATIC_HUGE_ZERO_FOLIO
+	def_bool n
+
+config STATIC_HUGE_ZERO_FOLIO
+	bool "Allocate a PMD sized folio for zeroing"
+	depends on ARCH_WANTS_STATIC_HUGE_ZERO_FOLIO && TRANSPARENT_HUGEPAGE
+	help
+	  Without this config enabled, the huge zero folio is allocated on
+	  demand and freed under memory pressure once no longer in use.
+	  To detect remaining users reliably, references to the huge zero folio
+	  must be tracked precisely, so it is commonly only available for mapping
+	  it into user page tables.
+
+	  With this config enabled, the huge zero folio can also be used
+	  for other purposes that do not implement precise reference counting:
+	  it is still allocated on demand, but never freed, allowing for more
+	  widespread use, for example, when performing I/O similar to the
+	  traditional shared zeropage.
+
+	  Not suitable for memory constrained systems.
+
  config MM_ID
  	def_bool n

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ff06dee213eb..e117b280b38d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -75,6 +75,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
  static bool split_underused_thp = true;

  static atomic_t huge_zero_refcount;
+atomic_t huge_zero_folio_is_static __read_mostly;
  struct folio *huge_zero_folio __read_mostly;
  unsigned long huge_zero_pfn __read_mostly = ~0UL;
  unsigned long huge_anon_orders_always __read_mostly;
@@ -266,6 +267,45 @@ void mm_put_huge_zero_folio(struct mm_struct *mm)
  		put_huge_zero_folio();
  }

+#ifdef CONFIG_STATIC_HUGE_ZERO_FOLIO
+

Extremely tiny silly nit: there's a blank line below this, but not under the
#endif. Let's remove this line.

+struct folio *__get_static_huge_zero_folio(void)
+{
+	static unsigned long fail_count_clear_timer;
+	static atomic_t huge_zero_static_fail_count __read_mostly;
+
+	if (unlikely(!slab_is_available()))
+		return NULL;
+
+	/*
+	 * If we failed to allocate a huge zero folio, just refrain from
+	 * trying for one minute before retrying to get a reference again.
+	 */
+	if (atomic_read(&huge_zero_static_fail_count) > 1) {
+		if (time_before(jiffies, fail_count_clear_timer))
+			return NULL;
+		atomic_set(&huge_zero_static_fail_count, 0);
+	}

Yeah, I really don't like this. It seems overly complicated and too
fiddly. Also, if I want a static PMD, do I want to wait a minute for the
next attempt?

Also, doing things this way we might end up:

0. Enabling CONFIG_STATIC_HUGE_ZERO_FOLIO.
1. Not doing anything that needs a static PMD for a while, accumulating fragmentation.
2. Doing something that needs it - oops, we can't get an order-9 page, and we wait 60
    seconds between attempts.
3. This is silent, so you think you have it switched on but are actually getting
    bad performance.

I appreciate wanting to reuse this code, but we need to find a way to do this
really, really early, and get rid of this timeout. It's very arbitrary,
and we have no easy way of tracing how this might behave under a workload.

Also we end up pinning an order-9 page either way, so no harm in getting it
first thing?

What we could do, to avoid messing with memblock and having two ways of initializing a huge zero folio early, is allocate it at init time and just disable the shrinker.

The downside is that the page is then really static (allocated unconditionally, rather than only once it is actually used at least once). I like it:


diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0ce86e14ab5e1..8e2aa18873098 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -153,6 +153,7 @@ config X86
 	select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP	if X86_64
 	select ARCH_WANT_HUGETLB_VMEMMAP_PREINIT if X86_64
 	select ARCH_WANTS_THP_SWAP		if X86_64
+	select ARCH_WANTS_STATIC_HUGE_ZERO_FOLIO if X86_64
 	select ARCH_HAS_PARANOID_L1D_FLUSH
 	select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 	select BUILDTIME_TABLE_SORT
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 7748489fde1b7..ccfa5c95f14b1 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -495,6 +495,17 @@ static inline bool is_huge_zero_pmd(pmd_t pmd)
 struct folio *mm_get_huge_zero_folio(struct mm_struct *mm);
 void mm_put_huge_zero_folio(struct mm_struct *mm);
+static inline struct folio *get_static_huge_zero_folio(void)
+{
+	if (!IS_ENABLED(CONFIG_STATIC_HUGE_ZERO_FOLIO))
+		return NULL;
+
+	if (unlikely(!huge_zero_folio))
+		return NULL;
+
+	return huge_zero_folio;
+}
+
 static inline bool thp_migration_supported(void)
 {
 	return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
@@ -685,6 +696,11 @@ static inline int change_huge_pud(struct mmu_gather *tlb,
 {
 	return 0;
 }
+
+static inline struct folio *get_static_huge_zero_folio(void)
+{
+	return NULL;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
static inline int split_folio_to_list_to_order(struct folio *folio,
diff --git a/mm/Kconfig b/mm/Kconfig
index e443fe8cd6cf2..366a6d2d771e3 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -823,6 +823,27 @@ config ARCH_WANT_GENERAL_HUGETLB
 config ARCH_WANTS_THP_SWAP
 	def_bool n
+config ARCH_WANTS_STATIC_HUGE_ZERO_FOLIO
+	def_bool n
+
+config STATIC_HUGE_ZERO_FOLIO
+	bool "Allocate a PMD sized folio for zeroing"
+	depends on ARCH_WANTS_STATIC_HUGE_ZERO_FOLIO && TRANSPARENT_HUGEPAGE
+	help
+	  Without this config enabled, the huge zero folio is allocated on
+	  demand and freed under memory pressure once no longer in use.
+	  To detect remaining users reliably, references to the huge zero folio
+	  must be tracked precisely, so it is commonly only available for mapping
+	  it into user page tables.
+
+	  With this config enabled, the huge zero folio can also be used
+	  for other purposes that do not implement precise reference counting:
+	  it is allocated statically and never freed, allowing for more
+	  wide-spread use, for example, when performing I/O similar to the
+	  widespread use, for example, when performing I/O similar to the
+
+	  Not suitable for memory constrained systems.
+
 config MM_ID
 	def_bool n
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ff06dee213eb2..f65ba3e6f0824 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -866,9 +866,14 @@ static int __init thp_shrinker_init(void)
 	huge_zero_folio_shrinker->scan_objects = shrink_huge_zero_folio_scan;
 	shrinker_register(huge_zero_folio_shrinker);
-	deferred_split_shrinker->count_objects = deferred_split_count;
-	deferred_split_shrinker->scan_objects = deferred_split_scan;
-	shrinker_register(deferred_split_shrinker);
+	if (IS_ENABLED(CONFIG_STATIC_HUGE_ZERO_FOLIO)) {
+		if (!get_huge_zero_folio())
+			pr_warn("Allocating static huge zero folio failed\n");
+	} else {
+		deferred_split_shrinker->count_objects = deferred_split_count;
+		deferred_split_shrinker->scan_objects = deferred_split_scan;
+		shrinker_register(deferred_split_shrinker);
+	}
 	return 0;
 }
--
2.50.1


Now, one thing I do not like is that we have "ARCH_WANTS_STATIC_HUGE_ZERO_FOLIO" but
then have a user-selectable option.

Should we just get rid of ARCH_WANTS_STATIC_HUGE_ZERO_FOLIO?

--
Cheers,

David / dhildenb




