On 9/5/2025 11:17 AM, Luck, Tony wrote:
BIOS can supply a GHES error record that reports that the corrected
error threshold has been exceeded. Linux will attempt to soft offline
the page in response.
But "exceeded threshold" has many interpretations. Some BIOS versions
accumulate error counts per-rank, and then report threshold exceeded
when the number of errors crosses a threshold for the rank. Taking
a page offline in this case is unlikely to solve any problems. But
losing a 4KB page will have little impact on the overall system.
On the other hand, taking a huge page offline will have significant
impact (and still not solve any problems).
Check if the GHES record refers to a huge page. Skip the offline
process if the page is huge.
Reported-by: Shawn Fan <shawn.fan@xxxxxxxxx>
Signed-off-by: Tony Luck <tony.luck@xxxxxxxxx>
---
Changes since v2:
Me: Add a sanity check on the address (pfn) that the BIOS provided. It
might be in some reserved area that doesn't have a "struct page", which
would likely result in an oops if fed to pfn_folio().
The original code relied on the sanity check of the pfn received from the
BIOS when it eventually fed into memory_failure(). That used to
result in:
pr_err("%#lx: memory outside kernel control\n", pfn);
which won't happen with this change, since memory_failure() is not
called. Was that a useful message? A Google search mostly shows
references to the code. There are few instances of people reporting
they saw this message.
drivers/acpi/apei/ghes.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index a0d54993edb3..c2fc1196438c 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -540,8 +540,17 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
 
 	/* iff following two events can be handled properly by now */
 	if (sec_sev == GHES_SEV_CORRECTED &&
-	    (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
-		flags = MF_SOFT_OFFLINE;
+	    (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED)) {
+		unsigned long pfn = PHYS_PFN(mem_err->physical_addr);
+
+		if (pfn_valid(pfn)) {
+			struct folio *folio = pfn_folio(pfn);
+
+			/* Only try to offline non-huge pages */
+			if (!folio_test_hugetlb(folio))
+				flags = MF_SOFT_OFFLINE;
+		}
+	}
 	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
 		flags = sync ? MF_ACTION_REQUIRED : 0;
 
So the issue is the result of an inaccurate MCA record about a per-rank CE
threshold being crossed. If the OS offlines the indicted page, it might
still be signaled to offline another 4K page in the same rank on a later
access. Both the MCA and the offline operation are performance hits, and,
as this patch argues, offlining doesn't help; it only loses an already
corrected page.
Here we choose to bypass a hugetlb page simply because it's huge. Is it
possible to argue that, because the page is huge, it's less likely that
another MCA will hit another page from the same rank?
A while back, this patch
56374430c5dfc mm/memory-failure: userspace controls soft-offlining pages
provided userspace control over whether to soft offline. Could that be a
preferable option?
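For context: if I remember right, that series added a vm.enable_soft_offline
sysctl, so a RAS agent or admin policy can already veto soft offline
globally. A minimal userspace sketch, assuming that knob name and path:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/*
	 * Path assumed from the 56374430c5dfc sysctl; writing "0" is
	 * expected to reject soft offline requests, "1" (the default)
	 * to allow them.
	 */
	int fd = open("/proc/sys/vm/enable_soft_offline", O_WRONLY);

	if (fd < 0) {
		perror("open /proc/sys/vm/enable_soft_offline");
		return 1;
	}
	if (write(fd, "0", 1) != 1)
		perror("write");
	close(fd);
	return 0;
}

Though, as far as I recall, that's a global policy knob rather than a
per-record decision like the hugetlb check in this patch.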
I don't know whether that would be the better answer. The patch itself is
fine; it's the issue that it has exposed that is more concerning.
thanks,
-jane