* Tony Luck <tony.luck@xxxxxxxxx> [250904 11:57]: > BIOS can supply a GHES error record that reports that the corrected > error threshold has been exceeded. Linux will attempt to soft offline > the page in response. > > But "exceeded threshold" has many interpretations. Some BIOS versions > accumulate error counts per-rank, and then report threshold exceeded > when the number of errors crosses a threshold for the rank. Taking > a page offline in this case is unlikely to solve any problems. But > losing a 4KB page will have little impact on the overall system. > > On the other hand, taking a huge page offline will have significant > impact (and still not solve any problems). > > Check if the GHES record refers to a huge page. Skip the offline > process if the page is huge. > > Reported-by: Shawn Fan <shawn.fan@xxxxxxxxx> > Signed-off-by: Tony Luck <tony.luck@xxxxxxxxx> > --- > drivers/acpi/apei/ghes.c | 10 +++++++++- > 1 file changed, 9 insertions(+), 1 deletion(-) > > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c > index a0d54993edb3..bacfebdd4969 100644 > --- a/drivers/acpi/apei/ghes.c > +++ b/drivers/acpi/apei/ghes.c > @@ -540,8 +540,16 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, > > /* iff following two events can be handled properly by now */ > if (sec_sev == GHES_SEV_CORRECTED && > - (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED)) > + (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED)) { > + unsigned long pfn = PHYS_PFN(mem_err->physical_addr); > + struct page *page = pfn_to_page(pfn); > + struct folio *folio = page_folio(page); > + > + if (folio_test_hugetlb(folio)) > + return false; > + > flags = MF_SOFT_OFFLINE; > + } > if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE) It is a bit odd that you return false vs not set the flags and continue the checks. The control flow may cause issues if other checks are added, and only if a huge page is hit, which will be difficult to debug. Looking at the code, it seems fine now, but when we don't return false (as you have added) it looks impossible to reach the line below anyways, so should it be an else if? I guess this is not strictly an issue with your patch. And if it doesn't solve anything anyways, why aren't we just skipping it altogether and maybe logging it? > flags = sync ? MF_ACTION_REQUIRED : 0; > > -- > 2.51.0 >