On Fri, Sep 5, 2025 at 12:39 PM <jane.chu@xxxxxxxxxx> wrote:
>
>
> On 9/5/2025 11:17 AM, Luck, Tony wrote:
> > BIOS can supply a GHES error record that reports that the corrected
> > error threshold has been exceeded. Linux will attempt to soft offline
> > the page in response.
> >
> > But "exceeded threshold" has many interpretations. Some BIOS versions
> > accumulate error counts per-rank, and then report threshold exceeded
> > when the number of errors crosses a threshold for the rank. Taking
> > a page offline in this case is unlikely to solve any problems. But
> > losing a 4KB page will have little impact on the overall system.

Hi Tony,

This is exactly the problem I encountered [1], and I agree with Jane
that disabling soft offline via /proc/sys/vm/enable_soft_offline should
work for your case.

[1] https://lore.kernel.org/all/20240628205958.2845610-3-jiaqiyan@xxxxxxxxxx/T/#me8ff6bc901037e853d61d85d96aa3642cbd93b86
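For completeness, a minimal sketch of the userspace side (assuming your
kernel carries the enable_soft_offline sysctl added by 56374430c5dfc;
the sysctl.conf line is just one common way to make it persistent):

    # Disable soft offline system-wide (default is 1 = enabled).
    echo 0 > /proc/sys/vm/enable_soft_offline

    # Optionally persist the setting across reboots.
    echo "vm.enable_soft_offline = 0" >> /etc/sysctl.conf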
> > On the other hand, taking a huge page offline will have significant
> > impact (and still not solve any problems).
> >
> > Check if the GHES record refers to a huge page. Skip the offline
> > process if the page is huge.
> >
> > Reported-by: Shawn Fan <shawn.fan@xxxxxxxxx>
> > Signed-off-by: Tony Luck <tony.luck@xxxxxxxxx>
> > ---
> >
> > Change since v2:
> >
> > Me: Add a sanity check on the address (pfn) that the BIOS provided. It
> > might be in some reserved area that doesn't have a "struct page", which
> > would likely result in an oops if fed to pfn_folio().
> >
> > The original code relied on a sanity check of the pfn received from the
> > BIOS when this eventually feeds into memory_failure(). That used to
> > result in:
> >         pr_err("%#lx: memory outside kernel control\n", pfn);
> > which won't happen with this change, since memory_failure() is not
> > called. Was that a useful message? A Google search mostly shows
> > references to the code. There are few instances of people reporting
> > that they saw this message.
> >
> >
> >  drivers/acpi/apei/ghes.c | 13 +++++++++++--
> >  1 file changed, 11 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> > index a0d54993edb3..c2fc1196438c 100644
> > --- a/drivers/acpi/apei/ghes.c
> > +++ b/drivers/acpi/apei/ghes.c
> > @@ -540,8 +540,17 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
> >
> >          /* iff following two events can be handled properly by now */
> >          if (sec_sev == GHES_SEV_CORRECTED &&
> > -            (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
> > -                flags = MF_SOFT_OFFLINE;
> > +            (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED)) {
> > +                unsigned long pfn = PHYS_PFN(mem_err->physical_addr);
> > +
> > +                if (pfn_valid(pfn)) {
> > +                        struct folio *folio = pfn_folio(pfn);
> > +
> > +                        /* Only try to offline non-huge pages */
> > +                        if (!folio_test_hugetlb(folio))
> > +                                flags = MF_SOFT_OFFLINE;
> > +                }
> > +        }
> >          if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
> >                  flags = sync ? MF_ACTION_REQUIRED : 0;
>
> So the issue is the result of an inaccurate MCA record about the per-rank
> CE threshold being crossed. If the OS offlines the indicted page, it might
> be signaled to offline another 4K page in the same rank upon access.
>
> Both the MCA and the offline operation are performance hitters, and as
> this patch argues, offlining doesn't help beyond losing an
> already-corrected page.
>
> Here we choose to bypass the hugetlb page simply because it's huge. Is it
> possible to argue that because the page is huge, it's less likely to get
> another MCA on another page from the same rank?
>
> A while back, the patch
>   56374430c5dfc mm/memory-failure: userspace controls soft-offlining pages
> provided userspace control over whether to soft offline. Could it be a
> preferable option?
>
> I don't know. The patch itself is fine; it's the issue that it has exposed
> that is more concerning.
>
> thanks,
> -jane