Re: [PATCH] ghes: Track number of recovered hardware errors

Breno Leitao <leitao@xxxxxxxxxx> · Wed, 16 Jul 2025 05:42:57 -0700

Hello Shuai,

On Wed, Jul 16, 2025 at 11:04:28AM +0800, Shuai Xue wrote:
> > My plan with this patch is to have a counter for hardware errors that
> > would be exposed to the crashdump. So, post-mortem analysis tooling can
> > easily query if there are hardware errors and query RAS information in
> > the right databases, in case it looks like a smoking gun.
> 
> I see your point. But does using a single ghes_recovered_errors counter
> to track all corrected and non-fatal errors for CPU, memory, and PCIe
> really help?

It provides a quick indication that hardware issues have occurred, which
can prompt the operator to investigate further via RAS events.

That said, Tony proposed a more robust approach: categorizing and
tracking errors by their source. This would involve maintaining a
separate counter for each source, indexed by an enum value:

	enum recovered_error_sources {
		ERR_GHES,
		ERR_MCE,
		ERR_AER,
		...
		ERR_NUM_SOURCES
	};

See more at: https://lore.kernel.org/all/aHWC-J851eaHa_Au@agluck-desk3/
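
To make the idea concrete, here is a rough, untested userspace sketch of
per-source counters. Only the enum names come from this thread; the array
recovered_errors and the helpers recovered_error_inc() and
recovered_error_read() are hypothetical names I made up for illustration,
and C11 atomics stand in for whatever the kernel side would really use.
The list of sources is limited to the three named above.

```c
#include <stdatomic.h>

/* Sources named in this thread; the real list may be longer. */
enum recovered_error_sources {
	ERR_GHES,
	ERR_MCE,
	ERR_AER,
	ERR_NUM_SOURCES
};

/* Hypothetical: one counter per source, zero-initialized (static storage). */
static atomic_uint recovered_errors[ERR_NUM_SOURCES];

/* Hypothetical helper: bump the counter for one source. */
static void recovered_error_inc(enum recovered_error_sources src)
{
	/* Ignore out-of-range values instead of writing past the array. */
	if ((unsigned int)src >= ERR_NUM_SOURCES)
		return;

	atomic_fetch_add_explicit(&recovered_errors[src], 1,
				  memory_order_relaxed);
}

/* Hypothetical helper: read the counter for one source (0 if out of range). */
static unsigned int recovered_error_read(enum recovered_error_sources src)
{
	if ((unsigned int)src >= ERR_NUM_SOURCES)
		return 0;

	return atomic_load_explicit(&recovered_errors[src],
				    memory_order_relaxed);
}
```

With a layout like this, post-mortem tooling would only need to find one
array and index it by source, rather than interpret a single combined
number.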

Do you think this would help you by any chance?

Thanks
--breno