Re: [PATCH] ghes: Track number of recovered hardware errors

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



hello Shuai,

On Wed, Jul 16, 2025 at 11:04:28AM +0800, Shuai Xue wrote:
> > My plan with this patch is to have a counter for hardware errors that
> > would be exposed to the crashdump. So, post-morten analyzes tooling can
> > easily query if there are hardware errors and query RAS information in
> > the right databases, in case it seems a smoking gun.
> 
> I see your point. But does using a single ghes_recovered_errors counter
> to track all corrected and non-fatal errors for CPU, memory, and PCIe
> really help?

It provides a quick indication that hardware issues have occurred, which
can prompt the operator to investigate further via RAS events.

That said, Tony proposed a more robust approach—categorizing and
tracking errors by their source. This would involve maintaining separate
counters for each source using an counter per enum type:

	enum recovered_error_sources {
		ERR_GHES,
		ERR_MCE,
		ERR_AER,
		...
		ERR_NUM_SOURCES
	};

See more at: https://lore.kernel.org/all/aHWC-J851eaHa_Au@agluck-desk3/

Do you think this would help you by any chance?

Thanks
--breno




[Index of Archives]     [Linux IBM ACPI]     [Linux Power Management]     [Linux Kernel]     [Linux Laptop]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Video 4 Linux]     [Device Mapper]     [Linux Resources]
  Powered by Linux