Hello Shuai,

On Wed, Jul 16, 2025 at 11:04:28AM +0800, Shuai Xue wrote:
> > My plan with this patch is to have a counter for hardware errors that
> > would be exposed to the crashdump. So, post-morten analyzes tooling can
> > easily query if there are hardware errors and query RAS information in
> > the right databases, in case it seems a smoking gun.
>
> I see your point. But does using a single ghes_recovered_errors counter
> to track all corrected and non-fatal errors for CPU, memory, and PCIe
> really help?

It provides a quick indication that hardware issues have occurred, which
can prompt the operator to investigate further via RAS events.

That said, Tony proposed a more robust approach: categorizing and
tracking errors by their source. This would involve maintaining a
separate counter for each source, with one counter per enum value:

	enum recovered_error_sources {
		ERR_GHES,
		ERR_MCE,
		ERR_AER,
		...
		ERR_NUM_SOURCES
	};

See more at:
https://lore.kernel.org/all/aHWC-J851eaHa_Au@agluck-desk3/

Do you think this would help you by any chance?

Thanks
--breno
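
For illustration only, here is a rough userspace sketch of how per-source
counters indexed by that enum might look. It is not taken from Tony's
proposal or from any patch: the names recovered_errors,
recovered_error_inc() and recovered_error_read() are hypothetical, the
elided "..." sources are simply left out, and C11 <stdatomic.h> stands in
for whatever atomic type the kernel code would actually use.

```c
#include <stdatomic.h>

/* Sources named in the mail; the elided "..." entries are omitted here. */
enum recovered_error_sources {
	ERR_GHES,
	ERR_MCE,
	ERR_AER,
	ERR_NUM_SOURCES
};

/* Hypothetical name: one counter per source, zero-initialized. */
static atomic_uint recovered_errors[ERR_NUM_SOURCES];

/* Hypothetical helper: account one recovered error for a given source. */
static void recovered_error_inc(enum recovered_error_sources src)
{
	/* Ignore out-of-range values instead of writing past the array. */
	if ((unsigned int)src < ERR_NUM_SOURCES)
		atomic_fetch_add(&recovered_errors[src], 1);
}

/* Hypothetical helper: read one counter; returns 0 for unknown sources. */
static unsigned int recovered_error_read(enum recovered_error_sources src)
{
	if ((unsigned int)src >= ERR_NUM_SOURCES)
		return 0;
	return atomic_load(&recovered_errors[src]);
}
```

With a layout like this, a post-mortem tool would only need the array and
the enum to tell which subsystem reported recovered errors, rather than a
single aggregate number.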