Hello Shuai,

On Tue, Jul 15, 2025 at 09:46:03PM +0800, Shuai Xue wrote:
> > It would be really good to sync with other cloud providers here so that we can
> > do this one solution which fits all. Lemme CC some other folks I know who do
> > cloud gunk and leave the whole mail for their pleasure.
> >
> > Newly CCed folks, you know how to find the whole discussion. :-)
> >
> > Thx.
> >
> For the purpose of counting, how about using the cmdline of rasdaemon?

How do you manage it across a large fleet of hosts? Do you have rasdaemon
logging all the time, and how do you correlate its output with kernel
crashes?

At Meta, we have a "clues" tag for each crash, and one of the tags is
Machine Check Exception (MCE), which is parsed from dmesg right now (with
the regexp I shared earlier).

My plan with this patch is to have a counter for hardware errors that
would be exposed to the crashdump. That way, post-mortem analysis tooling
can easily check whether there were hardware errors and, if that looks
like a smoking gun, query the RAS information in the right databases.

Do you have any experience with this type of automatic correlation?

Thanks for your insights,
--breno