Re: [PATCH] ghes: Track number of recovered hardware errors

On 2025/7/16 01:25, Breno Leitao wrote:
Hello Shuai,

On Tue, Jul 15, 2025 at 09:46:03PM +0800, Shuai Xue wrote:
It would be really good to sync with other cloud providers here so that we can
have one solution which fits all. Lemme CC some other folks I know who do
cloud gunk and leave the whole mail for their pleasure.

Newly CCed folks, you know how to find the whole discussion. :-)

Thx.


For the purpose of counting, how about using the cmdline of rasdaemon?

How do you manage it across a large fleet of hosts? Do you have rasdaemon
always logging, and how do you correlate with kernel crashes? At Meta, we
have a "clues" tag for each crash, and one of the tags is Machine
Check Exception (MCE), which is parsed from dmesg right now (with the
regexp I shared earlier).

We deploy rasdaemon on each individual node, and then collect the
rasdaemon logs centrally. At the same time, we collect out-of-band
error logs. We aggregate and count the types and occurrences of errors,
and finally use empirical thresholds for operational alerts. The crash
analysis service consumes these alert messages.
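
As a rough illustration of the aggregate-and-threshold step described above,
here is a minimal Python sketch. The record format (host, error type), the
error type names, the threshold values, and the function names are all
hypothetical; this is not rasdaemon's output format, nor an actual production
pipeline.

```python
from collections import Counter

# Hypothetical per-error-type alert thresholds. Real values would be
# empirical; these numbers are made up for illustration.
THRESHOLDS = {"memory_ce": 3, "pcie_aer": 2, "cpu_mce": 1}


def count_errors(records):
    """Count occurrences per (host, error_type) pair.

    `records` is an iterable of (host, error_type) tuples, assumed to be
    already parsed from the centrally collected logs.
    """
    return Counter(records)


def alerts(counts, thresholds=THRESHOLDS):
    """Return sorted (host, error_type, count) tuples at or above threshold."""
    hits = []
    for (host, etype), n in counts.items():
        limit = thresholds.get(etype)
        # Error types without a configured threshold never alert.
        if limit is not None and n >= limit:
            hits.append((host, etype, n))
    return sorted(hits)


if __name__ == "__main__":
    sample = [
        ("node1", "memory_ce"), ("node1", "memory_ce"), ("node1", "memory_ce"),
        ("node2", "pcie_aer"),
        ("node2", "cpu_mce"),
    ]
    for host, etype, n in alerts(count_errors(sample)):
        print(f"ALERT {host} {etype} count={n}")
```

In this sketch the alert tuples are what a downstream crash analysis service
would consume; the counts are kept per error type so that each type can have
its own threshold.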


My plan with this patch is to have a counter for hardware errors that
would be exposed to the crashdump. So, post-mortem analysis tooling can
easily query whether there are hardware errors and look up RAS information in
the right databases, in case it looks like a smoking gun.

I see your point. But does using a single ghes_recovered_errors counter
to track all corrected and non-fatal errors for CPU, memory, and PCIe
really help?
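
To make the concern concrete, here is a minimal Python sketch (the source
names and event lists are hypothetical, not taken from the patch): two hosts
with very different error profiles report the same value through a single
counter, while per-source counters keep them apart.

```python
from collections import Counter

# Hypothetical recovered-error events, labelled by source.
host_a = ["memory"] * 5
host_b = ["cpu", "pcie", "pcie", "memory", "cpu"]


def single_counter(events):
    """One counter for everything: only the total survives."""
    return len(events)


def per_source_counters(events):
    """One counter per error source: the breakdown survives."""
    return Counter(events)


if __name__ == "__main__":
    # Both hosts look identical through the single counter.
    print(single_counter(host_a), single_counter(host_b))
    # The per-source view distinguishes them.
    print(dict(per_source_counters(host_a)))
    print(dict(per_source_counters(host_b)))
```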


Do you have any experience with this type of automatic correlation?

Please see my reply above.


Thanks for your insights,
--breno

Thanks.
Shuai




