Re: [PATCH] ghes: Track number of recovered hardware errors

On 2025/7/15 20:53, Borislav Petkov wrote:
On Tue, Jul 15, 2025 at 05:02:39AM -0700, Breno Leitao wrote:
Hello Borislav,

On Tue, Jul 15, 2025 at 12:31:25PM +0200, Borislav Petkov wrote:
On Tue, Jul 15, 2025 at 03:20:35AM -0700, Breno Leitao wrote:
For instance, if every investigation (as you suggested above) takes just
a couple of minutes, there simply wouldn't be enough hours in the day,
even working 24x7, to keep up with the volume.

Well, first of all, it would help considerably if you put the use case in the
commit message.

Sorry, my bad. I can do better if we decide that this is worth pursuing.

Then, are you saying that when examining kernel crashes, you don't look at
the kernel messages? I find that hard to believe.

We absolutely do examine kernel messages when investigating crashes, and
over time we've developed an extensive set of regular expressions to
identify relevant errors.

In practice, what you're describing is very similar to the workflow we
already use. For example, here are just a few of the regex patterns we
match in dmesg, grouped by category:

     (r"Machine check: Processor context corrupt", "cpu"),
     (r"Kernel panic - not syncing: Panicing machine check CPU died", "cpu"),
     (r"Machine check: Data load in unrecoverable area of kernel", "memory"),
     (r"Instruction fetch error in kernel", "memory"),
     (r"\[Hardware Error\]: +section_type: memory error", "memory"),
     (r"EDAC skx MC\d: HANDLING MCE MEMORY ERROR", "memory"),
     (r"\[Hardware Error\]:   section_type: general processor error", "cpu"),
     (r"UE memory read error on", "memory"),

And that’s just a partial list. We have 26 regexps for various issues,
and I wouldn’t be surprised if other large operators use a similar
approach.
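
For reference, the userspace side of this workflow can be sketched in a few
lines of Python. This is only an illustration of the approach, not our actual
tooling; the pattern list is a subset of the one quoted above, and the
`classify` helper name is made up:

```python
import re
from collections import Counter

# Subset of the dmesg patterns quoted above, paired with their category.
PATTERNS = [
    (r"Machine check: Processor context corrupt", "cpu"),
    (r"Machine check: Data load in unrecoverable area of kernel", "memory"),
    (r"\[Hardware Error\]: +section_type: memory error", "memory"),
    (r"UE memory read error on", "memory"),
]

def classify(dmesg_lines):
    """Count hardware-error lines per category using the regex table."""
    counts = Counter()
    for line in dmesg_lines:
        for pattern, category in PATTERNS:
            if re.search(pattern, line):
                counts[category] += 1
                break  # first matching pattern wins
    return counts

sample = [
    "[Hardware Error]:  section_type: memory error",
    "mce: Machine check: Processor context corrupt",
    "systemd[1]: Started session.",
]
print(classify(sample))  # one "memory" hit, one "cpu" hit
```

The fragility Breno describes is visible even in this sketch: any rewording
of a kernel message silently stops matching, which is exactly what an
in-kernel counter would avoid.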

While this system mostly works, there are real advantages to
consolidating this logic in the kernel itself, as I’m proposing:

     * Reduces the risk of mistakes
     	- Less chance of missing changes or edge cases.

     * Centralizes effort
	- Users don’t have to maintain their own lists; the logic lives
	  closer to the source of truth.

     * Simplifies maintenance
	- Avoids the constant need to update regexps if message strings
	  change.

     * Easier validation
	- It becomes straightforward to cross-check that all relevant
	  messages are being captured.

     * Automatic accounting
	- Any new or updated messages are immediately reflected.

     * Lower postmortem overhead
	- Requires less supporting infrastructure for crash analysis.

     * Netconsole support
	- Makes this status data available via netconsole, which is
	  helpful for those users.

Yap, this is more like it. Those sound to me like good reasons to have this
additional logging.

It would be really good to sync with other cloud providers here so that we can
do this one solution which fits all. Lemme CC some other folks I know who do
cloud gunk and leave the whole mail for their pleasure.

Newly CCed folks, you know how to find the whole discussion. :-)

Thx.


For the purpose of counting, how about using the cmdline of rasdaemon?

$ ras-mc-ctl --summary
Memory controller events summary:
Uncorrected on DIMM Label(s): 'SOCKET 1 CHANNEL 1 DIMM 0 DIMM1' location: 0:18:-1:-1 errors: 1

PCIe AER events summary:
        2 Uncorrected (Non-Fatal) errors: Completion Timeout

ARM processor events summary:
        CPU(mpidr=0x81090100) has 1 errors
        CPU(mpidr=0x810e0000) has 1 errors
        CPU(mpidr=0x81180000) has 1 errors
        CPU(mpidr=0x811a0000) has 1 errors
        CPU(mpidr=0x811c0000) has 1 errors
        CPU(mpidr=0x811d0300) has 1 errors
        CPU(mpidr=0x811f0100) has 1 errors
        CPU(mpidr=0x81390300) has 1 errors
        CPU(mpidr=0x813a0200) has 1 errors

No devlink errors.
Disk errors summary:
        0:0 has 60 errors
        0:2048 has 7 errors
        0:66304 has 2162 errors
Memory failure events summary:
        Recovered errors: 24

@Breno, Is rasdaemon not enough for your needs?
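
Where rasdaemon is already deployed, the recovered-error count could also be
scraped from the summary output above. A minimal sketch, assuming the field
layout shown in the sample output (the `recovered_errors` helper name is made
up):

```python
import re

# Trimmed sample of `ras-mc-ctl --summary` output, as quoted above.
SUMMARY = """\
Memory controller events summary:
Uncorrected on DIMM Label(s): 'SOCKET 1 CHANNEL 1 DIMM 0 DIMM1' location: 0:18:-1:-1 errors: 1

Disk errors summary:
        0:0 has 60 errors

Memory failure events summary:
        Recovered errors: 24
"""

def recovered_errors(summary_text):
    """Extract the memory-failure 'Recovered errors' count from the summary."""
    m = re.search(r"Recovered errors:\s*(\d+)", summary_text)
    return int(m.group(1)) if m else 0

print(recovered_errors(SUMMARY))  # 24
```

Of course, this is still text scraping, so it inherits the same
message-format fragility as the dmesg regexps discussed earlier in the
thread.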


AFAICS, it is easier to add more statistical metrics in rasdaemon, as in PR 205 [1]. It is also easier to roll out releases and changes for a userspace daemon in a production environment than for the kernel.


Thanks.
Shuai

[1] https://github.com/mchehab/rasdaemon/pull/205/commits/391d67bc7d17443d00db96850e56770451126a0e




