Re: [PATCH 0/2] panic: taint flag for recoverable hardware errors

Borislav Petkov <bp@xxxxxxxxx> · Fri, 4 Jul 2025 13:19:54 +0200

On Fri, Jul 04, 2025 at 03:55:18AM -0700, Breno Leitao wrote:
> Add a new taint flag to the kernel (HW_ERROR_RECOVERED - for the lack of
> a better name) that gets set whenever the kernel detects and recovers
> from hardware errors.
> 
> The taint provides additional context during crash investigation *without*
> implying that crashes are necessarily caused by hardware failures
> (similar to how PROPRIETARY_MODULE taint works). It is just an extra
> information that will provide more context about that machine.

Dunno, looks like a hack to me to serve your purpose only.

Because once this goes upstream, people will start wanting to taint the
kernel for *every* *single* correctable error.

So even if an error got corrected, the kernel will be tainted.

Then users will say, oh oh, my kernel is tainted, I need to replace my hw
because it is broken. Even if it isn't broken in the least.

Basically what we're doing with drivers/ras/cec.c will be undone.

All because you want to put a bit of information somewhere that the machine
had a recoverable error.

Well, that bit of information is in your own RAS logs, no? I presume you log
hw errors in a big fleet and then you analyze those logs when the machine
bombs. So a mere look at those logs will tell you that you had hw errors.

And mind you, that proposed solution does not help people who want to know
what the errors were: "Oh look, my kernel got tainted because of hw errors. Now
where are those errors?"
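
To illustrate that point, here is a minimal userspace sketch, not part of the
patch set. It decodes a taint mask such as the one in
/proc/sys/kernel/tainted. The table lists only a few of the existing taint
bits, and the bit number used for the proposed flag is purely hypothetical
(the real value would come from the patches). Whatever the bit ends up being,
decoding it yields a single yes/no fact and nothing about which errors
occurred:

```python
# A few existing taint bits (see Documentation/admin-guide/tainted-kernels.rst).
TAINT_BITS = {
    0: ("P", "proprietary module was loaded"),
    4: ("M", "processor reported a Machine Check Exception"),
    7: ("D", "kernel died recently (OOPS or BUG)"),
    9: ("W", "kernel issued a warning"),
    12: ("O", "out-of-tree module was loaded"),
}

# HYPOTHETICAL: bit number and letter are invented for this illustration only;
# they are not taken from the proposed patches.
HYPOTHETICAL_HW_ERROR_BIT = 30
HYPOTHETICAL_ENTRY = ("?", "recoverable hardware error was seen (hypothetical)")


def read_taint_mask(path="/proc/sys/kernel/tainted"):
    """Read the current taint mask as an integer."""
    with open(path) as f:
        return int(f.read().strip())


def decode_taint(mask, table=TAINT_BITS):
    """Return (letter, description) for every known bit set in mask."""
    return [entry for bit, entry in sorted(table.items()) if mask & (1 << bit)]


if __name__ == "__main__":
    table = dict(TAINT_BITS)
    table[HYPOTHETICAL_HW_ERROR_BIT] = HYPOTHETICAL_ENTRY
    # Pretend the hypothetical flag is set: the result is one line of text,
    # with no address, bank, count or timestamp for the underlying errors.
    for letter, desc in decode_taint(1 << HYPOTHETICAL_HW_ERROR_BIT, table):
        print(letter, desc)
```

The details of each error would still have to come from the RAS logs.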

So I think this is just adding redundant information which we already have
somewhere else and which can also actively mislead users.

IOW, no need to taint - you simply want to put a bit of info in the kdump
blob, which gets dumped by the second kernel, saying that the first kernel
experienced hw errors. That is, if you don't log hw errors. But you should...!

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette