Re: [PATCH] ghes: Track number of recovered hardware errors

On Wed, 16 Jul 2025 10:05:27 +0800,
Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx> wrote:

> On 2025/7/15 23:09, Borislav Petkov wrote:
> > On Tue, Jul 15, 2025 at 09:46:03PM +0800, Shuai Xue wrote:  
> >> For the purpose of counting, how about using the cmdline of rasdaemon?  
> > 
> > That would mean you have to run rasdaemon on those machines before they
> > explode and then carve out the rasdaemon db from the coredump (this is
> > post-mortem analysis).  
> 
> Rasdaemon is a userspace tool that collects all hardware error
> events reported by the Linux kernel from several sources (EDAC, MCE,
> PCI, ...) into one common framework. It has been a standard tool
> at Alibaba. As far as I know, Twitter also uses rasdaemon in production.

There are several others using rasdaemon, afaict. It was originally
implemented due to a demand from supercomputer customers with thousands
of nodes in the US, and it has shipped on major distros for quite a while.

> 
> > 
> > I would love for rasdaemon to log over the network and then other tools can
> > query those centralized logs but that has its own challenges...
> >   
> 
> I also prefer collecting rasdaemon data in a centralized data center,
> as this is more useful for applying big data analytics to analyze and
> predict errors. At the same time, the central side also uses
> rasdaemon logs as one of the references for machine operations and
> maintenance.
> 
> As for rasdaemon itself, it is just a single-node event collector and
> database, although it also prints logs. In practice, we use SLS [1]
> to collect rasdaemon text logs from individual nodes and parse them on
> the central side.

Well, rasdaemon already uses SQL commands to store events in its SQLite database.
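That local database can also be queried directly for post-mortem analysis. A minimal sketch, assuming the usual defaults: the path /var/lib/rasdaemon/ras-mc_event.db and the mc_event table name may differ per distro or rasdaemon version, so treat both as assumptions and check the actual schema first.

```python
# Sketch: count events recorded in one table of rasdaemon's local
# SQLite database. Path and table name are the common defaults but
# are assumptions here; verify them with ".tables" in the sqlite3
# shell before relying on this.
import sqlite3

def count_events(db_path="/var/lib/rasdaemon/ras-mc_event.db",
                 table="mc_event"):
    """Return the number of rows (recorded events) in one table."""
    con = sqlite3.connect(db_path)
    try:
        (count,) = con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    finally:
        con.close()
    return count
```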

It shouldn't be hard to add a patch series that optionally writes to a
centralized database directly. My only concern is that delivering logs to
an external database from a machine that is experiencing hardware errors
can be problematic and may end up losing events.

Also, supporting different databases can be problematic due to the
libraries they require. The last time I wrote code to write to an Oracle
DB (a lifetime ago), the number of required libraries was huge, and
changing the link order with "-l" caused ld to not find the right
objects. It was messy. That said, supporting MySQL and PostgreSQL is not
that hard.

Perhaps a good compromise would be to add logic there to open a local
socket or a TCP socket to a logger daemon, sending the events asynchronously
after storing them locally in SQLite. Then, write a Python script using
SQLAlchemy. This way, we get support for several different databases for free.
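The forwarding half of that idea could look roughly like the sketch below: replay events already persisted in the local SQLite store into a central database through SQLAlchemy, which selects the backend driver from the connection URL. The table name, column names, and central-database URL are all illustrative assumptions, not rasdaemon's actual schema.

```python
# Sketch of the proposed forwarder (assumed schema, hypothetical
# central URL): read events that rasdaemon already stored locally
# and insert them into a central database via SQLAlchemy, so the
# same code works for PostgreSQL, MySQL, etc.
import sqlite3

from sqlalchemy import create_engine, text

LOCAL_DB = "/var/lib/rasdaemon/ras-mc_event.db"        # common default path
CENTRAL_URL = "postgresql://ras:secret@central-db/ras"  # hypothetical

def forward_events(local_db=LOCAL_DB, central_url=CENTRAL_URL):
    # Read events from the local SQLite store; losing the central
    # link never loses events, since SQLite remains the source of truth.
    src = sqlite3.connect(local_db)
    rows = src.execute("SELECT id, timestamp, err_msg FROM mc_event").fetchall()
    src.close()
    if not rows:
        return 0

    # create_engine() picks the driver from the URL scheme, so the
    # backend can be swapped without touching this code.
    engine = create_engine(central_url)
    with engine.begin() as conn:
        conn.execute(
            text("INSERT INTO mc_event (id, timestamp, err_msg) "
                 "VALUES (:id, :timestamp, :err_msg)"),
            [{"id": r[0], "timestamp": r[1], "err_msg": r[2]} for r in rows],
        )
    return len(rows)
```

A real version would track the last-forwarded row id so each run sends only new events, and would run asynchronously so a slow or unreachable central database never blocks local collection.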


Thanks,
Mauro
