On Wed, 16 Jul 2025 10:05:27 +0800, Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx> wrote:

> On 2025/7/15 23:09, Borislav Petkov wrote:
> > On Tue, Jul 15, 2025 at 09:46:03PM +0800, Shuai Xue wrote:
> >> For the purpose of counting, how about using the cmdline of rasdaemon?
> >
> > That would mean you have to run rasdaemon on those machines before
> > they explode and then carve out the rasdaemon db from the coredump
> > (this is post-mortem analysis).
>
> Rasdaemon is a userspace tool that collects all hardware error events
> reported by the Linux kernel from several sources (EDAC, MCE, PCI, ...)
> into one common framework. It has become a standard tool at Alibaba.
> As far as I know, Twitter also uses rasdaemon in its production.

There are several others using rasdaemon, AFAICT. It was originally
implemented due to a demand from supercomputer customers with thousands
of nodes in the US, and it has shipped on major distros for quite a
while.

> > I would love for rasdaemon to log over the network and then other
> > tools can query those centralized logs, but that has its own
> > challenges...
>
> I also prefer collecting rasdaemon data in a centralized data center,
> as this is more beneficial for using big data analytics to analyze and
> predict errors. At the same time, the centralized side also uses
> rasdaemon logs as one of the references for machine operations and
> maintenance.
>
> As for rasdaemon itself, it is just a single-node event collector and
> database, although it does also print logs. In practice, we use SLS [1]
> to collect rasdaemon text logs from individual nodes and parse them on
> the central side.

Well, rasdaemon already uses SQL commands to store events in its SQLite
database, so it shouldn't be hard to add a patch series to optionally
write to a centralized database directly. My only concern is that
delivering logs to an external database from a machine that has hardware
errors can be problematic and may eventually end up losing events.
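For reference, the locally stored events can already be read back with
plain SQL. A minimal sketch, assuming the default database path and the
mc_event table name from the builds I've seen; double-check both against
your distro's packaging:

```python
import sqlite3

# Default location on the systems I know; distros may patch this path.
RAS_DB = "/var/lib/rasdaemon/ras-mc_event.db"

def fetch_events(con, table="mc_event"):
    """Return every row of a rasdaemon event table as a list of dicts."""
    con.row_factory = sqlite3.Row
    return [dict(row) for row in con.execute(f"SELECT * FROM {table}")]

# Typical use on a node (or on a db carved out of a coredump):
#   with sqlite3.connect(RAS_DB) as con:
#       for ev in fetch_events(con):
#           print(ev)
```

The same helper works post-mortem: copy the .db file off the dead node
and point sqlite3.connect() at the copy.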
Also, supporting different databases can be problematic due to the
libraries they require. The last time I wrote code to write to an Oracle
DB (a lifetime ago), the number of libraries required was huge. Also,
changing the link order with "-l" caused ld to not find the right
objects. It was messy.

OK, supporting MySQL and PostgreSQL is not that hard. Perhaps a good
compromise would be to add logic to open a local socket or a TCP socket
to a logger daemon, sending the events asynchronously after storing
them locally in SQLite. Then write a Python script using SQLAlchemy.
This way, we gain support for several different databases for free.

Thanks,
Mauro
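PS: a rough sketch of what the SQLAlchemy side of that logger could look
like. The table and column names below are made up for illustration
(rasdaemon's real schema differs), and the DSN would point at MySQL or
PostgreSQL in practice:

```python
import sqlalchemy as sa

metadata = sa.MetaData()

# Illustrative schema only, not rasdaemon's actual one.
events = sa.Table(
    "ras_events", metadata,
    sa.Column("id", sa.Integer, primary_key=True),
    sa.Column("node", sa.String(64)),
    sa.Column("timestamp", sa.String(32)),
    sa.Column("payload", sa.Text),
)

def forward(batch, dsn="sqlite:///:memory:"):
    """Insert a batch of event dicts into whatever DB the DSN names.

    Because SQLAlchemy abstracts the dialect, the same code handles
    sqlite://, mysql:// or postgresql:// DSNs (given the right driver).
    """
    engine = sa.create_engine(dsn)
    metadata.create_all(engine)
    with engine.begin() as con:       # commits on success
        con.execute(events.insert(), batch)
    return engine
```

The daemon receiving events over the socket would just accumulate a
batch and call forward() with the site's DSN, so adding another backend
is a pip install rather than a linker fight.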