Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors

On 2025/7/26 00:16, Breno Leitao wrote:
Hello Shuai,

On Fri, Jul 25, 2025 at 03:40:58PM +0800, Shuai Xue wrote:
APEI does not define an error type named GHES. GHES is just a kernel
driver name. Many hardware error types can be handled in GHES (see
ghes_do_proc), for example, AER is routed by GHES when firmware-first
mode is used. As far as I know, firmware-first mode is commonly used in
production. Should GHES errors be categorized into AER, memory, and CXL
memory instead?

I also considered slicing the data differently initially, but then
realized it would add more complexity than necessary for my needs.

If you believe we should further subdivide the data, I’m happy to do so.

You’re suggesting a structure like this, which would then map to the
corresponding CPER_SEC_ sections:

	enum hwerr_error_type {
		HWERR_RECOV_AER,     // maps to CPER_SEC_PCIE
		HWERR_RECOV_MCE,     // maps to default MCE + CPER_SEC_PCIE

Is CPER_SEC_PCIE a typo here?

Correct, HWERR_RECOV_MCE would map to the regular MCE and not errors
coming from GHES.

		HWERR_RECOV_CXL,     // maps to CPER_SEC_CXL_*
		HWERR_RECOV_MEMORY,  // maps to CPER_SEC_PLATFORM_MEM
	};

Additionally, what about events related to CPU, Firmware, or DMA
errors—for example, CPER_SEC_PROC, CPER_SEC_FW, CPER_SEC_DMAR? Should we
include those in the classification as well?

I would prefer to split each error coming from GHES into its own type;
that sounds more reasonable. I cannot tell what happened from
HWERR_RECOV_AER at all :(

Makes sense. Regarding your answer, I suppose we might want to have
something like the following:

	enum hwerr_error_type {
		HWERR_RECOV_MCE,     // maps to errors in do_machine_check()
		HWERR_RECOV_CXL,     // maps to CPER_SEC_CXL_*
		HWERR_RECOV_PCI,     // maps to AER (pci_dev_aer_stats_incr()), CPER_SEC_PCIE and CPER_SEC_PCI
		HWERR_RECOV_MEMORY,  // maps to CPER_SEC_PLATFORM_MEM*
		HWERR_RECOV_CPU,     // maps to CPER_SEC_PROC_*
		HWERR_RECOV_DMA,     // maps to CPER_SEC_DMAR_*
		HWERR_RECOV_OTHERS,  // maps to CPER_SEC_FW_*, CPER_SEC_DMAR_*
	};

Is this what you think we should track?
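
To make the proposal concrete, here is a rough user-space sketch of how
per-type tracking could hang off such an enum. It assumes that what gets
recorded per type is an event count plus the time of the last event.
HWERR_RECOV_MAX, struct hwerr_info, hwerr_data and hwerr_log_error_type()
are names made up for this illustration, not taken from the patch.

```c
#include <time.h>

/* Classification as proposed in this thread. */
enum hwerr_error_type {
	HWERR_RECOV_MCE,
	HWERR_RECOV_CXL,
	HWERR_RECOV_PCI,
	HWERR_RECOV_MEMORY,
	HWERR_RECOV_CPU,
	HWERR_RECOV_DMA,
	HWERR_RECOV_OTHERS,
	HWERR_RECOV_MAX,	/* hypothetical sentinel, only used to size the table */
};

/* Hypothetical per-type record: number of events and time of the last one. */
struct hwerr_info {
	unsigned int count;
	time_t timestamp;
};

/* One slot per error type; zero-initialized because it has static storage. */
static struct hwerr_info hwerr_data[HWERR_RECOV_MAX];

/*
 * Hypothetical helper: account one recoverable error of the given type.
 * Returns 0 on success, -1 if the type is out of range.
 */
static int hwerr_log_error_type(enum hwerr_error_type type)
{
	if ((unsigned int)type >= HWERR_RECOV_MAX)
		return -1;

	hwerr_data[type].count++;
	hwerr_data[type].timestamp = time(NULL);
	return 0;
}
```

With something along these lines, each handler would call the helper with
its own type, e.g. hwerr_log_error_type(HWERR_RECOV_PCI) from the AER path,
and the table could then be inspected from a crash dump.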

Thanks
--breno

It sounds good to me.

Thanks.
Shuai




