Hi all, This patch series introduces support for handling synchronous hardware errors on RISC-V, laying the groundwork for more robust kernel-mode error recovery. 1. Background Hardware error reporting mechanisms typically fall into two categories: asynchronous and synchronous. - Asynchronous errors (e.g., memory scrubbing errors) repoted by a asynchronous exceptions or a interrupt, are usually handled by GHES subsystems. For instance, ARM uses SDEI, and a similar SSE specification is being proposed for RISC-V. - Synchronous errors (e.g., reading poisoned data) cause the processor core to take a precise exception. This is known as a Synchronous External Abort (SEA) on ARM, a Machine Check Exception (MCE) on x86, and is designated as trap with mcause 19 on RISC-V. Discussions within the RVI PRS TG have already led to proposals[0] to UEFI for standardizing two notification methods, SSE and Hardware Error Exception, on RISC-V. This series focuses on implementing Hardware Error Exception notification to handle synchronous errors. Himanshu Chauhan has already started working on SSE[1]. 2. Motivation While a synchronous hardware errors occurring in kernel context (e.g., during get_user, put_user, CoW, etc.). The kernel requires a fixup mechanism (via extable) to recover from such errors and prevent a system panic. However, the APEI/GHES subsystem, being asynchronous, cannot directly leverage the synchronous extable fixup path. By handling the synchronous exception directly, we enable the use of this fixup mechanism, allowing the kernel to gracefully recover from hardware errors encountered during kernel execution. This brings RISC-V's error handling capabilities closer to the robustness found on ARM[2] and x86[3]. 3. What This Patch Series Does This initial series lays the foundational infrastructure. It primarily: - Introduces a new exception handler for synchronous hardware errors (mcause=19). - Establishes the core exception path, which is a prerequisite for kernel context error recovery. Please note that this version does not yet implement the full kernel fixup logic for recovery. That functionality is planned for the next formal version. Some adaptations for GHES are included, based on the work from Himanshu Chauhan[1] 4. Future Plans - Implement full kernel fixup support to handle and recover from errors in some kernel context[2]. - Add support for handling "double trap" scenarios. 5. Testing Methodology test program: ras-tools: https://kernel.googlesource.com/pub/scm/linux/kernel/git/aegl/ras-tools/ qemu: https://github.com/winterddd/qemu offcial opensbi and edk2: - Run qemu: qemu-system-riscv64 -M virt,pflash0=pflash0,pflash1=pflash1,acpi=on,aia=aplic-imsic -cpu max -m 64G -smp 64 -device virtio-gpu-pci -full-screen -device qemu-xhci -device usb-kbd -device virtio-rng-pci -blockdev node-name=pflash0,driver=file,read-only=on,filename=RISCV_VIRT_CODE.fd -blockdev node-name=pflash1,driver=file,filename=RISCV_VIRT_VARS.fd -bios fw_dynamic.bin -device virtio-net-device,netdev=net0 -netdev user,id=net0,hostfwd=tcp::2223-:22 -kernel Image -initrd rootfs -append "rdinit=/sbin/init earlycon verbose debug strict_devmem=0 nokaslr" -monitor telnet:127.0.0.1:5557,server,nowait -nographic - Run ras-tools: ./einj_mem_uc -j -k single & $ 0: single vaddr = 0x7fff86ff4400 paddr = 107d11b400 - Inject poison telnet localhost 5557 poison_enable on poison_add 0x107d11b400 - Read poison echo trigger > ./trigger_start $ triggering ... $ signal 7 code 3 addr 0x7fff86ff4400 [0]: https://lists.riscv.org/g/tech-prs/topic/risc_v_ras_related_ecrs/113685653 [1]: https://patchew.org/linux/20250227123628.2931490-1-hchauhan@xxxxxxxxxxxxxxxx/ [2]: https://lore.kernel.org/lkml/20241209024257.3618492-1-tongtiangen@xxxxxxxxxx/ [3]: https://github.com/torvalds/linux/blob/9dd1835ecda5b96ac88c166f4a87386f3e727bd9/arch/x86/kernel/cpu/mce/core.c#L1514 Himanshu Chauhan (2): riscv: Define ioremap_cache for RISC-V riscv: Define arch_apei_get_mem_attribute for RISC-V Ruidong Tian (3): acpi: Introduce SSE and HEE in HEST notification types riscv: Introduce HEST HEE notification handlers for APEI riscv: Add Hardware Error Exception trap handler arch/riscv/Kconfig | 1 + arch/riscv/include/asm/acpi.h | 22 +++++++++++++ arch/riscv/include/asm/fixmap.h | 6 ++++ arch/riscv/include/asm/io.h | 3 ++ arch/riscv/kernel/acpi.c | 55 +++++++++++++++++++++++++++++++ arch/riscv/kernel/entry.S | 4 +++ arch/riscv/kernel/traps.c | 19 +++++++++++ drivers/acpi/apei/Kconfig | 12 +++++++ drivers/acpi/apei/ghes.c | 58 +++++++++++++++++++++++++++++++++ include/acpi/actbl1.h | 4 ++- include/acpi/ghes.h | 6 ++++ 11 files changed, 189 insertions(+), 1 deletion(-) -- 2.43.7