On Wed, Sep 10, 2025 at 3:04 PM Ruidong Tian <tianruidong@xxxxxxxxxxxxxxxxx> wrote:
>
> Hi all,
> This patch series introduces support for handling synchronous hardware errors
> on RISC-V, laying the groundwork for more robust kernel-mode error recovery.
>
> 1. Background
> Hardware error reporting mechanisms typically fall into two categories:
> asynchronous and synchronous.
>
> - Asynchronous errors (e.g., memory scrubbing errors), reported by an
>   asynchronous exception or an interrupt, are usually handled by the GHES
>   subsystem. For instance, ARM uses SDEI, and a similar SSE specification is
>   being proposed for RISC-V.
> - Synchronous errors (e.g., reading poisoned data) cause the processor core to
>   take a precise exception. This is known as a Synchronous External Abort (SEA)
>   on ARM, a Machine Check Exception (MCE) on x86, and is designated as a trap
>   with mcause 19 on RISC-V.
>
> Discussions within the RVI PRS TG have already led to proposals[0] to UEFI for
> standardizing two notification methods, SSE and Hardware Error Exception,
> on RISC-V.
> This series focuses on implementing Hardware Error Exception notification to
> handle synchronous errors. Himanshu Chauhan has already started working on SSE[1].
>
> 2. Motivation
> When a synchronous hardware error occurs in kernel context (e.g., during
> get_user, put_user, CoW, etc.), the kernel requires a fixup mechanism (via
> extable) to recover from the error and prevent a system panic. However, the
> APEI/GHES subsystem, being asynchronous, cannot directly leverage the synchronous
> extable fixup path.
>
> By handling the synchronous exception directly, we enable the use of this fixup
> mechanism, allowing the kernel to gracefully recover from hardware errors
> encountered during kernel execution. This brings RISC-V's error handling
> capabilities closer to the robustness found on ARM[2] and x86[3].
>
> 3. What This Patch Series Does
> This initial series lays the foundational infrastructure.
> It primarily:
> - Introduces a new exception handler for synchronous hardware errors (mcause=19).
> - Establishes the core exception path, which is a prerequisite for kernel
>   context error recovery.
>
> Please note that this version does not yet implement the full kernel fixup logic
> for recovery. That functionality is planned for the next formal version.
>
> Some adaptations for GHES are included, based on the work from Himanshu Chauhan[1].
>
> 4. Future Plans
> - Implement full kernel fixup support to handle and recover from errors in
>   some kernel contexts[2].
> - Add support for handling "double trap" scenarios.
>
> 5. Testing Methodology
>
> test program: ras-tools: https://kernel.googlesource.com/pub/scm/linux/kernel/git/aegl/ras-tools/
> qemu: https://github.com/winterddd/qemu
> official opensbi and edk2:
>
> - Run qemu:
> qemu-system-riscv64 -M virt,pflash0=pflash0,pflash1=pflash1,acpi=on,aia=aplic-imsic \
>   -cpu max -m 64G -smp 64 -device virtio-gpu-pci -full-screen -device qemu-xhci \
>   -device usb-kbd -device virtio-rng-pci \
>   -blockdev node-name=pflash0,driver=file,read-only=on,filename=RISCV_VIRT_CODE.fd \
>   -blockdev node-name=pflash1,driver=file,filename=RISCV_VIRT_VARS.fd \
>   -bios fw_dynamic.bin -device virtio-net-device,netdev=net0 \
>   -netdev user,id=net0,hostfwd=tcp::2223-:22 \
>   -kernel Image -initrd rootfs \
>   -append "rdinit=/sbin/init earlycon verbose debug strict_devmem=0 nokaslr" \
>   -monitor telnet:127.0.0.1:5557,server,nowait -nographic
>
> - Run ras-tools:
> ./einj_mem_uc -j -k single &
> $ 0: single vaddr = 0x7fff86ff4400 paddr = 107d11b400
>
> - Inject poison
> telnet localhost 5557
> poison_enable on
> poison_add 0x107d11b400
>
> - Read poison
> echo trigger > ./trigger_start
> $ triggering ...
> $ signal 7 code 3 addr 0x7fff86ff4400
>
> [0]: https://lists.riscv.org/g/tech-prs/topic/risc_v_ras_related_ecrs/113685653
> [1]: https://patchew.org/linux/20250227123628.2931490-1-hchauhan@xxxxxxxxxxxxxxxx/
> [2]: https://lore.kernel.org/lkml/20241209024257.3618492-1-tongtiangen@xxxxxxxxxx/
> [3]: https://github.com/torvalds/linux/blob/9dd1835ecda5b96ac88c166f4a87386f3e727bd9/arch/x86/kernel/cpu/mce/core.c#L1514
>
> Himanshu Chauhan (2):
>   riscv: Define ioremap_cache for RISC-V
>   riscv: Define arch_apei_get_mem_attribute for RISC-V
>
> Ruidong Tian (3):
>   acpi: Introduce SSE and HEE in HEST notification types
>   riscv: Introduce HEST HEE notification handlers for APEI
>   riscv: Add Hardware Error Exception trap handler
>

Himanshu already sent out RFC v1 back in Feb 2025 [1], which did not
receive any comments or feedback. Instead of sending out a half-baked
series, it would be helpful if you could review Himanshu's series.

Regards,
Anup

[1] https://patchew.org/linux/20250227123628.2931490-1-hchauhan@xxxxxxxxxxxxxxxx/