Re: [RFC PATCH v1 00/10] Add RAS support for RISC-V architecture

On 2025/2/27 20:36, Himanshu Chauhan wrote:
This series implements the RAS (Reliability, Availability and Serviceability)
support for RISC-V architecture using RISC-V RERI specification. It is conformant
to ACPI platform error interfaces (APEI). It uses the highest priority
Supervisor Software Events (SSE)[2] to deliver the hardware error events to the kernel.
The SSE implementation has already been merged in OpenSBI. Clement has sent a patch series for
its implementation in the Linux kernel.[5]

The GHES driver framework is used as is with the following changes for RISC-V:
	1. Register each GHES entry with the SSE layer. The GHES notification vector is an SSE event.
	2. Add RISC-V specific entries for processor type and ISA string
	3. Add fixmap indices GHES SSE Low and High Priority to help map and read from
	   physical addresses present in GHES entry.
	4. Other changes to build/configure the RAS support

How to Use:
----------
This RAS stack consists of Qemu[3], OpenSBI, EDK2[4], Linux kernel and devmem utility to inject and trigger
errors. Qemu[3] has support to emulate RISC-V RERI. The RAS agent is implemented in OpenSBI, which
creates CPER records. EDK2 generates HEST table and populates it with GHES entries with the help of
OpenSBI.

Qemu Command:
------------
<qemu-dir>/build/qemu-system-riscv64 \
     -s -accel tcg -m 4096 -smp 2 \
     -cpu rv64,smepmp=false \
     -serial mon:stdio \
     -d guest_errors -D ./qemu.log \
     -bios <opensbi-dir>/build/platform/generic/firmware/fw_dynamic.bin \
     -monitor telnet:127.0.0.1:55555,server,nowait \
     -device virtio-gpu-pci -full-screen \
     -device qemu-xhci \
     -device usb-kbd \
     -blockdev node-name=pflash0,driver=file,read-only=on,filename=<edk2-build-dir>/RiscVVirtQemu/RELEASE_GCC5/FV/RISCV_VIRT_CODE.fd \
     -blockdev node-name=pflash1,driver=file,filename=<edk2-build-dir>/RiscVVirtQemu/RELEASE_GCC5/FV/RISCV_VIRT_VARS.fd \
     -M virt,pflash0=pflash0,pflash1=pflash1,rpmi=true,reri=true,aia=aplic-imsic \
     -kernel <kernel image> \
     -initrd <rootfs image> \
     -append "root=/dev/ram rw console=ttyS0 earlycon=uart8250,mmio,0x10000000"

Error Injection & Triggering:
----------------------------
devmem 0x4010040 32 0x2a1
devmem 0x4010048 32 0x9001404
devmem 0x4010044 8 1

The above commands inject a TLB error on CPU 0.
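
For repeated testing, the three writes can be wrapped in a small shell function.
This is only a sketch: the function name and the DEVMEM override are hypothetical
and not part of the series; the addresses, widths and values are copied verbatim
from the commands above.

```shell
#!/bin/sh
# Hypothetical wrapper around the three devmem writes from the cover letter.
# DEVMEM is an assumed override hook: set DEVMEM=echo for a dry run that
# prints the arguments instead of touching memory.
DEVMEM="${DEVMEM:-devmem}"

inject_tlb_error_cpu0() {
    # Stop at the first failing write so a partial injection is noticed.
    "$DEVMEM" 0x4010040 32 0x2a1 &&
    "$DEVMEM" 0x4010048 32 0x9001404 &&
    "$DEVMEM" 0x4010044 8 1
}
```

Calling inject_tlb_error_cpu0 in the guest (as root) issues the writes in the same
order as above; with DEVMEM=echo it only lists them.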

Sample Output (CPU 0):
---------------------
[   34.370282] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[   34.371375] {1}[Hardware Error]: event severity: recoverable
[   34.372149] {1}[Hardware Error]:  Error 0, type: recoverable
[   34.372756] {1}[Hardware Error]:   section_type: general processor error
[   34.373357] {1}[Hardware Error]:   processor_type: 3, RISCV
[   34.373806] {1}[Hardware Error]:   processor_isa: 6, RISCV64
[   34.374294] {1}[Hardware Error]:   error_type: 0x02
[   34.374845] {1}[Hardware Error]:   TLB error
[   34.375448] {1}[Hardware Error]:   operation: 1, data read
[   34.376100] {1}[Hardware Error]:   target_address: 0x0000000000000000
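
A quick way to confirm that the injection was reported is to search the kernel log
for the strings shown in the sample output. The helper below is a sketch; its name
is hypothetical, and it only matches the three strings quoted above, so it may need
adjusting if the log format differs.

```shell
#!/bin/sh
# Hypothetical helper: reads kernel log text on stdin and succeeds only if it
# contains a GHES report, the RISC-V processor type and a TLB error, using
# the strings from the sample output above.
has_riscv_tlb_error() {
    log=$(cat)
    printf '%s\n' "$log" | grep -q 'Hardware error from APEI Generic Hardware Error Source' &&
    printf '%s\n' "$log" | grep -q 'processor_type: 3, RISCV' &&
    printf '%s\n' "$log" | grep -q 'TLB error'
}
```

For example: dmesg | has_riscv_tlb_error && echo "TLB error reported"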

References:
----------
[1] RERI Specification: https://github.com/riscv-non-isa/riscv-ras-eri/releases/download/v1.0/riscv-reri.pdf
[2] SSE section in the SBI specification v3.0-rc3: https://github.com/riscv-non-isa/riscv-sbi-doc/releases/download/v3.0-rc3/riscv-sbi.pdf
[3] Qemu source (with RERI emulation support): https://github.com/ventanamicro/qemu.git (branch: dev-upstream)
[4] EDK2: https://github.com/ventanamicro/edk2.git (branch: dev-upstream)
[5] SSE Kernel Patches: https://lore.kernel.org/linux-riscv/649fdead-09b0-4f94-a6ff-099fc970d890@xxxxxxxxxxxx/T/

Hi,

Thanks for this series.

I'm doing some work related to your patch. Besides SSE, I'm working on support
for another notification type for synchronous hardware errors (e.g., on a poison
read), which is called Hardware Error Exception (HEE) in Dhaval Sharma's UEFI
proposal[0] in PRS-TG. I have a patch for HEE support, which I've sent out
separately[1].

Perhaps we could merge my work into your patchset to bring a complete RAS
solution to the RISC-V architecture? Alternatively, I'm happy to wait for your
patches to land and then continue my work on top.

Let me know what you think would be best.

Cheers,
Ruidong Tian

[0]: https://lists.riscv.org/g/tech-prs/topic/risc_v_ras_related_ecrs/113685653
[1]: https://lore.kernel.org/all/20250910093347.75822-6-tianruidong@xxxxxxxxxxxxxxxxx/

Himanshu Chauhan (10):
   riscv: Define ioremap_cache for RISC-V
   riscv: Define arch_apei_get_mem_attribute for RISC-V
   acpi: Introduce SSE in HEST notification types
   riscv: Add fixmap indices for GHES IRQ and SSE contexts
   riscv: conditionally compile GHES NMI spool function
   riscv: Add functions to register ghes having SSE notification
   riscv: Add RISC-V entries in processor type and ISA strings
   riscv: Introduce HEST SSE notification handlers
   riscv: Add config option to enable APEI SSE handler
   riscv: Enable APEI and NMI safe cmpxchg options required for RAS

  arch/riscv/Kconfig                 |   2 +
  arch/riscv/include/asm/acpi.h      |  20 ++++
  arch/riscv/include/asm/fixmap.h    |   8 ++
  arch/riscv/include/asm/io.h        |   3 +
  drivers/acpi/apei/Kconfig          |   5 +
  drivers/acpi/apei/ghes.c           | 102 +++++++++++++++++---
  drivers/firmware/efi/cper.c        |   3 +
  drivers/firmware/riscv/riscv_sse.c | 147 +++++++++++++++++++++++++++++
  include/acpi/actbl1.h              |   3 +-
  include/linux/riscv_sse.h          |  15 +++
  10 files changed, 296 insertions(+), 12 deletions(-)




