From: Ankit Agrawal <ankita@xxxxxxxxxx>

Background
----------
Grace Hopper/Blackwell systems support the Extended GPU Memory (EGM)
feature that enables the GPU to access system memory allocations within
and across nodes over a high bandwidth path. The access path is:
GPU <--> NVswitch <--> GPU <--> CPU. The GPU can utilize system memory
located on the same socket, on a different socket, or even on a
different node in a multi-node system [1].

This feature is being extended to virtualization.

Design Details
--------------
When EGM is enabled in the virtualization stack, the host memory is
partitioned into two parts: one partition for host OS usage, called the
Hypervisor region, and a second Hypervisor-Invisible (HI) region for the
VM. Only the Hypervisor region is part of the host EFI map and is thus
visible to the host OS on bootup. Since the entire VM sysmem is eligible
for EGM allocations within the VM, the HI partition is interchangeably
called the EGM region in this series.

The base SPA and size of the HI/EGM region are exposed through ACPI DSDT
properties. While the EGM region is accessible on the host, it is not
added to the kernel. The HI region is assigned to a VM by mapping the
QEMU VMA to the SPA using remap_pfn_range().

The following figure shows the memory map in the virtualization
environment.

|---- Sysmem ----|                |--- GPU mem ---|   VM Memory
|                |                |               |
|IPA <-> SPA map |                |IPA <-> SPA map|
|                |                |               |
|--- HI / EGM ---|-- Host Mem --| |--- GPU mem ---|   Host Memory

The patch series introduces a new nvgrace-egm auxiliary driver module to
manage and map the HI/EGM region on Grace Blackwell systems. It binds to
the auxiliary device created by the parent nvgrace-gpu (in-tree module
for device assignment) / nvidia-vgpu-vfio (out-of-tree open source module
for SRIOV vGPU) to manage the EGM region for the VM. Note that there is
a unique EGM region per socket, and an auxiliary device gets created for
every region.

The parent module fetches the EGM region information from the ACPI
tables and populates the data structures shared with the auxiliary
nvgrace-egm module.

The nvgrace-egm module handles the following:
1. Fetch the EGM memory properties (base HPA, length, proximity domain)
   from the EGM region structure shared by the parent device.
2. Create a char device that can be used as memory-backend-file by QEMU
   for the VM and implement its file operations. The char device is
   /dev/egmX, where X is the PXM node ID of the EGM being mapped,
   fetched in 1.
3. Zero the EGM memory on first device open().
4. Map the QEMU VMA to the EGM region using remap_pfn_range() (see the
   sketch after this section).
5. Clean up state and destroy the chardev on device unbind.
6. Handle the presence of retired ECC pages in the EGM region.

Since nvgrace-egm is an auxiliary module to nvgrace-gpu, it is kept in
the same directory.
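For illustration, below is a minimal sketch of how such a chardev mmap
handler can back the QEMU VMA with the EGM SPA range via
remap_pfn_range(). This is not the actual driver code from the series;
struct egm_region, its base_hpa/size fields and the egm_mmap()/egm_fops
names are placeholders assumed for the example.

/*
 * Illustrative sketch only: map a usermode (QEMU) VMA onto the EGM
 * system-physical-address range. struct egm_region and its fields are
 * hypothetical, not the structures used by the series.
 */
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/module.h>

struct egm_region {
	phys_addr_t base_hpa;	/* EGM region base SPA from ACPI */
	size_t size;		/* EGM region length */
};

static int egm_mmap(struct file *filp, struct vm_area_struct *vma)
{
	struct egm_region *egm = filp->private_data;
	unsigned long len = vma->vm_end - vma->vm_start;

	/* Reject mappings that would run past the end of the EGM region */
	if ((vma->vm_pgoff << PAGE_SHIFT) + len > egm->size)
		return -EINVAL;

	/* Map the VMA pages directly onto the EGM SPA range */
	return remap_pfn_range(vma, vma->vm_start,
			       PHYS_PFN(egm->base_hpa) + vma->vm_pgoff,
			       len, vma->vm_page_prot);
}

static const struct file_operations egm_fops = {
	.owner = THIS_MODULE,
	.mmap  = egm_mmap,
};

QEMU can then open /dev/egmX and use it as a file-backed memory backend
for the guest system RAM, so the VM sysmem allocations land in the EGM
region.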
Implementation
--------------
Patches 1-4 make changes to the nvgrace-gpu module to fetch the EGM
information, create the auxiliary device and save the EGM region
information in the shared structures.

Patches 5-10 introduce the new nvgrace-egm module to manage the EGM
region. The module implements a char device to expose the EGM to
usermode apps such as QEMU and maps the QEMU VMA to the EGM SPA using
remap_pfn_range().

Patches 11-12 fetch the list of pages on the EGM with known ECC errors.

Patches 13-14 expose the EGM topology and size through sysfs.

Enablement
----------
The EGM mode is enabled through a flag in the SBIOS. The size of the
Hypervisor region is modifiable through a second parameter in the SBIOS.
All the remaining system memory on the host will be invisible to the
Hypervisor.

Verification
------------
Applied over v6.17-rc4 and using the QEMU repository [3]. Tested on the
Grace Blackwell platform by booting up a VM, loading the NVIDIA module
[2] and running nvidia-smi in the VM to check for the presence of the
EGM capability.

To run CUDA workloads, there is a dependency on the Nested Page Table
patches being worked on separately by Shameer Kolothum
(skolothumtho@xxxxxxxxxx).

Recognitions
------------
Many thanks to Jason Gunthorpe, Vikram Sethi and Aniket Agashe for
design suggestions, and to Matt Ochs, Andy Currid, Neo Jia and Kirti
Wankhede, among others, for the review feedback.

Links
-----
Link: https://developer.nvidia.com/blog/nvidia-grace-hopper-superchip-architecture-in-depth/#extended_gpu_memory [1]
Link: https://github.com/NVIDIA/open-gpu-kernel-modules [2]
Link: https://github.com/ankita-nv/nicolinc-qemu/tree/iommufd_veventq-v9-egm-0903 [3]

Ankit Agrawal (14):
  vfio/nvgrace-gpu: Expand module_pci_driver to allow custom module init
  vfio/nvgrace-gpu: Create auxiliary device for EGM
  vfio/nvgrace-gpu: track GPUs associated with the EGM regions
  vfio/nvgrace-gpu: Introduce functions to fetch and save EGM info
  vfio/nvgrace-egm: Introduce module to manage EGM
  vfio/nvgrace-egm: Introduce egm class and register char device numbers
  vfio/nvgrace-egm: Register auxiliary driver ops
  vfio/nvgrace-egm: Expose EGM region as char device
  vfio/nvgrace-egm: Add chardev ops for EGM management
  vfio/nvgrace-egm: Clear Memory before handing out to VM
  vfio/nvgrace-egm: Fetch EGM region retired pages list
  vfio/nvgrace-egm: Introduce ioctl to share retired pages
  vfio/nvgrace-egm: expose the egm size through sysfs
  vfio/nvgrace-gpu: Add link from pci to EGM

 MAINTAINERS                            |  12 +-
 drivers/vfio/pci/nvgrace-gpu/Kconfig   |  11 +
 drivers/vfio/pci/nvgrace-gpu/Makefile  |   5 +-
 drivers/vfio/pci/nvgrace-gpu/egm.c     | 418 +++++++++++++++++++++++++
 drivers/vfio/pci/nvgrace-gpu/egm_dev.c | 174 ++++++++++
 drivers/vfio/pci/nvgrace-gpu/egm_dev.h |  24 ++
 drivers/vfio/pci/nvgrace-gpu/main.c    | 117 ++++++-
 include/linux/nvgrace-egm.h            |  33 ++
 include/uapi/linux/egm.h               |  26 ++
 9 files changed, 816 insertions(+), 4 deletions(-)
 create mode 100644 drivers/vfio/pci/nvgrace-gpu/egm.c
 create mode 100644 drivers/vfio/pci/nvgrace-gpu/egm_dev.c
 create mode 100644 drivers/vfio/pci/nvgrace-gpu/egm_dev.h
 create mode 100644 include/linux/nvgrace-egm.h
 create mode 100644 include/uapi/linux/egm.h

--
2.34.1