From: Ankit Agrawal <ankita@xxxxxxxxxx>

Background
----------
Grace Hopper/Blackwell systems support the Extended GPU Memory (EGM)
feature that enables the GPU to access system memory allocations within
and across nodes over a high bandwidth path. The access path is:
GPU <--> NVswitch <--> GPU <--> CPU. The GPU can utilize system memory
located on the same socket, on a different socket, or even on a
different node in a multi-node system [1].

This feature is being extended to virtualization.

Design Details
--------------
When EGM is enabled in the virtualization stack, the host memory is
partitioned into two parts: one partition for host OS usage, called the
Hypervisor region, and a second Hypervisor-Invisible (HI) region for the
VM. Only the Hypervisor region is part of the host EFI map and is thus
visible to the host OS on bootup. Since the entire VM sysmem is eligible
for EGM allocations within the VM, the HI partition is interchangeably
called the EGM region in this series.

The base SPA and size of the HI/EGM region are exposed through ACPI DSDT
properties. While the EGM region is accessible on the host, it is not
added to the kernel. The HI region is assigned to a VM by mapping the
QEMU VMA to the SPA using remap_pfn_range().

The following figure shows the memory map in the virtualization
environment.

|---- Sysmem ----|                |--- GPU mem ---|   VM Memory
|                |                |               |
|IPA <-> SPA map |                |IPA <-> SPA map|
|                |                |               |
|--- HI / EGM ---|-- Host Mem --| |--- GPU mem ---|   Host Memory

The patch series introduces a new nvgrace-egm auxiliary driver module to
manage and map the HI/EGM region on Grace Blackwell systems. It binds to
the auxiliary device created by the parent nvgrace-gpu (in-tree module
for device assignment) / nvidia-vgpu-vfio (out-of-tree open source module
for SRIOV vGPU) to manage the EGM region for the VM. Note that there is
a unique EGM region per socket, and an auxiliary device gets created for
every region.

The parent module fetches the EGM region information from the ACPI
tables and populates the data structures shared with the auxiliary
nvgrace-egm module.

The nvgrace-egm module handles the following:
1. Fetch the EGM memory properties (base HPA, length, proximity domain)
   from the EGM region structure shared by the parent device.
2. Create a char device that can be used as memory-backend-file by QEMU
   for the VM and implement its file operations. The char device is
   /dev/egmX, where X is the PXM node ID of the EGM being mapped,
   fetched in 1.
3. Zero the EGM memory on first device open().
4. Map the QEMU VMA to the EGM region using remap_pfn_range() (see the
   sketch after this section).
5. Clean up state and destroy the chardev on device unbind.
6. Handle the presence of retired ECC pages in the EGM region.

Since nvgrace-egm is an auxiliary module to nvgrace-gpu, it is kept in
the same directory.
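For illustration, below is a minimal sketch of how such a chardev mmap
handler can back the QEMU VMA with the EGM SPA range via
remap_pfn_range(). This is not the actual driver code from the series;
struct egm_region, its base_hpa/size fields and the egm_mmap()/egm_fops
names are placeholders assumed for the example.

/*
 * Illustrative sketch only: map a usermode (QEMU) VMA onto the EGM
 * system-physical-address range. struct egm_region and its fields are
 * hypothetical, not the structures used by the series.
 */
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/module.h>

struct egm_region {
	phys_addr_t base_hpa;	/* EGM region base SPA from ACPI */
	size_t size;		/* EGM region length */
};

static int egm_mmap(struct file *filp, struct vm_area_struct *vma)
{
	struct egm_region *egm = filp->private_data;
	unsigned long len = vma->vm_end - vma->vm_start;

	/* Reject mappings that would run past the end of the EGM region */
	if ((vma->vm_pgoff << PAGE_SHIFT) + len > egm->size)
		return -EINVAL;

	/* Map the VMA pages directly onto the EGM SPA range */
	return remap_pfn_range(vma, vma->vm_start,
			       PHYS_PFN(egm->base_hpa) + vma->vm_pgoff,
			       len, vma->vm_page_prot);
}

static const struct file_operations egm_fops = {
	.owner = THIS_MODULE,
	.mmap  = egm_mmap,
};

QEMU can then open /dev/egmX and use it as a file-backed memory backend
for the guest system RAM, so the VM sysmem allocations land in the EGM
region.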
Implementation
--------------
Patches 1-4 make changes to the nvgrace-gpu module to fetch the EGM
information, create the auxiliary device and save the EGM region
information in the shared structures.

Patches 5-10 introduce the new nvgrace-egm module to manage the EGM
region. The module implements a char device to expose the EGM to
usermode apps such as QEMU and maps the QEMU VMA to the EGM SPA using
remap_pfn_range().

Patches 11-12 fetch the list of pages on the EGM with known ECC errors.

Patches 13-14 expose the EGM topology and size through sysfs.

Enablement
----------
The EGM mode is enabled through a flag in the SBIOS. The size of the
Hypervisor region is modifiable through a second parameter in the SBIOS.
All the remaining system memory on the host will be invisible to the
Hypervisor.

Verification
------------
Applied over v6.17-rc4 and using the QEMU repository [3]. Tested on the
Grace Blackwell platform by booting up a VM, loading the NVIDIA module
[2] and running nvidia-smi in the VM to check for the presence of the
EGM capability.

To run CUDA workloads, there is a dependency on the Nested Page Table
patches being worked on separately by Shameer Kolothum
(skolothumtho@xxxxxxxxxx).

Recognitions
------------
Many thanks to Jason Gunthorpe, Vikram Sethi and Aniket Agashe for
design suggestions, and to Matt Ochs, Andy Currid, Neo Jia and Kirti
Wankhede, among others, for the review feedback.

Links
-----
Link: https://developer.nvidia.com/blog/nvidia-grace-hopper-superchip-architecture-in-depth/#extended_gpu_memory [1]
Link: https://github.com/NVIDIA/open-gpu-kernel-modules [2]
Link: https://github.com/ankita-nv/nicolinc-qemu/tree/iommufd_veventq-v9-egm-0903 [3]

Ankit Agrawal (14):
  vfio/nvgrace-gpu: Expand module_pci_driver to allow custom module init
  vfio/nvgrace-gpu: Create auxiliary device for EGM
  vfio/nvgrace-gpu: track GPUs associated with the EGM regions
  vfio/nvgrace-gpu: Introduce functions to fetch and save EGM info
  vfio/nvgrace-egm: Introduce module to manage EGM
  vfio/nvgrace-egm: Introduce egm class and register char device numbers
  vfio/nvgrace-egm: Register auxiliary driver ops
  vfio/nvgrace-egm: Expose EGM region as char device
  vfio/nvgrace-egm: Add chardev ops for EGM management
  vfio/nvgrace-egm: Clear Memory before handing out to VM
  vfio/nvgrace-egm: Fetch EGM region retired pages list
  vfio/nvgrace-egm: Introduce ioctl to share retired pages
  vfio/nvgrace-egm: expose the egm size through sysfs
  vfio/nvgrace-gpu: Add link from pci to EGM

 MAINTAINERS                            |  12 +-
 drivers/vfio/pci/nvgrace-gpu/Kconfig   |  11 +
 drivers/vfio/pci/nvgrace-gpu/Makefile  |   5 +-
 drivers/vfio/pci/nvgrace-gpu/egm.c     | 418 +++++++++++++++++++++++++
 drivers/vfio/pci/nvgrace-gpu/egm_dev.c | 174 ++++++++++
 drivers/vfio/pci/nvgrace-gpu/egm_dev.h |  24 ++
 drivers/vfio/pci/nvgrace-gpu/main.c    | 117 ++++++-
 include/linux/nvgrace-egm.h            |  33 ++
 include/uapi/linux/egm.h               |  26 ++
 9 files changed, 816 insertions(+), 4 deletions(-)
 create mode 100644 drivers/vfio/pci/nvgrace-gpu/egm.c
 create mode 100644 drivers/vfio/pci/nvgrace-gpu/egm_dev.c
 create mode 100644 drivers/vfio/pci/nvgrace-gpu/egm_dev.h
 create mode 100644 include/linux/nvgrace-egm.h
 create mode 100644 include/uapi/linux/egm.h

--
2.34.1