[RFC v2 00/14] Introduce NVIDIA GPU Virtualization (vGPU) Support

1. Background
=============

NVIDIA vGPU[1] software enables powerful GPU performance for workloads
ranging from graphics-rich virtual workstations to data science and AI,
enabling IT to leverage the management and security benefits of
virtualization as well as the performance of NVIDIA GPUs required for
modern workloads. Installed on a physical GPU in a cloud or enterprise
data center server, NVIDIA vGPU software creates virtual GPUs that can
be shared across multiple virtual machines.

The vGPU architecture[2] can be illustrated as follows:

 +--------------------+    +--------------------+ +--------------------+ +--------------------+ 
 | Hypervisor         |    | Guest VM           | | Guest VM           | | Guest VM           | 
 |                    |    | +----------------+ | | +----------------+ | | +----------------+ | 
 | +----------------+ |    | |Applications... | | | |Applications... | | | |Applications... | | 
 | |  NVIDIA        | |    | +----------------+ | | +----------------+ | | +----------------+ | 
 | |  Virtual GPU   | |    | +----------------+ | | +----------------+ | | +----------------+ | 
 | |  Manager       | |    | |  Guest Driver  | | | |  Guest Driver  | | | |  Guest Driver  | | 
 | +------^---------+ |    | +----------------+ | | +----------------+ | | +----------------+ | 
 |        |           |    +---------^----------+ +----------^---------+ +----------^---------+ 
 |        |           |              |                       |                      |           
 |        |           +--------------+-----------------------+----------------------+---------+ 
 |        |                          |                       |                      |         | 
 |        |                          |                       |                      |         | 
 +--------+--------------------------+-----------------------+----------------------+---------+ 
+---------v--------------------------+-----------------------+----------------------+----------+
| NVIDIA                  +----------v---------+ +-----------v--------+ +-----------v--------+ |
| Physical GPU            |   Virtual GPU      | |   Virtual GPU      | |   Virtual GPU      | |
|                         +--------------------+ +--------------------+ +--------------------+ |
+----------------------------------------------------------------------------------------------+

Each NVIDIA vGPU is analogous to a conventional GPU, having a fixed amount
of GPU framebuffer, and one or more virtual display outputs or "heads".
The vGPU's framebuffer is allocated out of the physical GPU's framebuffer
at the time the vGPU is created, and the vGPU retains exclusive use of
that framebuffer until it is destroyed.
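
The exclusive-reservation behaviour described above can be modelled in a
few lines. This is only an illustrative sketch, not code from this
series: the FramebufferPool class, its methods, and the MiB sizes are
made up for the example.

```python
class FramebufferPool:
    """Toy model of a physical GPU's framebuffer, sized in MiB."""

    def __init__(self, total_mib):
        self.total_mib = total_mib
        self.reserved = {}  # hypothetical vGPU id -> MiB held exclusively

    def free_mib(self):
        return self.total_mib - sum(self.reserved.values())

    def create_vgpu(self, vgpu_id, fb_mib):
        # The framebuffer is carved out at creation time, or creation fails.
        if vgpu_id in self.reserved:
            raise ValueError(f"vGPU {vgpu_id} already exists")
        if fb_mib > self.free_mib():
            raise MemoryError("not enough free framebuffer")
        self.reserved[vgpu_id] = fb_mib

    def destroy_vgpu(self, vgpu_id):
        # The reservation is only released when the vGPU is destroyed.
        del self.reserved[vgpu_id]
```

With a 48 GiB pool, two 24 GiB vGPUs fit and a third creation attempt
raises MemoryError until one of the existing vGPUs is destroyed.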

Each physical GPU can support several different types of virtual GPU
(vGPU). vGPU types have a fixed amount of framebuffer, number of
supported display heads, and maximum resolutions. They are grouped into
different series according to the different classes of workload for which
they are optimized. Each series is identified by the last letter of the
vGPU type name.
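
As a small illustration of the naming rule, the sketch below extracts the
series letter from a type name and groups type names by series. The
helper names and the "EXAMPLE" board prefix are hypothetical; only the
"-24Q" suffix style is taken from the demo referenced in this letter.

```python
from collections import defaultdict


def vgpu_series(type_name):
    """Return the series letter, i.e. the last letter of the type name."""
    name = type_name.strip()
    if not name or not name[-1].isalpha():
        raise ValueError(f"no series letter in {type_name!r}")
    return name[-1].upper()


def group_by_series(type_names):
    """Map each series letter to the type names that belong to it."""
    groups = defaultdict(list)
    for name in type_names:
        groups[vgpu_series(name)].append(name)
    return dict(groups)
```

For instance, "EXAMPLE-24Q" and "EXAMPLE-8Q" would both land in series
"Q", while "EXAMPLE-2B" would land in series "B".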

NVIDIA vGPU supports Windows and Linux guest VM operating systems. The
supported vGPU types depend on the guest VM OS.

2. Proposal For Upstream
========================

2.1 Architecture
----------------

For upstream, the proposed architecture can be illustrated as follows:

                            +--------------------+ +--------------------+ +--------------------+ 
                            | Linux VM           | | Windows VM         | | Guest VM           | 
                            | +----------------+ | | +----------------+ | | +----------------+ | 
                            | |Applications... | | | |Applications... | | | |Applications... | | 
                            | +----------------+ | | +----------------+ | | +----------------+ | ... 
                            | +----------------+ | | +----------------+ | | +----------------+ | 
                            | |  Guest Driver  | | | |  Guest Driver  | | | |  Guest Driver  | | 
                            | +----------------+ | | +----------------+ | | +----------------+ | 
                            +---------^----------+ +----------^---------+ +----------^---------+ 
                                      |                       |                      |           
                           +--------------------------------------------------------------------+
                           |+--------------------+ +--------------------+ +--------------------+|
                           ||       QEMU         | |       QEMU         | |       QEMU         ||
                           ||                    | |                    | |                    ||
                           |+--------------------+ +--------------------+ +--------------------+|
                           +--------------------------------------------------------------------+
                                      |                       |                      |
+-----------------------------------------------------------------------------------------------+
|                           +----------------------------------------------------------------+  |
|                           |                                VFIO                            |  |
|                           |                                                                |  |
| +-----------------------+ | +-------------------------------------------------------------+|  |
| |                       | | |                                                             ||  |
| |     nova_core        <--->|                                                             ||  |
| +    (core driver)      + | |                      NVIDIA vGPU VFIO Driver                ||  |
| |                       | | |                                                             ||  |
| |                       | | +-------------------------------------------------------------+|  |
| +--------^--------------+ +----------------------------------------------------------------+  |
|          |                          |                       |                      |          |
+-----------------------------------------------------------------------------------------------+
           |                          |                       |                      |           
+----------|--------------------------|-----------------------|----------------------|----------+
|          v               +----------v---------+ +-----------v--------+ +-----------v--------+ |
|  NVIDIA                  |       PCI VF       | |       PCI VF       | |       PCI VF       | |
|  Physical GPU            |                    | |                    | |                    | |
|                          |   (Virtual GPU)    | |   (Virtual GPU)    | |    (Virtual GPU)   | |
|                          +--------------------+ +--------------------+ +--------------------+ |
+-----------------------------------------------------------------------------------------------+

Each virtual GPU (vGPU) instance is implemented atop a PCIe Virtual
Function (VF). The NVIDIA vGPU VFIO driver, in coordination with the
VFIO framework, operates directly on these VFs to enable key
functionalities including vGPU type selection, dynamic instantiation and
destruction of vGPU instances, support for live migration, and warm
update.

Consistent with other VFIO variant drivers, the NVIDIA vGPU VFIO driver
adheres to the standard VFIO userspace interface, facilitating device
lifecycle management and integration with advanced VFIO capabilities.

At the low level, the NVIDIA vGPU VFIO driver interfaces with the core
driver, which provides the necessary abstractions and mechanisms to access
and manipulate the underlying GPU hardware resources.

2.2 Core Driver (nova_core)
---------------------------

The primary deployment model for cloud service providers (CSPs) and
enterprise environments is a standalone, minimal driver stack with vGPU
support and other essential components. A minimal core driver is
therefore required to support the NVIDIA vGPU VFIO driver.

The core GPU driver provides the foundational infrastructure necessary
to support the following operations:

- Firmware management: Load the GSP (GPU System Processor) firmware,
initiate GSP boot procedures, and establish the communication channel
between the host and the GSP.

- Hardware resource management: Control and partition shared GPU
resources, such as framebuffer memory and hardware channels, used by the
VFIO driver for instantiating and operating vGPUs.

- Exception handling: Relay hardware and firmware-level exception events,
including GSP notifications, to the VFIO driver. E.g. FIFO nonstall.

- Host event coordination: Handle system-wide events such as suspend and
resume, PF driver unbind, etc., ensuring proper synchronization with GPU
subsystems.

- Hardware configuration enumeration: Discover and expose static and
dynamic hardware capabilities required for vGPU orchestration. E.g.
engine bitmap, total FB memory size.
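
To make the exception-handling and host-event items more concrete, here
is a rough model of how a core driver could forward events to a
registered consumer such as the VFIO driver. It is a sketch under
assumptions: the EventRelay class, the event names, and the callback
signature are invented for illustration and are not the interface
implemented in this series.

```python
class EventRelay:
    """Toy dispatcher: the core side calls relay(), consumers subscribe()."""

    def __init__(self):
        self._handlers = {}  # event name -> list of callbacks

    def subscribe(self, event, handler):
        self._handlers.setdefault(event, []).append(handler)

    def unsubscribe(self, event, handler):
        self._handlers[event].remove(handler)

    def relay(self, event, payload=None):
        """Invoke every handler registered for the event; return the count."""
        handlers = list(self._handlers.get(event, []))
        for handler in handlers:
            handler(event, payload)
        return len(handlers)
```

A consumer would subscribe to a hypothetical "pf_unbind" event and tear
down its vGPUs in the callback before the PF driver goes away.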

2.3 NVIDIA vGPU VFIO Driver
---------------------------

The NVIDIA vGPU VFIO driver exposes standard VFIO interfaces for userspace
access to vGPUs, while also providing control paths for vGPU creation and
destruction with the help of the core driver.

The driver provides an additional sysfs interface for the admin to query
the creatable vGPU types on a VF. Once the vGPU type is selected, the
userspace VMM, e.g. QEMU, can manipulate the VF via the standard VFIO
device interfaces. Only homogeneous vGPUs are supported so far.
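
One way to picture the homogeneous restriction: once any vGPU exists on a
physical GPU, only that same type remains creatable on the other VFs. The
sketch below models that rule. The class and method names are
hypothetical, and the driver in this series may enforce the restriction
differently.

```python
class PhysicalGpu:
    """Toy model of per-VF creatable types under a homogeneous scheme."""

    def __init__(self, supported_types):
        self.supported_types = list(supported_types)
        self.active = {}  # VF index -> type name of the vGPU on that VF

    def creatable_types(self, vf):
        if vf in self.active:
            return []  # this VF already hosts a vGPU
        in_use = set(self.active.values())
        if not in_use:
            return list(self.supported_types)
        # Homogeneous: only the type already in use may be created.
        return [t for t in self.supported_types if t in in_use]

    def create(self, vf, type_name):
        if type_name not in self.creatable_types(vf):
            raise ValueError(f"{type_name!r} is not creatable on VF {vf}")
        self.active[vf] = type_name

    def destroy(self, vf):
        self.active.pop(vf)
```

After the last vGPU is destroyed, every supported type becomes creatable
again.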

As different NVIDIA GPUs support different vGPU types, a loadable vGPU
metadata file is introduced to hold the blobs that describe the vGPU
types supported on each supported NVIDIA GPU. It is loaded together with
the VFIO driver. The VFIO driver chooses the usable vGPU types from it
based on the NVIDIA GPU installed in the system.
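
The selection step can be pictured as a simple filter over the records in
the metadata file. The on-disk layout of that file is not described in
this letter, so the record fields and the device IDs used below are
placeholders chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VgpuTypeRecord:
    """Hypothetical in-memory form of one vGPU type entry."""
    device_id: int  # placeholder PCI device ID the type applies to
    name: str
    fb_mib: int


def usable_types(records, installed_device_id):
    """Keep only the types that match the GPU found in the system."""
    return [r for r in records if r.device_id == installed_device_id]
```

If no record matches the installed GPU, the result is empty and no vGPU
type would be offered.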

The driver also exposes a per-vGPU logging interface to collect the GSP
logs for bug reports.

2.4 Changes from RFC [3]
------------------------

- vGPU is supported since GSP microcode 570.
- Multiple vGPU support with homogeneous scheme.
- CE workload submission for FB memory scrubbing.
- Interface to create/destroy/select vGPUs.
- Loadable vGPU type support via vGPU metadata file.
- Expose per-vGPU GSP log.
- Proper VFIO driver attach/detach flow.
- PF driver event forwarding to support PF driver unbind by admin.

3. Try the patches
==================

- Host kernel: http://github.com/zhiwang-nvidia/linux/tree/zhi/vgpu-rfc-v2
- vGPU metadata file: https://github.com/zhiwang-nvidia/vgpu-tools/blob/metadata/metadata/18.1/vgpu-570.144.bin
  The metadata file needs to be placed at: /lib/firmware/nvidia
- Guest driver package: NVIDIA-Linux-x86_64-570.124.04.run [4]

  Install guest driver:
  # export GRID_BUILD=1
  # ./NVIDIA-Linux-x86_64-570.124.04.run

- Tested platforms: RTX A6000 Ada.
- Tested host OS: RHEL 8.4.
- Tested guest OS: Ubuntu 24.04 LTS, Windows 11.
- Supported experience: Rich desktop experience with simple 3D workloads,
  e.g. glmark2, heaven.

- Demo video: running heaven on two -24Q vGPUs on NVIDIA RTX A6000 Ada [5]

[1] https://www.nvidia.com/en-us/data-center/virtual-solutions/
[2] https://docs.nvidia.com/vgpu/17.0/grid-vgpu-user-guide/index.html#architecture-grid-vgpu
[3] https://lore.kernel.org/kvm/20240922161121.000060a0.zhiw@xxxxxxxxxx/T/
[4] https://us.download.nvidia.com/XFree86/Linux-x86_64/570.124.04/NVIDIA-Linux-x86_64-570.124.04.run
[5] https://youtu.be/DhW--wVlLfU

Zhi Wang (14):
  vfio/nvidia-vgpu: introduce vGPU lifecycle management prelude
  vfio/nvidia-vgpu: allocate GSP RM client for NVIDIA vGPU manager
  vfio/nvidia-vgpu: introduce vGPU type uploading
  vfio/nvidia-vgpu: allocate vGPU channels when creating vGPUs
  vfio/nvidia-vgpu: allocate vGPU FB memory when creating vGPUs
  vfio/nvidia-vgpu: allocate mgmt heap when creating vGPUs
  vfio/nvidia-vgpu: map mgmt heap when creating a vGPU
  vfio/nvidia-vgpu: allocate GSP RM client when creating vGPUs
  vfio/nvidia-vgpu: bootload the new vGPU
  vfio/nvidia-vgpu: introduce vGPU host RPC channel
  vfio/nvidia-vgpu: introduce NVIDIA vGPU VFIO variant driver
  vfio/nvidia-vgpu: scrub the guest FB memory of a vGPU
  vfio/nvidia-vgpu: introduce vGPU logging
  vfio/nvidia-vgpu: add a kernel doc to introduce NVIDIA vGPU

 .../ABI/stable/sysfs-driver-nvidia-vgpu       |  11 +
 Documentation/gpu/drivers.rst                 |   1 +
 Documentation/gpu/nvidia-vgpu.rst             | 264 +++++++
 drivers/vfio/pci/Kconfig                      |   2 +
 drivers/vfio/pci/Makefile                     |   2 +
 drivers/vfio/pci/nvidia-vgpu/Kconfig          |  15 +
 drivers/vfio/pci/nvidia-vgpu/Makefile         |   6 +
 drivers/vfio/pci/nvidia-vgpu/debug.h          |  35 +
 drivers/vfio/pci/nvidia-vgpu/debugfs.c        |  65 ++
 .../pci/nvidia-vgpu/include/nvrm/bootload.h   |  58 ++
 .../vfio/pci/nvidia-vgpu/include/nvrm/ecc.h   |  45 ++
 .../vfio/pci/nvidia-vgpu/include/nvrm/gsp.h   |  18 +
 .../nvidia-vgpu/include/nvrm/nv_vgpu_types.h  |  34 +
 .../pci/nvidia-vgpu/include/nvrm/nvtypes.h    |  26 +
 .../vfio/pci/nvidia-vgpu/include/nvrm/vgpu.h  | 182 +++++
 .../vfio/pci/nvidia-vgpu/include/nvrm/vmmu.h  |  39 +
 drivers/vfio/pci/nvidia-vgpu/metadata.c       | 319 ++++++++
 drivers/vfio/pci/nvidia-vgpu/metadata.h       |  89 +++
 .../vfio/pci/nvidia-vgpu/metadata_vgpu_type.c | 153 ++++
 drivers/vfio/pci/nvidia-vgpu/pf.h             | 145 ++++
 drivers/vfio/pci/nvidia-vgpu/rpc.c            | 254 ++++++
 drivers/vfio/pci/nvidia-vgpu/vfio.h           |  65 ++
 drivers/vfio/pci/nvidia-vgpu/vfio_access.c    | 313 ++++++++
 drivers/vfio/pci/nvidia-vgpu/vfio_debugfs.c   | 117 +++
 drivers/vfio/pci/nvidia-vgpu/vfio_main.c      | 730 ++++++++++++++++++
 drivers/vfio/pci/nvidia-vgpu/vfio_sysfs.c     | 209 +++++
 drivers/vfio/pci/nvidia-vgpu/vgpu.c           | 690 +++++++++++++++++
 drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c       | 450 +++++++++++
 drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h       | 231 ++++++
 29 files changed, 4568 insertions(+)
 create mode 100644 Documentation/ABI/stable/sysfs-driver-nvidia-vgpu
 create mode 100644 Documentation/gpu/nvidia-vgpu.rst
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/Kconfig
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/Makefile
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/debug.h
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/debugfs.c
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/include/nvrm/bootload.h
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/include/nvrm/ecc.h
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/include/nvrm/gsp.h
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/include/nvrm/nv_vgpu_types.h
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/include/nvrm/nvtypes.h
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/include/nvrm/vgpu.h
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/include/nvrm/vmmu.h
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/metadata.c
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/metadata.h
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/metadata_vgpu_type.c
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/pf.h
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/rpc.c
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/vfio.h
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/vfio_access.c
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/vfio_debugfs.c
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/vfio_main.c
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/vfio_sysfs.c
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/vgpu.c
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h

-- 
2.34.1