On 5/26/2025 7:37 PM, Cédric Le Goater wrote:
> On 5/20/25 12:28, Chenyi Qiang wrote:
>> This is the v5 series of the shared device assignment support.
>>
>> As discussed in the v4 series [1], the GenericStateManager parent
>> class and PrivateSharedManager child interface were deemed to be in
>> the wrong direction. This series reverts back to the original single
>> RamDiscardManager interface and leaves it as future work to allow the
>> co-existence of multiple pairs of state management. For example, if
>> we want virtio-mem to co-exist with guest_memfd, a new framework will
>> be needed to combine the private/shared/discard states [2].
>>
>> Another change since the last version is the error handling of memory
>> conversion. Currently, a failure of kvm_convert_memory() causes QEMU
>> to quit instead of resuming the guest. The complex rollback operation
>> doesn't add value and merely adds code that is difficult to test. In
>> the future, however, more errors are likely to be encountered on
>> conversion paths, e.g. an unmap failure on a shared-to-private
>> in-place conversion. This series keeps complex error handling out of
>> the picture for now and attaches the related handling at the end of
>> the series for future extension.
>>
>> Apart from the above two items of future work, there is also some
>> optimization work ahead, i.e., using a more memory-efficient
>> mechanism to track ranges of contiguous states instead of a bitmap
>> [3]. This series still uses a bitmap for simplicity.
>>
>> The overview of this series:
>> - Patch 1-3: Preparation patches. These include function exposure and
>>   some definition changes to return values.
>> - Patch 4-5: Introduce a new object to implement the RamDiscardManager
>>   interface and a helper to notify the shared/private state change.
>> - Patch 6: Store the new object, including guest_memfd information, in
>>   RAMBlock. Register the RamDiscardManager instance with the target
>>   RAMBlock's MemoryRegion so that the RamDiscardManager users can run
>>   in the specific path.
>> - Patch 7: Unlock the coordinated discard so that shared device
>>   assignment (VFIO) can work with guest_memfd. After this patch, the
>>   basic device assignment functionality can work properly.
>> - Patch 8-9: Some cleanup work. Move the state change handling into a
>>   RamDiscardListener so that it can be invoked together with the VFIO
>>   listener by the state_change() call. This series drops the priority
>>   support from v4, which is required by in-place conversions, because
>>   the conversion path will likely change.
>> - Patch 10: More complex error handling, including rollback and the
>>   mixed-state conversion case.
>>
>> More small changes and details can be found in the individual patches.
>>
>> ---
>> Original cover letter:
>>
>> Background
>> ==========
>> Confidential VMs have two classes of memory: shared and private
>> memory. Shared memory is accessible from the host/VMM while private
>> memory is not. Confidential VMs can decide which memory is
>> shared/private and convert memory between shared/private at runtime.
>>
>> "guest_memfd" is a new kind of fd whose primary goal is to serve
>> guest private memory. In the current implementation, shared memory is
>> allocated with normal methods (e.g. mmap or fallocate) while private
>> memory is allocated from guest_memfd. When a VM performs memory
>> conversions, QEMU frees pages via madvise or via PUNCH_HOLE on memfd
>> or guest_memfd on one side, and allocates new pages from the other
>> side. This will cause the stale IOMMU mapping issue mentioned in [4]
>> when we try to enable shared device assignment in confidential VMs.
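For illustration, the discard step described above amounts to roughly
the following (a minimal sketch, not QEMU's actual code; the helper
name is hypothetical):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/mman.h>

    /*
     * Sketch of the per-conversion discard: the pages backing one side
     * of the conversion are freed, so any IOMMU mapping still pointing
     * at those pages goes stale.
     */
    static int discard_range(void *host_addr, int fd, off_t offset,
                             off_t len)
    {
        if (fd >= 0) {
            /* fd-backed memory (memfd/guest_memfd): punch a hole */
            return fallocate(fd,
                             FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                             offset, len);
        }
        /* anonymous memory: drop the pages */
        return madvise(host_addr, len, MADV_DONTNEED);
    }
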
>> Solution
>> ========
>> The key to enabling shared device assignment is to update the IOMMU
>> mappings on page conversion. RamDiscardManager, an existing interface
>> currently utilized by virtio-mem, offers a means to modify IOMMU
>> mappings in accordance with VM page assignment. Although the required
>> operations in VFIO for page conversion are similar to memory
>> plug/unplug, the states of private/shared are different from
>> discard/populated. We want a mechanism similar to RamDiscardManager,
>> but one used to manage the state of private and shared.
>>
>> This series introduces a new parent abstract class to manage a pair
>> of opposite states, with RamDiscardManager as its child to manage
>> populate/discard states, and introduces a new child class,
>> PrivateSharedManager, which can also utilize the same infrastructure
>> to notify VFIO of page conversions.
>>
>> Relationship with in-place page conversion
>> ==========================================
>> To support 1G pages for guest_memfd [5], the current direction is to
>> allow mmap() of guest_memfd to userspace so that both private and
>> shared memory can use the same physical pages as the backend. This
>> in-place page conversion design eliminates the need to discard pages
>> during shared/private conversions. However, device assignment will
>> still be blocked because the in-place page conversion will reject the
>> conversion when the page is pinned by VFIO.
>>
>> To address this, the key difference lies in the sequence of VFIO
>> map/unmap operations and the page conversion. It can be adjusted to
>> achieve unmap-before-conversion-to-private and
>> map-after-conversion-to-shared, ensuring compatibility with
>> guest_memfd.
>>
>> Limitation
>> ==========
>> One limitation is that VFIO expects the DMA mapping for a specific
>> IOVA to be mapped and unmapped with the same granularity. The guest
>> may perform partial conversions, such as converting a small region
>> within a larger region. To prevent such invalid cases, all operations
>> are performed at 4K granularity. This could be optimized after the
>> cut_mapping operation [6] is introduced in the future: we can always
>> perform a split-before-unmap if a partial conversion happens. If the
>> split succeeds, the unmap will succeed and be atomic; if the split
>> fails, the unmap fails.
>>
>> Testing
>> =======
>> This patch series is tested based on TDX patches available at:
>> KVM: https://github.com/intel/tdx/tree/kvm-coco-queue-snapshot/kvm-coco-queue-snapshot-20250408
>> QEMU: https://github.com/intel-staging/qemu-tdx/tree/tdx-upstream-snapshot-2025-05-20
>>
>> Because new features like the cut_mapping operation will only be
>> supported in iommufd, it is recommended to use the iommufd-backed
>> VFIO with the qemu command:
>
> Is it recommended or required ? If the VFIO IOMMU type1 backend is not
> supported for confidential VMs, QEMU should fail to start.

The VFIO IOMMU type1 backend is also supported, but the dma_entry_limit
parameter needs to be increased, as this series currently does the
map/unmap at 4K granularity.

>
> Please add Alex Williamson and me to the Cc: list.

Sure, will do in the next version.

>
> Thanks,
>
> C.
>
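For context on the dma_entry_limit point above: with 4K-granularity
conversions, each 4K chunk is mapped with its own VFIO_IOMMU_MAP_DMA
call, and each mapping consumes one entry against the vfio_iommu_type1
dma_entry_limit module parameter (default 65535), so a 4 GiB
convertible region alone can need up to 1,048,576 entries. A minimal
sketch under these assumptions (the helper name is hypothetical, not
this series' actual code):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /*
     * Map a newly-shared range in 4K chunks: one ioctl, and thus one
     * dma_entry_limit slot, per 4K page.
     */
    static int map_shared_range_4k(int container_fd, uint64_t iova,
                                   uint64_t vaddr, uint64_t size)
    {
        const uint64_t page_size = 4096;

        for (uint64_t off = 0; off < size; off += page_size) {
            struct vfio_iommu_type1_dma_map map = {
                .argsz = sizeof(map),
                .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
                .vaddr = vaddr + off,
                .iova  = iova + off,
                .size  = page_size,
            };

            if (ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map) < 0) {
                return -1; /* caller must unwind any partial mappings */
            }
        }
        return 0;
    }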