On 6/3/2025 5:10 AM, David Hildenbrand wrote: > On 30.05.25 10:32, Chenyi Qiang wrote: >> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated >> discard") highlighted that subsystems like VFIO may disable RAM block >> discard. However, guest_memfd relies on discard operations for page >> conversion between private and shared memory, potentially leading to >> the stale IOMMU mapping issue when assigning hardware devices to >> confidential VMs via shared memory. To address this and allow shared >> device assignement, it is crucial to ensure the VFIO system refreshes >> its IOMMU mappings. >> >> RamDiscardManager is an existing interface (used by virtio-mem) to >> adjust VFIO mappings in relation to VM page assignment. Effectively page >> conversion is similar to hot-removing a page in one mode and adding it >> back in the other. Therefore, similar actions are required for page >> conversion events. Introduce the RamDiscardManager to guest_memfd to >> facilitate this process. >> >> Since guest_memfd is not an object, it cannot directly implement the >> RamDiscardManager interface. Implementing it in HostMemoryBackend is >> not appropriate because guest_memfd is per RAMBlock, and some RAMBlocks >> have a memory backend while others do not. Notably, virtual BIOS >> RAMBlocks using memory_region_init_ram_guest_memfd() do not have a >> backend. >> >> To manage RAMBlocks with guest_memfd, define a new object named >> RamBlockAttributes to implement the RamDiscardManager interface. This >> object can store the guest_memfd information such as bitmap for shared >> memory and the registered listeners for event notification. In the >> context of RamDiscardManager, shared state is analogous to populated, and >> private state is signified as discarded. To notify the conversion events, >> a new state_change() helper is exported for the users to notify the >> listeners like VFIO, so that VFIO can dynamically DMA map/unmap the >> shared mapping. >> >> Note that the memory state is tracked at the host page size granularity, >> as the minimum conversion size can be one page per request and VFIO >> expects the DMA mapping for a specific iova to be mapped and unmapped >> with the same granularity. Confidential VMs may perform partial >> conversions, such as conversions on small regions within larger ones. >> To prevent such invalid cases and until DMA mapping cut operation >> support is available, all operations are performed with 4K granularity. >> >> In addition, memory conversion failures cause QEMU to quit instead of >> resuming the guest or retrying the operation at present. It would be >> future work to add more error handling or rollback mechanisms once >> conversion failures are allowed. For example, in-place conversion of >> guest_memfd could retry the unmap operation during the conversion from >> shared to private. For now, keep the complex error handling out of the >> picture as it is not required. >> >> Signed-off-by: Chenyi Qiang <chenyi.qiang@xxxxxxxxx> >> --- >> Changes in v6: >> - Change the object type name from RamBlockAttribute to >> RamBlockAttributes. (David) >> - Save the associated RAMBlock instead MemoryRegion in >> RamBlockAttributes. (David) >> - Squash the state_change() helper introduction in this commit as >> well as the mixture conversion case handling. (David) >> - Change the block_size type from int to size_t and some cleanup in >> validation check. (Alexey) >> - Add a tracepoint to track the state changes. (Alexey) >> >> Changes in v5: >> - Revert to use RamDiscardManager interface instead of introducing >> new hierarchy of class to manage private/shared state, and keep >> using the new name of RamBlockAttribute compared with the >> MemoryAttributeManager in v3. >> - Use *simple* version of object_define and object_declare since the >> state_change() function is changed as an exported function instead >> of a virtual function in later patch. >> - Move the introduction of RamBlockAttribute field to this patch and >> rename it to ram_shared. (Alexey) >> - call the exit() when register/unregister failed. (Zhao) >> - Add the ram-block-attribute.c to Memory API related part in >> MAINTAINERS. >> >> Changes in v4: >> - Change the name from memory-attribute-manager to >> ram-block-attribute. >> - Implement the newly-introduced PrivateSharedManager instead of >> RamDiscardManager and change related commit message. >> - Define the new object in ramblock.h instead of adding a new file. >> --- >> MAINTAINERS | 1 + >> include/system/ramblock.h | 21 ++ >> system/meson.build | 1 + >> system/ram-block-attributes.c | 480 ++++++++++++++++++++++++++++++++++ >> system/trace-events | 3 + >> 5 files changed, 506 insertions(+) >> create mode 100644 system/ram-block-attributes.c >> >> diff --git a/MAINTAINERS b/MAINTAINERS >> index 6dacd6d004..8ec39aa7f8 100644 >> --- a/MAINTAINERS >> +++ b/MAINTAINERS >> @@ -3149,6 +3149,7 @@ F: system/memory.c >> F: system/memory_mapping.c >> F: system/physmem.c >> F: system/memory-internal.h >> +F: system/ram-block-attributes.c >> F: scripts/coccinelle/memory-region-housekeeping.cocci >> Memory devices >> diff --git a/include/system/ramblock.h b/include/system/ramblock.h >> index d8a116ba99..1bab9e2dac 100644 >> --- a/include/system/ramblock.h >> +++ b/include/system/ramblock.h >> @@ -22,6 +22,10 @@ >> #include "exec/cpu-common.h" >> #include "qemu/rcu.h" >> #include "exec/ramlist.h" >> +#include "system/hostmem.h" >> + >> +#define TYPE_RAM_BLOCK_ATTRIBUTES "ram-block-attributes" >> +OBJECT_DECLARE_SIMPLE_TYPE(RamBlockAttributes, RAM_BLOCK_ATTRIBUTES) >> struct RAMBlock { >> struct rcu_head rcu; >> @@ -91,4 +95,21 @@ struct RAMBlock { >> ram_addr_t postcopy_length; >> }; >> +struct RamBlockAttributes { >> + Object parent; >> + >> + RAMBlock *ram_block; >> + >> + /* 1-setting of the bitmap represents ram is populated (shared) */ >> + unsigned bitmap_size; >> + unsigned long *bitmap; > > So, initially, all memory starts out as private, correct? Yes. > > I guess this mimics what kvm_set_phys_mem() ends up doing, when it does > the kvm_set_memory_attributes_private() call. > > So there is a short period of inconsistency, between creating the > RAMBlock and mapping it into the PA space. I initially had such a patch [1] to describe the inconsistency in RFC series. The bitmap was a 1-setting private bitmap at that time to keep consistent with guest_memfd and it required an explicit bitmap_fill(). [1] https://lore.kernel.org/qemu-devel/20240725072118.358923-6-chenyi.qiang@xxxxxxxxx/ > > It might be wroth spelling that out / documenting it somewhere. OK. I missed the above commit document after changing the bitmap to shared. I can add it back if need a new version. >