On 6/5/2025 8:35 AM, Alexey Kardashevskiy wrote: > > > On 4/6/25 21:04, Alexey Kardashevskiy wrote: >> >> >> On 30/5/25 18:32, Chenyi Qiang wrote: >>> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated >>> discard") highlighted that subsystems like VFIO may disable RAM block >>> discard. However, guest_memfd relies on discard operations for page >>> conversion between private and shared memory, potentially leading to >>> the stale IOMMU mapping issue when assigning hardware devices to >>> confidential VMs via shared memory. To address this and allow shared >>> device assignement, it is crucial to ensure the VFIO system refreshes >>> its IOMMU mappings. >>> >>> RamDiscardManager is an existing interface (used by virtio-mem) to >>> adjust VFIO mappings in relation to VM page assignment. Effectively page >>> conversion is similar to hot-removing a page in one mode and adding it >>> back in the other. Therefore, similar actions are required for page >>> conversion events. Introduce the RamDiscardManager to guest_memfd to >>> facilitate this process. >>> >>> Since guest_memfd is not an object, it cannot directly implement the >>> RamDiscardManager interface. Implementing it in HostMemoryBackend is >>> not appropriate because guest_memfd is per RAMBlock, and some RAMBlocks >>> have a memory backend while others do not. Notably, virtual BIOS >>> RAMBlocks using memory_region_init_ram_guest_memfd() do not have a >>> backend. >>> >>> To manage RAMBlocks with guest_memfd, define a new object named >>> RamBlockAttributes to implement the RamDiscardManager interface. This >>> object can store the guest_memfd information such as bitmap for shared >>> memory and the registered listeners for event notification. In the >>> context of RamDiscardManager, shared state is analogous to populated, >>> and >>> private state is signified as discarded. To notify the conversion >>> events, >>> a new state_change() helper is exported for the users to notify the >>> listeners like VFIO, so that VFIO can dynamically DMA map/unmap the >>> shared mapping. >>> >>> Note that the memory state is tracked at the host page size granularity, >>> as the minimum conversion size can be one page per request and VFIO >>> expects the DMA mapping for a specific iova to be mapped and unmapped >>> with the same granularity. Confidential VMs may perform partial >>> conversions, such as conversions on small regions within larger ones. >>> To prevent such invalid cases and until DMA mapping cut operation >>> support is available, all operations are performed with 4K granularity. >>> >>> In addition, memory conversion failures cause QEMU to quit instead of >>> resuming the guest or retrying the operation at present. It would be >>> future work to add more error handling or rollback mechanisms once >>> conversion failures are allowed. For example, in-place conversion of >>> guest_memfd could retry the unmap operation during the conversion from >>> shared to private. For now, keep the complex error handling out of the >>> picture as it is not required. >>> >>> Signed-off-by: Chenyi Qiang <chenyi.qiang@xxxxxxxxx> >>> --- >>> Changes in v6: >>> - Change the object type name from RamBlockAttribute to >>> RamBlockAttributes. (David) >>> - Save the associated RAMBlock instead MemoryRegion in >>> RamBlockAttributes. (David) >>> - Squash the state_change() helper introduction in this commit as >>> well as the mixture conversion case handling. (David) >>> - Change the block_size type from int to size_t and some cleanup in >>> validation check. (Alexey) >>> - Add a tracepoint to track the state changes. (Alexey) >>> >>> Changes in v5: >>> - Revert to use RamDiscardManager interface instead of introducing >>> new hierarchy of class to manage private/shared state, and keep >>> using the new name of RamBlockAttribute compared with the >>> MemoryAttributeManager in v3. >>> - Use *simple* version of object_define and object_declare since >>> the >>> state_change() function is changed as an exported function >>> instead >>> of a virtual function in later patch. >>> - Move the introduction of RamBlockAttribute field to this patch >>> and >>> rename it to ram_shared. (Alexey) >>> - call the exit() when register/unregister failed. (Zhao) >>> - Add the ram-block-attribute.c to Memory API related part in >>> MAINTAINERS. >>> >>> Changes in v4: >>> - Change the name from memory-attribute-manager to >>> ram-block-attribute. >>> - Implement the newly-introduced PrivateSharedManager instead of >>> RamDiscardManager and change related commit message. >>> - Define the new object in ramblock.h instead of adding a new file. >>> --- >>> MAINTAINERS | 1 + >>> include/system/ramblock.h | 21 ++ >>> system/meson.build | 1 + >>> system/ram-block-attributes.c | 480 ++++++++++++++++++++++++++++++++++ >>> system/trace-events | 3 + >>> 5 files changed, 506 insertions(+) >>> create mode 100644 system/ram-block-attributes.c >>> >>> diff --git a/MAINTAINERS b/MAINTAINERS >>> index 6dacd6d004..8ec39aa7f8 100644 >>> --- a/MAINTAINERS >>> +++ b/MAINTAINERS >>> @@ -3149,6 +3149,7 @@ F: system/memory.c >>> F: system/memory_mapping.c >>> F: system/physmem.c >>> F: system/memory-internal.h >>> +F: system/ram-block-attributes.c >>> F: scripts/coccinelle/memory-region-housekeeping.cocci >>> Memory devices >>> diff --git a/include/system/ramblock.h b/include/system/ramblock.h >>> index d8a116ba99..1bab9e2dac 100644 >>> --- a/include/system/ramblock.h >>> +++ b/include/system/ramblock.h >>> @@ -22,6 +22,10 @@ >>> #include "exec/cpu-common.h" >>> #include "qemu/rcu.h" >>> #include "exec/ramlist.h" >>> +#include "system/hostmem.h" >>> + >>> +#define TYPE_RAM_BLOCK_ATTRIBUTES "ram-block-attributes" >>> +OBJECT_DECLARE_SIMPLE_TYPE(RamBlockAttributes, RAM_BLOCK_ATTRIBUTES) >>> struct RAMBlock { >>> struct rcu_head rcu; >>> @@ -91,4 +95,21 @@ struct RAMBlock { >>> ram_addr_t postcopy_length; >>> }; >>> +struct RamBlockAttributes { >>> + Object parent; >>> + >>> + RAMBlock *ram_block; >>> + >>> + /* 1-setting of the bitmap represents ram is populated (shared) */ >>> + unsigned bitmap_size; >>> + unsigned long *bitmap; >>> + >>> + QLIST_HEAD(, RamDiscardListener) rdl_list; >>> +}; >>> + >>> +RamBlockAttributes *ram_block_attributes_create(RAMBlock *ram_block); >>> +void ram_block_attributes_destroy(RamBlockAttributes *attr); >>> +int ram_block_attributes_state_change(RamBlockAttributes *attr, >>> uint64_t offset, >>> + uint64_t size, bool to_discard); >>> + >>> #endif >>> diff --git a/system/meson.build b/system/meson.build >>> index c2f0082766..2747dbde80 100644 >>> --- a/system/meson.build >>> +++ b/system/meson.build >>> @@ -17,6 +17,7 @@ libsystem_ss.add(files( >>> 'dma-helpers.c', >>> 'globals.c', >>> 'ioport.c', >>> + 'ram-block-attributes.c', >>> 'memory_mapping.c', >>> 'memory.c', >>> 'physmem.c', >>> diff --git a/system/ram-block-attributes.c b/system/ram-block- >>> attributes.c >>> new file mode 100644 >>> index 0000000000..514252413f >>> --- /dev/null >>> +++ b/system/ram-block-attributes.c >>> @@ -0,0 +1,480 @@ >>> +/* >>> + * QEMU ram block attributes >>> + * >>> + * Copyright Intel >>> + * >>> + * Author: >>> + * Chenyi Qiang <chenyi.qiang@xxxxxxxxx> >>> + * >>> + * This work is licensed under the terms of the GNU GPL, version 2 >>> or later. >>> + * See the COPYING file in the top-level directory >>> + * >>> + */ >>> + >>> +#include "qemu/osdep.h" >>> +#include "qemu/error-report.h" >>> +#include "system/ramblock.h" >>> +#include "trace.h" >>> + >>> +OBJECT_DEFINE_SIMPLE_TYPE_WITH_INTERFACES(RamBlockAttributes, >>> + ram_block_attributes, >>> + RAM_BLOCK_ATTRIBUTES, >>> + OBJECT, >>> + { TYPE_RAM_DISCARD_MANAGER }, >>> + { }) >>> + >>> +static size_t >>> +ram_block_attributes_get_block_size(const RamBlockAttributes *attr) >>> +{ >>> + /* >>> + * Because page conversion could be manipulated in the size of >>> at least 4K >>> + * or 4K aligned, Use the host page size as the granularity to >>> track the >>> + * memory attribute. >>> + */ >>> + g_assert(attr && attr->ram_block); >>> + g_assert(attr->ram_block->page_size == qemu_real_host_page_size()); >>> + return attr->ram_block->page_size; >>> +} >>> + >>> + >>> +static bool >>> +ram_block_attributes_rdm_is_populated(const RamDiscardManager *rdm, >>> + const MemoryRegionSection >>> *section) >>> +{ >>> + const RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm); >>> + const size_t block_size = >>> ram_block_attributes_get_block_size(attr); >>> + const uint64_t first_bit = section->offset_within_region / >>> block_size; >>> + const uint64_t last_bit = first_bit + int128_get64(section- >>> >size) / block_size - 1; >>> + unsigned long first_discarded_bit; >>> + >>> + first_discarded_bit = find_next_zero_bit(attr->bitmap, last_bit >>> + 1, >>> + first_bit); >>> + return first_discarded_bit > last_bit; >>> +} >>> + >>> +typedef int (*ram_block_attributes_section_cb)(MemoryRegionSection *s, >>> + void *arg); >>> + >>> +static int >>> +ram_block_attributes_notify_populate_cb(MemoryRegionSection *section, >>> + void *arg) >>> +{ >>> + RamDiscardListener *rdl = arg; >>> + >>> + return rdl->notify_populate(rdl, section); >>> +} >>> + >>> +static int >>> +ram_block_attributes_notify_discard_cb(MemoryRegionSection *section, >>> + void *arg) >>> +{ >>> + RamDiscardListener *rdl = arg; >>> + >>> + rdl->notify_discard(rdl, section); >>> + return 0; >>> +} >>> + >>> +static int >>> +ram_block_attributes_for_each_populated_section(const >>> RamBlockAttributes *attr, >>> + MemoryRegionSection >>> *section, >>> + void *arg, >>> + >>> ram_block_attributes_section_cb cb) >>> +{ >>> + unsigned long first_bit, last_bit; >>> + uint64_t offset, size; >>> + const size_t block_size = >>> ram_block_attributes_get_block_size(attr); >>> + int ret = 0; >>> + >>> + first_bit = section->offset_within_region / block_size; >>> + first_bit = find_next_bit(attr->bitmap, attr->bitmap_size, >>> + first_bit); >>> + >>> + while (first_bit < attr->bitmap_size) { >>> + MemoryRegionSection tmp = *section; >>> + >>> + offset = first_bit * block_size; >>> + last_bit = find_next_zero_bit(attr->bitmap, attr->bitmap_size, >>> + first_bit + 1) - 1; >>> + size = (last_bit - first_bit + 1) * block_size; >>> + >>> + if (!memory_region_section_intersect_range(&tmp, offset, >>> size)) { >>> + break; >>> + } >>> + >>> + ret = cb(&tmp, arg); >>> + if (ret) { >>> + error_report("%s: Failed to notify RAM discard listener: >>> %s", >>> + __func__, strerror(-ret)); >>> + break; >>> + } >>> + >>> + first_bit = find_next_bit(attr->bitmap, attr->bitmap_size, >>> + last_bit + 2); >>> + } >>> + >>> + return ret; >>> +} >>> + >>> +static int >>> +ram_block_attributes_for_each_discarded_section(const >>> RamBlockAttributes *attr, >>> + MemoryRegionSection >>> *section, >>> + void *arg, >>> + >>> ram_block_attributes_section_cb cb) >>> +{ >>> + unsigned long first_bit, last_bit; >>> + uint64_t offset, size; >>> + const size_t block_size = >>> ram_block_attributes_get_block_size(attr); >>> + int ret = 0; >>> + >>> + first_bit = section->offset_within_region / block_size; >>> + first_bit = find_next_zero_bit(attr->bitmap, attr->bitmap_size, >>> + first_bit); >>> + >>> + while (first_bit < attr->bitmap_size) { >>> + MemoryRegionSection tmp = *section; >>> + >>> + offset = first_bit * block_size; >>> + last_bit = find_next_bit(attr->bitmap, attr->bitmap_size, >>> + first_bit + 1) - 1; >>> + size = (last_bit - first_bit + 1) * block_size; >>> + >>> + if (!memory_region_section_intersect_range(&tmp, offset, >>> size)) { >>> + break; >>> + } >>> + >>> + ret = cb(&tmp, arg); >>> + if (ret) { >>> + error_report("%s: Failed to notify RAM discard listener: >>> %s", >>> + __func__, strerror(-ret)); >>> + break; >>> + } >>> + >>> + first_bit = find_next_zero_bit(attr->bitmap, >>> + attr->bitmap_size, >>> + last_bit + 2); >>> + } >>> + >>> + return ret; >>> +} >>> + >>> +static uint64_t >>> +ram_block_attributes_rdm_get_min_granularity(const RamDiscardManager >>> *rdm, >>> + const MemoryRegion *mr) >>> +{ >>> + const RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm); >>> + >>> + g_assert(mr == attr->ram_block->mr); >>> + return ram_block_attributes_get_block_size(attr); >>> +} >>> + >>> +static void >>> +ram_block_attributes_rdm_register_listener(RamDiscardManager *rdm, >>> + RamDiscardListener *rdl, >>> + MemoryRegionSection >>> *section) >>> +{ >>> + RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm); >>> + int ret; >>> + >>> + g_assert(section->mr == attr->ram_block->mr); >>> + rdl->section = memory_region_section_new_copy(section); >>> + >>> + QLIST_INSERT_HEAD(&attr->rdl_list, rdl, next); >>> + >>> + ret = ram_block_attributes_for_each_populated_section(attr, >>> section, rdl, >>> + >>> ram_block_attributes_notify_populate_cb); >>> + if (ret) { >>> + error_report("%s: Failed to register RAM discard listener: %s", >>> + __func__, strerror(-ret)); >>> + exit(1); >>> + } >>> +} >>> + >>> +static void >>> +ram_block_attributes_rdm_unregister_listener(RamDiscardManager *rdm, >>> + RamDiscardListener *rdl) >>> +{ >>> + RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm); >>> + int ret; >>> + >>> + g_assert(rdl->section); >>> + g_assert(rdl->section->mr == attr->ram_block->mr); >>> + >>> + if (rdl->double_discard_supported) { >>> + rdl->notify_discard(rdl, rdl->section); >>> + } else { >>> + ret = ram_block_attributes_for_each_populated_section(attr, >>> + rdl->section, rdl, >>> ram_block_attributes_notify_discard_cb); >>> + if (ret) { >>> + error_report("%s: Failed to unregister RAM discard >>> listener: %s", >>> + __func__, strerror(-ret)); >>> + exit(1); >>> + } >>> + } >>> + >>> + memory_region_section_free_copy(rdl->section); >>> + rdl->section = NULL; >>> + QLIST_REMOVE(rdl, next); >>> +} >>> + >>> +typedef struct RamBlockAttributesReplayData { >>> + ReplayRamDiscardState fn; >>> + void *opaque; >>> +} RamBlockAttributesReplayData; >>> + >>> +static int ram_block_attributes_rdm_replay_cb(MemoryRegionSection >>> *section, >>> + void *arg) >>> +{ >>> + RamBlockAttributesReplayData *data = arg; >>> + >>> + return data->fn(section, data->opaque); >>> +} >>> + >>> +static int >>> +ram_block_attributes_rdm_replay_populated(const RamDiscardManager *rdm, >>> + MemoryRegionSection *section, >>> + ReplayRamDiscardState >>> replay_fn, >>> + void *opaque) >>> +{ >>> + RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm); >>> + RamBlockAttributesReplayData data = { .fn = replay_fn, .opaque = >>> opaque }; >>> + >>> + g_assert(section->mr == attr->ram_block->mr); >>> + return ram_block_attributes_for_each_populated_section(attr, >>> section, &data, >>> + >>> ram_block_attributes_rdm_replay_cb); >>> +} >>> + >>> +static int >>> +ram_block_attributes_rdm_replay_discarded(const RamDiscardManager *rdm, >>> + MemoryRegionSection *section, >>> + ReplayRamDiscardState >>> replay_fn, >>> + void *opaque) >>> +{ >>> + RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(rdm); >>> + RamBlockAttributesReplayData data = { .fn = replay_fn, .opaque = >>> opaque }; >>> + >>> + g_assert(section->mr == attr->ram_block->mr); >>> + return ram_block_attributes_for_each_discarded_section(attr, >>> section, &data, >>> + >>> ram_block_attributes_rdm_replay_cb); >>> +} >>> + >>> +static bool >>> +ram_block_attributes_is_valid_range(RamBlockAttributes *attr, >>> uint64_t offset, >>> + uint64_t size) >>> +{ >>> + MemoryRegion *mr = attr->ram_block->mr; >>> + >>> + g_assert(mr); >>> + >>> + uint64_t region_size = memory_region_size(mr); >>> + const size_t block_size = >>> ram_block_attributes_get_block_size(attr); >>> + >>> + if (!QEMU_IS_ALIGNED(offset, block_size) || >>> + !QEMU_IS_ALIGNED(size, block_size)) { >>> + return false; >>> + } >>> + if (offset + size <= offset) { >>> + return false; >>> + } >>> + if (offset + size > region_size) { >>> + return false; >>> + } >>> + return true; >>> +} >>> + >>> +static void ram_block_attributes_notify_discard(RamBlockAttributes >>> *attr, >>> + uint64_t offset, >>> + uint64_t size) >>> +{ >>> + RamDiscardListener *rdl; >>> + >>> + QLIST_FOREACH(rdl, &attr->rdl_list, next) { >>> + MemoryRegionSection tmp = *rdl->section; >>> + >>> + if (!memory_region_section_intersect_range(&tmp, offset, >>> size)) { >>> + continue; >>> + } >>> + rdl->notify_discard(rdl, &tmp); >>> + } >>> +} >>> + >>> +static int >>> +ram_block_attributes_notify_populate(RamBlockAttributes *attr, >>> + uint64_t offset, uint64_t size) >>> +{ >>> + RamDiscardListener *rdl; >>> + int ret = 0; >>> + >>> + QLIST_FOREACH(rdl, &attr->rdl_list, next) { >>> + MemoryRegionSection tmp = *rdl->section; >>> + >>> + if (!memory_region_section_intersect_range(&tmp, offset, >>> size)) { >>> + continue; >>> + } >>> + ret = rdl->notify_populate(rdl, &tmp); >>> + if (ret) { >>> + break; >>> + } >>> + } >>> + >>> + return ret; >>> +} >>> + >>> +static bool >>> ram_block_attributes_is_range_populated(RamBlockAttributes *attr, >>> + uint64_t offset, >>> + uint64_t size) >>> +{ >>> + const size_t block_size = >>> ram_block_attributes_get_block_size(attr); >>> + const unsigned long first_bit = offset / block_size; >>> + const unsigned long last_bit = first_bit + (size / block_size) - 1; >>> + unsigned long found_bit; >>> + >>> + found_bit = find_next_zero_bit(attr->bitmap, last_bit + 1, >>> + first_bit); >>> + return found_bit > last_bit; >>> +} >>> + >>> +static bool >>> +ram_block_attributes_is_range_discarded(RamBlockAttributes *attr, >>> + uint64_t offset, uint64_t size) >>> +{ >>> + const size_t block_size = >>> ram_block_attributes_get_block_size(attr); >>> + const unsigned long first_bit = offset / block_size; >>> + const unsigned long last_bit = first_bit + (size / block_size) - 1; >>> + unsigned long found_bit; >>> + >>> + found_bit = find_next_bit(attr->bitmap, last_bit + 1, first_bit); >>> + return found_bit > last_bit; >>> +} >>> + >>> +int ram_block_attributes_state_change(RamBlockAttributes *attr, >>> + uint64_t offset, uint64_t size, >>> + bool to_discard) >>> +{ >>> + const size_t block_size = >>> ram_block_attributes_get_block_size(attr); >>> + const unsigned long first_bit = offset / block_size; >>> + const unsigned long nbits = size / block_size; >>> + bool is_range_discarded, is_range_populated; >> >> Can be reduced to "discarded" and "populated". >> >>> + const uint64_t end = offset + size; >>> + unsigned long bit; >>> + uint64_t cur; >>> + int ret = 0; >>> + >>> + if (!ram_block_attributes_is_valid_range(attr, offset, size)) { >>> + error_report("%s, invalid range: offset 0x%lx, size 0x%lx", >>> + __func__, offset, size); >>> + return -EINVAL; >>> + } >>> + >>> + is_range_discarded = >>> ram_block_attributes_is_range_discarded(attr, offset, >>> + size); >> >> See - needlessly long line. >> >>> + is_range_populated = >>> ram_block_attributes_is_range_populated(attr, offset, >>> + size); >> >> If ram_block_attributes_is_range_populated() returned >> (found_bit*block_size), you could tell from a single call if it is >> populated (found_bit == size) or discarded (found_bit == 0), otherwise >> it is a mix (and dump just this number in the tracepoint below). >> >> And then ditch ram_block_attributes_is_range_discarded() which is >> practically cut-n-paste. And then open code >> ram_block_attributes_is_range_populated(). > > oops, cannot just drop find_next_bit(), my bad, need both > find_next_bit() and find_next_zero_bit(). My point still stands though - Yes. a single call is insufficient as found_bit == 0 can also be the mixed case. > if this is coded right here without helpers - it will look simpler. Thanks, Do you mean unwrap the helper functions? Since there are no other callers, you are right, it can be simpler.