On 4/9/2025 5:57 PM, Alexey Kardashevskiy wrote: > > > On 7/4/25 17:49, Chenyi Qiang wrote: >> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated >> discard") highlighted that subsystems like VFIO may disable RAM block >> discard. However, guest_memfd relies on discard operations for page >> conversion between private and shared memory, potentially leading to >> stale IOMMU mapping issue when assigning hardware devices to >> confidential VMs via shared memory. To address this, it is crucial to >> ensure systems like VFIO refresh its IOMMU mappings. >> >> PrivateSharedManager is introduced to manage private and shared states in >> confidential VMs, similar to RamDiscardManager, which supports >> coordinated RAM discard in VFIO. Integrating PrivateSharedManager with >> guest_memfd can facilitate the adjustment of VFIO mappings in response >> to page conversion events. >> >> Since guest_memfd is not an object, it cannot directly implement the >> PrivateSharedManager interface. Implementing it in HostMemoryBackend is >> not appropriate because guest_memfd is per RAMBlock, and some RAMBlocks >> have a memory backend while others do not. > > HostMemoryBackend::mr::ram_block::guest_memfd? > And there is HostMemoryBackendMemfd too. HostMemoryBackend is the parent of HostMemoryBackendMemfd. It is also possible to use HostMemoryBackendFile or HostMemoryBackendRAM. > >> Notably, virtual BIOS >> RAMBlocks using memory_region_init_ram_guest_memfd() do not have a >> backend. > > I thought private memory can be allocated from guest_memfd only. And it > is still not clear if this BIOS memory can be discarded or not, does it > change state during the VM lifetime? > (sorry I keep asking but I do not remember definitive answer). The BIOS region supports conversion as it is backed by guest_memfd. It can change the state but it never does during VM lifetime. > >> To manage RAMBlocks with guest_memfd, define a new object named >> RamBlockAttribute to implement the RamDiscardManager interface. This >> object stores guest_memfd information such as shared_bitmap, and handles >> page conversion notification. The memory state is tracked at the host >> page size granularity, as the minimum memory conversion size can be one >> page per request. Additionally, VFIO expects the DMA mapping for a >> specific iova to be mapped and unmapped with the same granularity. >> Confidential VMs may perform partial conversions, such as conversions on >> small regions within larger regions. To prevent invalid cases and until >> cut_mapping operation support is available, all operations are performed >> with 4K granularity. >> >> Signed-off-by: Chenyi Qiang <chenyi.qiang@xxxxxxxxx> >> --- >> Changes in v4: >> - Change the name from memory-attribute-manager to >> ram-block-attribute. >> - Implement the newly-introduced PrivateSharedManager instead of >> RamDiscardManager and change related commit message. >> - Define the new object in ramblock.h instead of adding a new file. >> >> Changes in v3: >> - Some rename (bitmap_size->shared_bitmap_size, >> first_one/zero_bit->first_bit, etc.) >> - Change shared_bitmap_size from uint32_t to unsigned >> - Return mgr->mr->ram_block->page_size in get_block_size() >> - Move set_ram_discard_manager() up to avoid a g_free() in failure >> case. >> - Add const for the memory_attribute_manager_get_block_size() >> - Unify the ReplayRamPopulate and ReplayRamDiscard and related >> callback. >> >> Changes in v2: >> - Rename the object name to MemoryAttributeManager >> - Rename the bitmap to shared_bitmap to make it more clear. >> - Remove block_size field and get it from a helper. In future, we >> can get the page_size from RAMBlock if necessary. >> - Remove the unncessary "struct" before GuestMemfdReplayData >> - Remove the unncessary g_free() for the bitmap >> - Add some error report when the callback failure for >> populated/discarded section. >> - Move the realize()/unrealize() definition to this patch. >> --- >> include/exec/ramblock.h | 24 +++ >> system/meson.build | 1 + >> system/ram-block-attribute.c | 282 +++++++++++++++++++++++++++++++++++ >> 3 files changed, 307 insertions(+) >> create mode 100644 system/ram-block-attribute.c >> >> diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h >> index 0babd105c0..b8b5469db9 100644 >> --- a/include/exec/ramblock.h >> +++ b/include/exec/ramblock.h >> @@ -23,6 +23,10 @@ >> #include "cpu-common.h" >> #include "qemu/rcu.h" >> #include "exec/ramlist.h" >> +#include "system/hostmem.h" >> + >> +#define TYPE_RAM_BLOCK_ATTRIBUTE "ram-block-attribute" >> +OBJECT_DECLARE_TYPE(RamBlockAttribute, RamBlockAttributeClass, >> RAM_BLOCK_ATTRIBUTE) >> struct RAMBlock { >> struct rcu_head rcu; >> @@ -90,5 +94,25 @@ struct RAMBlock { >> */ >> ram_addr_t postcopy_length; >> }; >> + >> +struct RamBlockAttribute { >> + Object parent; >> + >> + MemoryRegion *mr; >> + >> + /* 1-setting of the bit represents the memory is populated >> (shared) */ > > It is either RamBlockShared, or it is a "generic" RamBlockAttribute > implementing a bitmap with a bit per page and no special meaning > (shared/private or discarded/populated). And if it is a generic > RamBlockAttribute, then this hunk from 09/13 (which should be in this > patch) should look like: > > > --- a/include/exec/ramblock.h > +++ b/include/exec/ramblock.h > @@ -46,6 +46,7 @@ struct RAMBlock { > int fd; > uint64_t fd_offset; > int guest_memfd; > + RamBlockAttribute *ram_shared; // and not "ram_block_attribute" > > Thanks, I prefer generic RamBlockAttribute as we can extend to manage private/shared/discarded states in the future if necessary. With the same reason, I hope to keep the variable name of ram_block_attribute. > > >> + unsigned shared_bitmap_size; >> + unsigned long *shared_bitmap; >> + >> + QLIST_HEAD(, PrivateSharedListener) psl_list; >> +}; >> + >> +struct RamBlockAttributeClass { >> + ObjectClass parent_class; >> +}; >> + >> +int ram_block_attribute_realize(RamBlockAttribute *attr, MemoryRegion >> *mr); >> +void ram_block_attribute_unrealize(RamBlockAttribute *attr); >> + >> #endif >> #endif >> diff --git a/system/meson.build b/system/meson.build >> index 4952f4b2c7..50a5a64f1c 100644 >> --- a/system/meson.build >> +++ b/system/meson.build >> @@ -15,6 +15,7 @@ system_ss.add(files( >> 'dirtylimit.c', >> 'dma-helpers.c', >> 'globals.c', >> + 'ram-block-attribute.c', >> 'memory_mapping.c', >> 'qdev-monitor.c', >> 'qtest.c', >> diff --git a/system/ram-block-attribute.c b/system/ram-block-attribute.c >> new file mode 100644 >> index 0000000000..283c03b354 >> --- /dev/null >> +++ b/system/ram-block-attribute.c >> @@ -0,0 +1,282 @@ >> +/* >> + * QEMU ram block attribute >> + * >> + * Copyright Intel >> + * >> + * Author: >> + * Chenyi Qiang <chenyi.qiang@xxxxxxxxx> >> + * >> + * This work is licensed under the terms of the GNU GPL, version 2 or >> later. >> + * See the COPYING file in the top-level directory >> + * >> + */ >> + >> +#include "qemu/osdep.h" >> +#include "qemu/error-report.h" >> +#include "exec/ramblock.h" >> + >> +OBJECT_DEFINE_TYPE_WITH_INTERFACES(RamBlockAttribute, >> + ram_block_attribute, >> + RAM_BLOCK_ATTRIBUTE, >> + OBJECT, >> + { TYPE_PRIVATE_SHARED_MANAGER }, >> + { }) >> + >> +static size_t ram_block_attribute_get_block_size(const >> RamBlockAttribute *attr) >> +{ >> + /* >> + * Because page conversion could be manipulated in the size of at >> least 4K or 4K aligned, >> + * Use the host page size as the granularity to track the memory >> attribute. >> + */ >> + g_assert(attr && attr->mr && attr->mr->ram_block); >> + g_assert(attr->mr->ram_block->page_size == >> qemu_real_host_page_size()); >> + return attr->mr->ram_block->page_size; >> +} >> + >> + >> +static bool ram_block_attribute_psm_is_shared(const >> GenericStateManager *gsm, >> + const >> MemoryRegionSection *section) >> +{ >> + const RamBlockAttribute *attr = RAM_BLOCK_ATTRIBUTE(gsm); >> + const int block_size = ram_block_attribute_get_block_size(attr); >> + uint64_t first_bit = section->offset_within_region / block_size; >> + uint64_t last_bit = first_bit + int128_get64(section->size) / >> block_size - 1; >> + unsigned long first_discard_bit; >> + >> + first_discard_bit = find_next_zero_bit(attr->shared_bitmap, >> last_bit + 1, first_bit); >> + return first_discard_bit > last_bit; >> +} >> + >> +typedef int (*ram_block_attribute_section_cb)(MemoryRegionSection *s, >> void *arg); >> + >> +static int ram_block_attribute_notify_shared_cb(MemoryRegionSection >> *section, void *arg) >> +{ >> + StateChangeListener *scl = arg; >> + >> + return scl->notify_to_state_set(scl, section); >> +} >> + >> +static int ram_block_attribute_notify_private_cb(MemoryRegionSection >> *section, void *arg) >> +{ >> + StateChangeListener *scl = arg; >> + >> + scl->notify_to_state_clear(scl, section); >> + return 0; >> +} >> + >> +static int ram_block_attribute_for_each_shared_section(const >> RamBlockAttribute *attr, >> + >> MemoryRegionSection *section, >> + void *arg, >> + >> ram_block_attribute_section_cb cb) >> +{ >> + unsigned long first_bit, last_bit; >> + uint64_t offset, size; >> + const int block_size = ram_block_attribute_get_block_size(attr); >> + int ret = 0; >> + >> + first_bit = section->offset_within_region / block_size; >> + first_bit = find_next_bit(attr->shared_bitmap, attr- >> >shared_bitmap_size, first_bit); >> + >> + while (first_bit < attr->shared_bitmap_size) { >> + MemoryRegionSection tmp = *section; >> + >> + offset = first_bit * block_size; >> + last_bit = find_next_zero_bit(attr->shared_bitmap, attr- >> >shared_bitmap_size, >> + first_bit + 1) - 1; >> + size = (last_bit - first_bit + 1) * block_size; >> + >> + if (!memory_region_section_intersect_range(&tmp, offset, >> size)) { >> + break; >> + } >> + >> + ret = cb(&tmp, arg); >> + if (ret) { >> + error_report("%s: Failed to notify RAM discard listener: >> %s", __func__, >> + strerror(-ret)); >> + break; >> + } >> + >> + first_bit = find_next_bit(attr->shared_bitmap, attr- >> >shared_bitmap_size, >> + last_bit + 2); >> + } >> + >> + return ret; >> +} >> + >> +static int ram_block_attribute_for_each_private_section(const >> RamBlockAttribute *attr, >> + >> MemoryRegionSection *section, >> + void *arg, >> + >> ram_block_attribute_section_cb cb) >> +{ >> + unsigned long first_bit, last_bit; >> + uint64_t offset, size; >> + const int block_size = ram_block_attribute_get_block_size(attr); >> + int ret = 0; >> + >> + first_bit = section->offset_within_region / block_size; >> + first_bit = find_next_zero_bit(attr->shared_bitmap, attr- >> >shared_bitmap_size, >> + first_bit); >> + >> + while (first_bit < attr->shared_bitmap_size) { >> + MemoryRegionSection tmp = *section; >> + >> + offset = first_bit * block_size; >> + last_bit = find_next_bit(attr->shared_bitmap, attr- >> >shared_bitmap_size, >> + first_bit + 1) - 1; >> + size = (last_bit - first_bit + 1) * block_size; >> + >> + if (!memory_region_section_intersect_range(&tmp, offset, >> size)) { >> + break; >> + } >> + >> + ret = cb(&tmp, arg); >> + if (ret) { >> + error_report("%s: Failed to notify RAM discard listener: >> %s", __func__, >> + strerror(-ret)); >> + break; >> + } >> + >> + first_bit = find_next_zero_bit(attr->shared_bitmap, attr- >> >shared_bitmap_size, >> + last_bit + 2); >> + } >> + >> + return ret; >> +} >> + >> +static uint64_t ram_block_attribute_psm_get_min_granularity(const >> GenericStateManager *gsm, >> + const >> MemoryRegion *mr) >> +{ >> + const RamBlockAttribute *attr = RAM_BLOCK_ATTRIBUTE(gsm); >> + >> + g_assert(mr == attr->mr); >> + return ram_block_attribute_get_block_size(attr); >> +} >> + >> +static void >> ram_block_attribute_psm_register_listener(GenericStateManager *gsm, >> + >> StateChangeListener *scl, >> + >> MemoryRegionSection *section) >> +{ >> + RamBlockAttribute *attr = RAM_BLOCK_ATTRIBUTE(gsm); >> + PrivateSharedListener *psl = container_of(scl, >> PrivateSharedListener, scl); >> + int ret; >> + >> + g_assert(section->mr == attr->mr); >> + scl->section = memory_region_section_new_copy(section); >> + >> + QLIST_INSERT_HEAD(&attr->psl_list, psl, next); >> + >> + ret = ram_block_attribute_for_each_shared_section(attr, section, >> scl, >> + >> ram_block_attribute_notify_shared_cb); >> + if (ret) { >> + error_report("%s: Failed to register RAM discard listener: >> %s", __func__, >> + strerror(-ret)); >> + } >> +} >> + >> +static void >> ram_block_attribute_psm_unregister_listener(GenericStateManager *gsm, >> + >> StateChangeListener *scl) >> +{ >> + RamBlockAttribute *attr = RAM_BLOCK_ATTRIBUTE(gsm); >> + PrivateSharedListener *psl = container_of(scl, >> PrivateSharedListener, scl); >> + int ret; >> + >> + g_assert(scl->section); >> + g_assert(scl->section->mr == attr->mr); >> + >> + ret = ram_block_attribute_for_each_shared_section(attr, scl- >> >section, scl, >> + >> ram_block_attribute_notify_private_cb); >> + if (ret) { >> + error_report("%s: Failed to unregister RAM discard listener: >> %s", __func__, >> + strerror(-ret)); >> + } >> + >> + memory_region_section_free_copy(scl->section); >> + scl->section = NULL; >> + QLIST_REMOVE(psl, next); >> +} >> + >> +typedef struct RamBlockAttributeReplayData { >> + ReplayStateChange fn; >> + void *opaque; >> +} RamBlockAttributeReplayData; >> + >> +static int ram_block_attribute_psm_replay_cb(MemoryRegionSection >> *section, void *arg) >> +{ >> + RamBlockAttributeReplayData *data = arg; >> + >> + return data->fn(section, data->opaque); >> +} >> + >> +static int ram_block_attribute_psm_replay_on_shared(const >> GenericStateManager *gsm, >> + >> MemoryRegionSection *section, >> + ReplayStateChange >> replay_fn, >> + void *opaque) >> +{ >> + RamBlockAttribute *attr = RAM_BLOCK_ATTRIBUTE(gsm); >> + RamBlockAttributeReplayData data = { .fn = replay_fn, .opaque = >> opaque }; >> + >> + g_assert(section->mr == attr->mr); >> + return ram_block_attribute_for_each_shared_section(attr, section, >> &data, >> + >> ram_block_attribute_psm_replay_cb); >> +} >> + >> +static int ram_block_attribute_psm_replay_on_private(const >> GenericStateManager *gsm, >> + >> MemoryRegionSection *section, >> + >> ReplayStateChange replay_fn, >> + void *opaque) >> +{ >> + RamBlockAttribute *attr = RAM_BLOCK_ATTRIBUTE(gsm); >> + RamBlockAttributeReplayData data = { .fn = replay_fn, .opaque = >> opaque }; >> + >> + g_assert(section->mr == attr->mr); >> + return ram_block_attribute_for_each_private_section(attr, >> section, &data, >> + >> ram_block_attribute_psm_replay_cb); >> +} >> + >> +int ram_block_attribute_realize(RamBlockAttribute *attr, MemoryRegion >> *mr) >> +{ >> + uint64_t shared_bitmap_size; >> + const int block_size = qemu_real_host_page_size(); >> + int ret; >> + >> + shared_bitmap_size = ROUND_UP(mr->size, block_size) / block_size; >> + >> + attr->mr = mr; >> + ret = memory_region_set_generic_state_manager(mr, >> GENERIC_STATE_MANAGER(attr)); >> + if (ret) { >> + return ret; >> + } >> + attr->shared_bitmap_size = shared_bitmap_size; >> + attr->shared_bitmap = bitmap_new(shared_bitmap_size); >> + >> + return ret; >> +} >> + >> +void ram_block_attribute_unrealize(RamBlockAttribute *attr) >> +{ >> + g_free(attr->shared_bitmap); >> + memory_region_set_generic_state_manager(attr->mr, NULL); >> +} >> + >> +static void ram_block_attribute_init(Object *obj) >> +{ >> + RamBlockAttribute *attr = RAM_BLOCK_ATTRIBUTE(obj); >> + >> + QLIST_INIT(&attr->psl_list); >> +} >> + >> +static void ram_block_attribute_finalize(Object *obj) >> +{ >> +} >> + >> +static void ram_block_attribute_class_init(ObjectClass *oc, void *data) >> +{ >> + GenericStateManagerClass *gsmc = GENERIC_STATE_MANAGER_CLASS(oc); >> + >> + gsmc->get_min_granularity = >> ram_block_attribute_psm_get_min_granularity; >> + gsmc->register_listener = ram_block_attribute_psm_register_listener; >> + gsmc->unregister_listener = >> ram_block_attribute_psm_unregister_listener; >> + gsmc->is_state_set = ram_block_attribute_psm_is_shared; >> + gsmc->replay_on_state_set = >> ram_block_attribute_psm_replay_on_shared; >> + gsmc->replay_on_state_clear = >> ram_block_attribute_psm_replay_on_private; >> +} >