On 6/4/2025 7:04 PM, Alexey Kardashevskiy wrote:
>
>
> On 30/5/25 18:32, Chenyi Qiang wrote:
>> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
>> discard") highlighted that subsystems like VFIO may disable RAM block
>> discard. However, guest_memfd relies on discard operations for page
>> conversion between private and shared memory, which can leave stale
>> IOMMU mappings behind when hardware devices are assigned to
>> confidential VMs via shared memory. To address this and allow shared
>> device assignment, it is crucial to ensure the VFIO system refreshes
>> its IOMMU mappings.
>>
>> RamDiscardManager is an existing interface (used by virtio-mem) to
>> adjust VFIO mappings in relation to VM page assignment. Effectively,
>> page conversion is similar to hot-removing a page in one mode and
>> adding it back in the other, so similar actions are required for page
>> conversion events. Introduce the RamDiscardManager to guest_memfd to
>> facilitate this process.
>>
>> Since guest_memfd is not an object, it cannot directly implement the
>> RamDiscardManager interface. Implementing it in HostMemoryBackend is
>> not appropriate either, because guest_memfd is per RAMBlock, and some
>> RAMBlocks have a memory backend while others do not. Notably, virtual
>> BIOS RAMBlocks using memory_region_init_ram_guest_memfd() do not have
>> a backend.
>>
>> To manage RAMBlocks with guest_memfd, define a new object named
>> RamBlockAttributes to implement the RamDiscardManager interface. This
>> object can store the guest_memfd information, such as the bitmap for
>> shared memory and the registered listeners for event notification. In
>> the context of RamDiscardManager, the shared state is analogous to
>> populated, and the private state is treated as discarded. To notify
>> about conversion events, a new state_change() helper is exported so
>> that users can notify listeners like VFIO, and VFIO can dynamically
>> DMA map/unmap the shared mappings.
>>
>> Note that the memory state is tracked at host page size granularity,
>> as the minimum conversion size can be one page per request, and VFIO
>> expects the DMA mapping for a specific iova to be mapped and unmapped
>> with the same granularity. Confidential VMs may perform partial
>> conversions, such as conversions on small regions within larger ones.
>> To prevent such invalid cases, and until support for cutting DMA
>> mappings is available, all operations are performed with 4K
>> granularity.
>>
>> In addition, memory conversion failures currently cause QEMU to quit
>> rather than resume the guest or retry the operation. Adding more
>> error handling or rollback mechanisms is future work, once conversion
>> failures are allowed. For example, in-place conversion of guest_memfd
>> could retry the unmap operation during the shared-to-private
>> conversion. For now, keep the complex error handling out of the
>> picture as it is not required.
>>
>> Signed-off-by: Chenyi Qiang <chenyi.qiang@xxxxxxxxx>

[...]

>> +
>> +int ram_block_attributes_state_change(RamBlockAttributes *attr,
>> +                                      uint64_t offset, uint64_t size,
>> +                                      bool to_discard)
>> +{
>> +    const size_t block_size = ram_block_attributes_get_block_size(attr);
>> +    const unsigned long first_bit = offset / block_size;
>> +    const unsigned long nbits = size / block_size;
>> +    bool is_range_discarded, is_range_populated;
>
> Can be reduced to "discarded" and "populated".

[...]

>
>> +    const uint64_t end = offset + size;
>> +    unsigned long bit;
>> +    uint64_t cur;
>> +    int ret = 0;
>> +
>> +    if (!ram_block_attributes_is_valid_range(attr, offset, size)) {
>> +        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
>> +                     __func__, offset, size);
>> +        return -EINVAL;
>> +    }
>> +
>> +    is_range_discarded = ram_block_attributes_is_range_discarded(attr, offset,
>> +                                                                 size);
>
> See - needlessly long line.

Yes, doing the rename can avoid the long line.
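[Editorial note: the is_range_discarded/is_range_populated pair quoted
above simply classifies the shared bitmap over a block range. A
standalone toy model of that classification, in plain C outside QEMU
with hypothetical names (one bit per host-page-sized block, bit set ==
shared/populated), might look like this:]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One bit per host-page-sized block; bit set == shared (populated). */
static bool toy_test_bit(const uint64_t *bitmap, unsigned long bit)
{
    return (bitmap[bit / 64] >> (bit % 64)) & 1;
}

typedef enum { TOY_DISCARDED, TOY_POPULATED, TOY_MIXTURE } ToyRangeState;

/* Classify [first_bit, first_bit + nbits) in a single pass. */
static ToyRangeState toy_classify_range(const uint64_t *bitmap,
                                        unsigned long first_bit,
                                        unsigned long nbits)
{
    unsigned long set = 0;

    for (unsigned long bit = first_bit; bit < first_bit + nbits; bit++) {
        set += toy_test_bit(bitmap, bit);
    }
    if (set == nbits) {
        return TOY_POPULATED;   /* entire range shared */
    }
    return set ? TOY_MIXTURE : TOY_DISCARDED;
}
```

[This single-pass shape is also why the two QEMU helpers look like
near-duplicates of each other, as noted below.]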
>
>> +    is_range_populated = ram_block_attributes_is_range_populated(attr, offset,
>> +                                                                 size);
>
> If ram_block_attributes_is_range_populated() returned
> (found_bit * block_size), you could tell from a single call whether
> the range is populated (found_bit == size) or discarded
> (found_bit == 0); otherwise it is a mix (and dump just this number in
> the tracepoint below).
>
> And then ditch ram_block_attributes_is_range_discarded(), which is
> practically cut-n-paste. And then open-code
> ram_block_attributes_is_range_populated().
>
> These two are not used elsewhere anyway.
>
>> +
>> +    trace_ram_block_attributes_state_change(offset, size,
>> +                                            is_range_discarded ? "discarded" :
>> +                                            is_range_populated ? "populated" :
>> +                                            "mixture",
>> +                                            to_discard ? "discarded" :
>> +                                            "populated");
>
> I'd just dump 3 numbers (is_range_discarded, is_range_populated,
> to_discard) in the tracepoint as:
>
> ram_block_attributes_state_change(uint64_t offset, uint64_t size, int discarded, int populated, int to_discard) "offset 0x%"PRIx64" size 0x%"PRIx64" discarded=%d populated=%d to_discard=%d"

Maybe a string of "from xxx to xxx" is more straightforward, and it can
also cover your information. Anyway, I don't think there is much
difference between them.

>
>> +    if (to_discard) {
>> +        if (is_range_discarded) {
>> +            /* Already private */
>> +        } else if (is_range_populated) {
>> +            /* Completely shared */
>> +            bitmap_clear(attr->bitmap, first_bit, nbits);
>> +            ram_block_attributes_notify_discard(attr, offset, size);
>> +        } else {
>> +            /* Unexpected mixture: process individual blocks */
>> +            for (cur = offset; cur < end; cur += block_size) {
>
> IMHO it is a little more accurate to write:
>
> for (bit = first_bit; bit < first_bit + nbits; ++bit) {
>
> as you have already calculated first_bit and nbits...
>
>> +                bit = cur / block_size;
>
> ... and drop this ...
>
>> +                if (!test_bit(bit, attr->bitmap)) {
>> +                    continue;
>> +                }
>> +                clear_bit(bit, attr->bitmap);
>> +                ram_block_attributes_notify_discard(attr, cur, block_size);
>
> ... and do:
> ram_block_attributes_notify_discard(attr, bit * block_size, block_size);
>
> Then you can drop @cur, which is used in only one place inside the
> loop.

Yes, it looks like that can save some lines.

>
>
>> +            }
>> +        }
>> +    } else {
>> +        if (is_range_populated) {
>> +            /* Already shared */
>> +        } else if (is_range_discarded) {
>> +            /* Complete private */
>
> s/Complete/Completely/
>
>> +            bitmap_set(attr->bitmap, first_bit, nbits);
>> +            ret = ram_block_attributes_notify_populate(attr, offset, size);
>> +        } else {
>> +            /* Unexpected mixture: process individual blocks */
>> +            for (cur = offset; cur < end; cur += block_size) {
>> +                bit = cur / block_size;
>> +                if (test_bit(bit, attr->bitmap)) {
>> +                    continue;
>> +                }
>> +                set_bit(bit, attr->bitmap);
>> +                ret = ram_block_attributes_notify_populate(attr, cur,
>> +                                                           block_size);
>> +                if (ret) {
>> +                    break;
>> +                }
>> +            }
>> +        }
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +RamBlockAttributes *ram_block_attributes_create(RAMBlock *ram_block)
>> +{
>> +    uint64_t bitmap_size;
>
> Not really needed.
>
>> +    const int block_size = qemu_real_host_page_size();
>> +    RamBlockAttributes *attr;
>> +    int ret;
>> +    MemoryRegion *mr = ram_block->mr;
>> +
>> +    attr = RAM_BLOCK_ATTRIBUTES(object_new(TYPE_RAM_BLOCK_ATTRIBUTES));
>> +
>> +    attr->ram_block = ram_block;
>> +    ret = memory_region_set_ram_discard_manager(mr, RAM_DISCARD_MANAGER(attr));
>> +    if (ret) {
>
> Could just be "if (memory_region_set_ram_discard_manager(...))".
>
>> +        object_unref(OBJECT(attr));
>> +        return NULL;
>> +    }
>> +    bitmap_size = ROUND_UP(mr->size, block_size) / block_size;
>> +    attr->bitmap_size = bitmap_size;
>> +    attr->bitmap = bitmap_new(bitmap_size);
>> +
>> +    return attr;
>> +}
>> +
>> +void ram_block_attributes_destroy(RamBlockAttributes *attr)
>> +{
>> +    if (!attr) {
>
>
> Rather g_assert().
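[Editorial note: the reviewer's suggestion above — iterate over bits
instead of byte offsets so @cur disappears — can be sketched as a
standalone toy outside QEMU. All names here are hypothetical; the toy
bitmap helpers mimic the kernel-style test_bit()/clear_bit(), and the
notify stub merely counts calls instead of unmapping DMA:]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { TOY_BLOCK_SIZE = 4096 };   /* stand-in for the host page size */

static bool toy_test_bit(const uint64_t *bitmap, unsigned long bit)
{
    return (bitmap[bit / 64] >> (bit % 64)) & 1;
}

static void toy_clear_bit(uint64_t *bitmap, unsigned long bit)
{
    bitmap[bit / 64] &= ~(1ULL << (bit % 64));
}

/* Stand-in for ram_block_attributes_notify_discard(): count the calls. */
static unsigned long toy_notify_discard_calls;

static void toy_notify_discard(uint64_t offset, uint64_t size)
{
    (void)offset;
    (void)size;
    toy_notify_discard_calls++;
}

/*
 * The "unexpected mixture" path of the to_discard branch, rewritten as
 * suggested: loop over [first_bit, first_bit + nbits) directly and
 * derive the byte offset only at the single point it is consumed.
 */
static void toy_discard_mixture(uint64_t *bitmap, unsigned long first_bit,
                                unsigned long nbits)
{
    for (unsigned long bit = first_bit; bit < first_bit + nbits; bit++) {
        if (!toy_test_bit(bitmap, bit)) {
            continue;   /* block is already private */
        }
        toy_clear_bit(bitmap, bit);
        toy_notify_discard((uint64_t)bit * TOY_BLOCK_SIZE, TOY_BLOCK_SIZE);
    }
}
```

[The populate-side mixture loop would take the same shape, with
set_bit() and a notify that can fail and break out of the loop.]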
>
>
>> +        return;
>> +    }
>> +
>> +    g_free(attr->bitmap);
>> +    memory_region_set_ram_discard_manager(attr->ram_block->mr, NULL);
>> +    object_unref(OBJECT(attr));
>> +}
>> +
>> +static void ram_block_attributes_init(Object *obj)
>> +{
>> +    RamBlockAttributes *attr = RAM_BLOCK_ATTRIBUTES(obj);
>> +
>> +    QLIST_INIT(&attr->rdl_list);
>> +}
>
> Not used.
>
>> +
>> +static void ram_block_attributes_finalize(Object *obj)
>
> Not used.
>
> Besides these two, feel free to ignore other comments :)

The init() and finalize() calls are used by the
OBJECT_DEFINE_SIMPLE_TYPE_WITH_INTERFACES() macro. It is a common
template. I guess you forgot this again [1] :)

The other comments look good to me, but they are only minor changes.
I'll see whether a new version is needed to carry them.

[1] https://lore.kernel.org/qemu-devel/89e791a5-b71b-4b9d-a8b4-e225bfbd1bc2@xxxxxxx/

>
> Otherwise,
>
> Tested-by: Alexey Kardashevskiy <aik@xxxxxxx>
> Reviewed-by: Alexey Kardashevskiy <aik@xxxxxxx>
>
>
>> +{
>> +}
>> +
>> +static void ram_block_attributes_class_init(ObjectClass *klass,
>> +                                            const void *data)
>> +{
>> +    RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(klass);
>> +
>> +    rdmc->get_min_granularity = ram_block_attributes_rdm_get_min_granularity;
>> +    rdmc->register_listener = ram_block_attributes_rdm_register_listener;
>> +    rdmc->unregister_listener = ram_block_attributes_rdm_unregister_listener;
>> +    rdmc->is_populated = ram_block_attributes_rdm_is_populated;
>> +    rdmc->replay_populated = ram_block_attributes_rdm_replay_populated;
>> +    rdmc->replay_discarded = ram_block_attributes_rdm_replay_discarded;
>> +}
>> diff --git a/system/trace-events b/system/trace-events
>> index be12ebfb41..82856e44f2 100644
>> --- a/system/trace-events
>> +++ b/system/trace-events
>> @@ -52,3 +52,6 @@ dirtylimit_state_finalize(void)
>>   dirtylimit_throttle_pct(int cpu_index, uint64_t pct, int64_t time_us) "CPU[%d] throttle percent: %" PRIu64 ", throttle adjust time %"PRIi64 " us"
>>   dirtylimit_set_vcpu(int cpu_index, uint64_t quota) "CPU[%d] set dirty page rate limit %"PRIu64
>>   dirtylimit_vcpu_execute(int cpu_index, int64_t sleep_time_us) "CPU[%d] sleep %"PRIi64 " us"
>> +
>> +# ram-block-attributes.c
>> +ram_block_attributes_state_change(uint64_t offset, uint64_t size, const char *from, const char *to) "offset 0x%"PRIx64" size 0x%"PRIx64" from '%s' to '%s'"
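[Editorial note: as described in the commit message, the point of the
notifications is that a VFIO-like consumer DMA-maps a range when it
becomes shared (populated) and unmaps it when it turns private
(discarded). A minimal sketch of such a consumer, with entirely
hypothetical names — the real code goes through RamDiscardListener
callbacks and the VFIO container — could look like this:]

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy listener mirroring what a VFIO-like consumer does with the
 * conversion notifications: it only tracks how many bytes would be
 * DMA-mapped at any point in time.
 */
typedef struct {
    uint64_t mapped_bytes;   /* total bytes currently DMA-mapped */
} ToyListener;

/* Range became shared: would DMA-map [offset, offset + size). */
static int toy_listener_populate(ToyListener *l, uint64_t offset,
                                 uint64_t size)
{
    (void)offset;
    l->mapped_bytes += size;
    return 0;                /* mapping may fail in the real world */
}

/* Range became private: would DMA-unmap the same range. */
static void toy_listener_discard(ToyListener *l, uint64_t offset,
                                 uint64_t size)
{
    (void)offset;
    l->mapped_bytes -= size;
}
```

[This also illustrates why VFIO needs populate and discard to use the
same granularity: the unmap must cover exactly the range that was
previously mapped.]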