On 5/27/2025 3:35 PM, Alexey Kardashevskiy wrote:
>
>
> On 20/5/25 20:28, Chenyi Qiang wrote:
>> A new state_change() helper is introduced for RamBlockAttribute
>> to efficiently notify all registered RamDiscardListeners, including
>> VFIO listeners, about memory conversion events in guest_memfd. The VFIO
>> listener can dynamically DMA map/unmap shared pages based on conversion
>> types:
>> - For conversions from shared to private, the VFIO system ensures the
>>   discarding of shared mapping from the IOMMU.
>> - For conversions from private to shared, it triggers the population of
>>   the shared mapping into the IOMMU.
>>
>> Currently, memory conversion failures cause QEMU to quit instead of
>> resuming the guest or retrying the operation. It would be a future work
>> to add more error handling or rollback mechanisms once conversion
>> failures are allowed. For example, in-place conversion of guest_memfd
>> could retry the unmap operation during the conversion from shared to
>> private. However, for now, keep the complex error handling out of the
>> picture as it is not required:
>>
>> - If a conversion request is made for a page already in the desired
>>   state, the helper simply returns success.
>> - For requests involving a range partially in the desired state, there
>>   is no such scenario in practice at present. Simply return error.
>> - If a conversion request is declined by other systems, such as a
>>   failure from VFIO during notify_to_populated(), the failure is
>>   returned directly. As for notify_to_discard(), VFIO cannot fail
>>   unmap/unpin, so no error is returned.
>>
>> Note that the bitmap status is updated before callbacks, allowing
>> listeners to handle memory based on the latest status.
>>
>> Signed-off-by: Chenyi Qiang <chenyi.qiang@xxxxxxxxx>
>> ---
>> Change in v5:
>>     - Move the state_change() back to a helper instead of a callback of
>>       the class since there's no child for the RamBlockAttributeClass.
>>     - Remove the error handling and move them to an individual patch for
>>       simple management.
>>
>> Changes in v4:
>>     - Add the state_change() callback in PrivateSharedManagerClass
>>       instead of the RamBlockAttribute.
>>
>> Changes in v3:
>>     - Move the bitmap update before notifier callbacks.
>>     - Call the notifier callbacks directly in notify_discard/populate()
>>       with the expectation that the request memory range is in the
>>       desired attribute.
>>     - For the case that only partial range in the desire status, handle
>>       the range with block_size granularity for ease of rollback
>>       (https://lore.kernel.org/qemu-devel/812768d7-a02d-4b29-95f3-fb7a125cf54e@xxxxxxxxxx/)
>>
>> Changes in v2:
>>     - Do the alignment changes due to the rename to MemoryAttributeManager
>>     - Move the state_change() helper definition in this patch.
>> ---
>>   include/system/ramblock.h    |   2 +
>>   system/ram-block-attribute.c | 134 +++++++++++++++++++++++++++++++++++
>>   2 files changed, 136 insertions(+)
>>
>> diff --git a/include/system/ramblock.h b/include/system/ramblock.h
>> index 09255e8495..270dffb2f3 100644
>> --- a/include/system/ramblock.h
>> +++ b/include/system/ramblock.h
>> @@ -108,6 +108,8 @@ struct RamBlockAttribute {
>>       QLIST_HEAD(, RamDiscardListener) rdl_list;
>>   };
>> +int ram_block_attribute_state_change(RamBlockAttribute *attr, uint64_t offset,
>> +                                     uint64_t size, bool to_private);
>
> Not sure about the "to_private" name. I'd think private/shared is
> something KVM operates with and here, in RamBlock, it is
> discarded/populated.

Makes sense. To keep it consistent, I will rename it to to_discard.

>
>>   RamBlockAttribute *ram_block_attribute_create(MemoryRegion *mr);
>>   void ram_block_attribute_destroy(RamBlockAttribute *attr);
>> diff --git a/system/ram-block-attribute.c b/system/ram-block-attribute.c
>> index 8d4a24738c..f12dd4b881 100644
>> --- a/system/ram-block-attribute.c
>> +++ b/system/ram-block-attribute.c
>> @@ -253,6 +253,140 @@ ram_block_attribute_rdm_replay_discard(const RamDiscardManager *rdm,
>>                                                 ram_block_attribute_rdm_replay_cb);
>>   }
>> +static bool ram_block_attribute_is_valid_range(RamBlockAttribute *attr,
>> +                                               uint64_t offset, uint64_t size)
>> +{
>> +    MemoryRegion *mr = attr->mr;
>> +
>> +    g_assert(mr);
>> +
>> +    uint64_t region_size = memory_region_size(mr);
>> +    int block_size = ram_block_attribute_get_block_size(attr);
>
> It is size_t, not int.

Fixed this and all below. Thanks!

>
>> +
>> +    if (!QEMU_IS_ALIGNED(offset, block_size)) {
>
> Does not the @size have to be aligned too?

Yes. Actually, "start" and "size" are already checked for alignment in
kvm_convert_memory(), so I doubt we still need the check here. Anyway, in
case there are other users in the future, I'll add it.

>
>> +        return false;
>> +    }
>> +    if (offset + size < offset || !size) {
>
> This could be just (offset + size <= offset).
> (these overflow checks always blow up my little brain)

Modified.

>
>> +        return false;
>> +    }
>> +    if (offset >= region_size || offset + size > region_size) {
>
> Just (offset + size > region_size) should do.

Ditto.
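
A minimal sketch of how the range check might look with the three review comments above applied (size alignment, the single overflow comparison, and the single bounds comparison). range_is_valid() and its parameters are hypothetical stand-ins rather than the QEMU helper: the region size and block size are passed in directly, and the block size is assumed to be a power of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical standalone helper; not the actual QEMU function. */
static bool range_is_valid(uint64_t offset, uint64_t size,
                           uint64_t region_size, size_t block_size)
{
    /* The mask-based alignment test assumes a power-of-two block size. */
    assert(block_size && (block_size & (block_size - 1)) == 0);

    /* Both offset and size must be multiples of the block size. */
    if ((offset | size) & (uint64_t)(block_size - 1)) {
        return false;
    }
    /* One comparison rejects size == 0 and unsigned wrap-around. */
    if (offset + size <= offset) {
        return false;
    }
    /*
     * With size > 0 and no wrap-around, offset >= region_size already
     * implies offset + size > region_size, so one test is enough.
     */
    if (offset + size > region_size) {
        return false;
    }
    return true;
}
```

The order of the checks matters: the bounds comparison is only meaningful once the overflow comparison has ruled out wrap-around.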

>
>> +        return false;
>> +    }
>> +    return true;
>> +}
>> +
>> +static void ram_block_attribute_notify_to_discard(RamBlockAttribute *attr,
>> +                                                  uint64_t offset,
>> +                                                  uint64_t size)
>> +{
>> +    RamDiscardListener *rdl;
>> +
>> +    QLIST_FOREACH(rdl, &attr->rdl_list, next) {
>> +        MemoryRegionSection tmp = *rdl->section;
>> +
>> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
>> +            continue;
>> +        }
>> +        rdl->notify_discard(rdl, &tmp);
>> +    }
>> +}
>> +
>> +static int
>> +ram_block_attribute_notify_to_populated(RamBlockAttribute *attr,
>> +                                        uint64_t offset, uint64_t size)
>> +{
>> +    RamDiscardListener *rdl;
>> +    int ret = 0;
>> +
>> +    QLIST_FOREACH(rdl, &attr->rdl_list, next) {
>> +        MemoryRegionSection tmp = *rdl->section;
>> +
>> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
>> +            continue;
>> +        }
>> +        ret = rdl->notify_populate(rdl, &tmp);
>> +        if (ret) {
>> +            break;
>> +        }
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +static bool ram_block_attribute_is_range_populated(RamBlockAttribute *attr,
>> +                                                   uint64_t offset,
>> +                                                   uint64_t size)
>> +{
>> +    const int block_size = ram_block_attribute_get_block_size(attr);
>
> size_t.
>
>> +    const unsigned long first_bit = offset / block_size;
>> +    const unsigned long last_bit = first_bit + (size / block_size) - 1;
>> +    unsigned long found_bit;
>> +
>> +    /* We fake a shorter bitmap to avoid searching too far. */
>
> What is "fake" about it? We truthfully check here that every bit in
> [first_bit, last_bit] is set.

Aha, you ask this question again :)
(https://lore.kernel.org/qemu-devel/7131b4a3-a836-4efd-bcfc-982a0112ef05@xxxxxxxxx/)

If it is really confusing, let me remove this comment in the next version.
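
For reference, a minimal sketch of what the check questioned above does: it searches for the first clear bit, stopping at last_bit + 1, and the range counts as fully populated when nothing is found before that limit. next_zero_bit() is a hypothetical linear stand-in that only assumes the "return the limit when nothing is found" behaviour of find_next_zero_bit(); range_all_set() is likewise illustrative, not the QEMU function.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)

/* Return true if bit nr is set in the bitmap. */
static bool bit_is_set(const unsigned long *map, unsigned long nr)
{
    return (map[nr / BITS_PER_ULONG] >> (nr % BITS_PER_ULONG)) & 1UL;
}

/*
 * Hypothetical stand-in for find_next_zero_bit(map, end, start):
 * index of the first clear bit in [start, end), or end if none.
 */
static unsigned long next_zero_bit(const unsigned long *map,
                                   unsigned long end, unsigned long start)
{
    for (unsigned long i = start; i < end; i++) {
        if (!bit_is_set(map, i)) {
            return i;
        }
    }
    return end;
}

/* True only if every bit in [first_bit, first_bit + nbits - 1] is set. */
static bool range_all_set(const unsigned long *map,
                          unsigned long first_bit, unsigned long nbits)
{
    assert(nbits > 0);
    const unsigned long last_bit = first_bit + nbits - 1;

    /* Limiting the search to last_bit + 1 keeps it inside the range. */
    return next_zero_bit(map, last_bit + 1, first_bit) > last_bit;
}
```

Passing last_bit + 1 as the search limit is the "shorter bitmap" the original comment refers to: the search simply never looks past the requested range.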

>
>> +    found_bit = find_next_zero_bit(attr->bitmap, last_bit + 1,
>> +                                   first_bit);
>> +    return found_bit > last_bit;
>> +}
>> +
>> +static bool
>> +ram_block_attribute_is_range_discard(RamBlockAttribute *attr,
>> +                                     uint64_t offset, uint64_t size)
>> +{
>> +    const int block_size = ram_block_attribute_get_block_size(attr);
>
> size_t.
>
>> +    const unsigned long first_bit = offset / block_size;
>> +    const unsigned long last_bit = first_bit + (size / block_size) - 1;
>> +    unsigned long found_bit;
>> +
>> +    /* We fake a shorter bitmap to avoid searching too far. */
>> +    found_bit = find_next_bit(attr->bitmap, last_bit + 1, first_bit);
>> +    return found_bit > last_bit;
>> +}
>> +
>> +int ram_block_attribute_state_change(RamBlockAttribute *attr, uint64_t offset,
>> +                                     uint64_t size, bool to_private)
>> +{
>> +    const int block_size = ram_block_attribute_get_block_size(attr);
>
> size_t.
>
>> +    const unsigned long first_bit = offset / block_size;
>> +    const unsigned long nbits = size / block_size;
>> +    int ret = 0;
>> +
>> +    if (!ram_block_attribute_is_valid_range(attr, offset, size)) {
>> +        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
>> +                     __func__, offset, size);
>> +        return -1;
>
> May be -EINVAL?

Modified.

>
>> +    }
>> +
>> +    /* Already discard/populated */
>> +    if ((ram_block_attribute_is_range_discard(attr, offset, size) &&
>> +         to_private) ||
>> +        (ram_block_attribute_is_range_populated(attr, offset, size) &&
>> +         !to_private)) {
>
> A tracepoint would be useful here imho.

[...]

>
>> +        return 0;
>> +    }
>> +
>> +    /* Unexpected mixture */
>> +    if ((!ram_block_attribute_is_range_populated(attr, offset, size) &&
>> +         to_private) ||
>> +        (!ram_block_attribute_is_range_discard(attr, offset, size) &&
>> +         !to_private)) {
>> +        error_report("%s, the range is not all in the desired state: "
>> +                     "(offset 0x%lx, size 0x%lx), %s",
>> +                     __func__, offset, size,
>> +                     to_private ? "private" : "shared");
>> +        return -1;
>
> -EBUSY?

Maybe also -EINVAL, since the error is caused by the caller providing an
invalid mixed range? But anyway, according to the discussion in patch #10,
I'll add support for this mixture scenario, so there will be no need to
return the error.

>
>> +    }
>> +
>> +    if (to_private) {
>> +        bitmap_clear(attr->bitmap, first_bit, nbits);
>> +        ram_block_attribute_notify_to_discard(attr, offset, size);
>> +    } else {
>> +        bitmap_set(attr->bitmap, first_bit, nbits);
>> +        ret = ram_block_attribute_notify_to_populated(attr, offset, size);
>> +    }
>
> and a successful tracepoint here may be?

Good suggestion! I'll add a tracepoint in the next version.

>
>> +
>> +    return ret;
>> +}
>> +
>>   RamBlockAttribute *ram_block_attribute_create(MemoryRegion *mr)
>>   {
>>       uint64_t bitmap_size;
>
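
To summarize the control flow discussed in this thread, here is a toy model of the helper as posted in this version, with the to_discard rename and the -EINVAL return codes applied. It is a sketch only: struct toy_attr, toy_state_change(), the one-flag-per-block array, and the single notify callback are all hypothetical simplifications, and it does not include the mixed-range support planned for the next version.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NBLOCKS 8

/* Hypothetical listener callback; returns 0 on success. */
typedef int (*toy_notify_fn)(size_t first, size_t n, bool to_discard);

struct toy_attr {
    bool populated[TOY_NBLOCKS];  /* true == populated (shared) */
    toy_notify_fn notify;         /* optional; may be NULL */
};

/* Count populated blocks in [first, first + n). */
static size_t toy_count_populated(const struct toy_attr *a,
                                  size_t first, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        count += a->populated[first + i];
    }
    return count;
}

static int toy_state_change(struct toy_attr *a, size_t first, size_t n,
                            bool to_discard)
{
    if (n == 0 || first >= TOY_NBLOCKS || n > TOY_NBLOCKS - first) {
        return -EINVAL;                 /* invalid range */
    }

    const size_t pop = toy_count_populated(a, first, n);

    if (pop == (to_discard ? 0 : n)) {
        return 0;                       /* already in the desired state */
    }
    if (pop != (to_discard ? n : 0)) {
        return -EINVAL;                 /* mixed range, not handled here */
    }

    /* Update the state before notifying, as the commit message notes. */
    for (size_t i = 0; i < n; i++) {
        a->populated[first + i] = !to_discard;
    }
    return a->notify ? a->notify(first, n, to_discard) : 0;
}
```

Updating the state before the callback runs means a listener that inspects the state during notification sees the post-conversion view.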