On Thu, Jun 26, 2025 at 9:50 PM Alexey Kardashevskiy <aik@xxxxxxx> wrote:
>
> On 25/6/25 00:10, Vishal Annapurve wrote:
> > On Tue, Jun 24, 2025 at 6:08 AM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
> >>
> >> On Tue, Jun 24, 2025 at 06:23:54PM +1000, Alexey Kardashevskiy wrote:
> >>
> >>> Now, I am rebasing my RFC on top of this patchset and it fails in
> >>> kvm_gmem_has_safe_refcount() as IOMMU holds references to all these
> >>> folios in my RFC.
> >>>
> >>> So what is the expected sequence here? The userspace unmaps a DMA
> >>> page and maps it back right away, all from the userspace? The end
> >>> result will be exactly the same, which seems useless. And IOMMU TLB
> >
> > As Jason described, ideally the IOMMU, just like KVM, should just:
> > 1) Directly rely on guest_memfd for pinning -> no page refcounts taken
> > by the IOMMU stack
> > 2) Directly query pfns from guest_memfd for both shared/private ranges
> > 3) Implement an invalidation callback that guest_memfd can invoke on
> > conversions.

Conversions and truncations both.

> >
> > Current flow:
> > Private to Shared conversion via kvm_gmem_convert_range() -
> > 1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> > on each bound memslot overlapping with the range
> >    -> KVM has the concept of invalidation_begin() and end(),
> > which effectively ensures that between these function calls, no new
> > EPT/NPT entries can be added for the range.
> > 2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
> > actually unmaps the KVM SEPT/NPT entries.
> > 3) guest_memfd invokes kvm_gmem_execute_work() which updates the
> > shareability and then splits the folios if needed
> >
> > Shared to private conversion via kvm_gmem_convert_range() -
> > 1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> > on each bound memslot overlapping with the range
> > 2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
> > actually unmaps the host mappings, which will unmap the KVM non-secure
> > EPT/NPT entries.
> > 3) guest_memfd invokes kvm_gmem_execute_work() which updates the
> > shareability and then merges the folios if needed.
> >
> > ============================
> >
> > For IOMMU, could something like below work?
> >
> > * A new UAPI to bind IOMMU FDs with guest_memfd ranges
>
> Done that.
>
> > * VFIO_DMA_MAP/UNMAP operations modified to directly fetch pfns from
> > guest_memfd ranges using kvm_gmem_get_pfn()
>
> This API imho should drop the confusing kvm_ prefix.
>
> >    -> kvm invokes kvm_gmem_is_private() to check for the range
> > shareability; the IOMMU could use the same, or we could add an API in
> > gmem that takes in the access type and checks the shareability before
> > returning the pfn.
>
> Right now I cut-and-pasted kvm_gmem_get_folio() (which essentially is
> filemap_lock_folio()/filemap_alloc_folio()/__filemap_add_folio()) to
> avoid new links between iommufd.ko and kvm.ko. It is probably
> unavoidable though.

I don't think copy-pasting is the way to avoid links between iommufd.ko
and kvm.ko. A cleaner way is probably to have the gmem logic built in
and allow runtime registration of invalidation callbacks from the
KVM/IOMMU backends. Need to think about this more.

> > * IOMMU stack exposes an invalidation callback that can be invoked by
> > guest_memfd.
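
To make the "runtime registration" idea above a bit more concrete, I am
thinking of something along these lines (just a rough sketch, none of
these names exist today):

struct gmem_invalidator;

struct gmem_invalidate_ops {
	/*
	 * Called by guest_memfd before/after it changes the state of
	 * [start, end) (conversion or truncation). Backends (KVM secure
	 * and non-secure page tables, IOMMU page tables) zap their
	 * mappings in invalidate_begin() and must not install new ones
	 * until invalidate_end().
	 */
	void (*invalidate_begin)(struct gmem_invalidator *inv,
				 pgoff_t start, pgoff_t end);
	void (*invalidate_end)(struct gmem_invalidator *inv,
			       pgoff_t start, pgoff_t end);
};

struct gmem_invalidator {
	const struct gmem_invalidate_ops *ops;
	struct list_head node;	/* on a per guest_memfd inode list */
	void *private;		/* KVM or iommufd backend cookie */
};

/* Built into the gmem core so iommufd.ko does not need to link to kvm.ko. */
int gmem_register_invalidator(struct file *gmem_file,
			      struct gmem_invalidator *inv);
void gmem_unregister_invalidator(struct file *gmem_file,
				 struct gmem_invalidator *inv);

KVM would register one of these at memslot binding time and iommufd at
the new "bind IOMMU FD to guest_memfd range" UAPI, so guest_memfd only
ever walks a list of invalidators and never sees extra page references.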
> >
> > Private to Shared conversion via kvm_gmem_convert_range() -
> > 1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> > on each bound memslot overlapping with the range
> > 2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
> > actually unmaps the KVM SEPT/NPT entries.
> >    -> guest_memfd invokes the IOMMU invalidation callback to zap
> > the secure IOMMU entries.
> > 3) guest_memfd invokes kvm_gmem_execute_work() which updates the
> > shareability and then splits the folios if needed
> > 4) Userspace invokes an IOMMU map operation to map the ranges in the
> > non-secure IOMMU.
> >
> > Shared to private conversion via kvm_gmem_convert_range() -
> > 1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> > on each bound memslot overlapping with the range
> > 2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
> > actually unmaps the host mappings, which will unmap the KVM non-secure
> > EPT/NPT entries.
> >    -> guest_memfd invokes the IOMMU invalidation callback to zap the
> > non-secure IOMMU entries.
> > 3) guest_memfd invokes kvm_gmem_execute_work() which updates the
> > shareability and then merges the folios if needed.
> > 4) Userspace invokes an IOMMU map operation to map the ranges in the
> > secure IOMMU.
>
> Alright (although this zap+map is not necessary on the AMD hw).

IMO guest_memfd ideally should not directly interact with or cater to
arch-specific needs; it should implement a mechanism that works for all
archs. KVM/IOMMU implement the invalidation callbacks and have all the
architecture-specific knowledge to make the right decisions.

> >
> > There should be a way to block external IOMMU pagetable updates while
> > guest_memfd is performing a conversion, e.g. something like
> > kvm_invalidate_begin()/end().
> >
> >>> is going to be flushed on a page conversion anyway (the RMPUPDATE
> >>> instruction does that). All this is about AMD's x86 though.
> >>
> >> The iommu should not be using the VMA to manage the mapping. It should
> >
> > +1.
>
> Yeah, not doing this already, because I physically cannot map gmemfd's
> memory in the IOMMU via the VMA (which allocates memory via gup(), so
> the wrong memory is mapped in the IOMMU). Thanks,
>
> >> be directly linked to the guestmemfd in some way that does not disturb
> >> its operations. I imagine there would be some kind of invalidation
> >> callback directly to the iommu.
> >>
> >> Presumably that invalidation callback can include a reason for the
> >> invalidation (addr change, shared/private conversion, etc)
> >>
> >> I'm not sure how we will figure out which case is which but guestmemfd
> >> should allow the iommu to plug in either invalidation scheme..
> >>
> >> Probably invalidation should be a global to the FD thing, I imagine
> >> that once invalidation is established the iommu will not be
> >> incrementing page refcounts.
> >
> > +1.
>
> Alright. Thanks for the comments.
>
> >>
> >> Jason
>
> --
> Alexey
>
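
P.S. To make the "block external IOMMU pagetable updates while
guest_memfd is performing a conversion" point above a bit more
concrete: building on the registration sketch earlier in this mail
(again, these helpers are made up and not part of any posted series),
guest_memfd could bracket the whole conversion with the registered
invalidators, e.g.:

static void gmem_invalidate_begin_all(struct list_head *invalidators,
				      pgoff_t start, pgoff_t end)
{
	struct gmem_invalidator *inv;

	/* Zap KVM and IOMMU mappings; no new mappings until _end_all(). */
	list_for_each_entry(inv, invalidators, node)
		inv->ops->invalidate_begin(inv, start, end);
}

static void gmem_invalidate_end_all(struct list_head *invalidators,
				    pgoff_t start, pgoff_t end)
{
	struct gmem_invalidator *inv;

	/* Conversion finished; backends may map the range again. */
	list_for_each_entry(inv, invalidators, node)
		inv->ops->invalidate_end(inv, start, end);
}

i.e. kvm_gmem_convert_range() would call gmem_invalidate_begin_all()
where it calls kvm_gmem_invalidate_begin() today, do the unmap +
shareability update + split/merge, and only then call
gmem_invalidate_end_all(), so neither KVM nor the IOMMU can
re-establish mappings for the range in the middle of a conversion.
Userspace then re-maps the shared/private ranges into the corresponding
IOMMU afterwards, as described above.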