Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls

On 1/7/25 00:19, Vishal Annapurve wrote:
On Sun, Jun 29, 2025 at 5:19 PM Alexey Kardashevskiy <aik@xxxxxxx> wrote:
...
============================

For IOMMU, could something like below work?

* A new UAPI to bind IOMMU FDs with guest_memfd ranges

Done that.
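
A minimal sketch of what such a bind UAPI could look like; the struct,
its field names and the ioctl number below are hypothetical, not the
interface that was actually posted:

/*
 * Hypothetical bind UAPI sketch: associate an IOMMUFD object with a
 * guest_memfd range so the IOMMU side can resolve pfns from it.
 * All names and the ioctl number are illustrative only.
 */
struct iommu_gmem_bind {
        __u32 size;             /* sizeof(struct iommu_gmem_bind) */
        __u32 flags;            /* must be 0 for now */
        __s32 guest_memfd;      /* fd from KVM_CREATE_GUEST_MEMFD */
        __u32 __reserved;
        __u64 gmem_offset;      /* page-aligned offset into the gmem file */
        __u64 length;           /* page-aligned length of the bound range */
};

#define IOMMU_GMEM_BIND _IOW('i', 0x99, struct iommu_gmem_bind)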

* VFIO_DMA_MAP/UNMAP operations modified to directly fetch pfns from
guest_memfd ranges using kvm_gmem_get_pfn()

This API imho should drop the confusing kvm_ prefix.

       -> KVM invokes kvm_gmem_is_private() to check the range's
shareability; the IOMMU could use the same, or we could add an API in
gmem that takes an access type and checks the shareability before
returning the pfn.
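
A sketch of that suggested accessor, assuming a gmem-owned helper whose
name, signature and internals are all hypothetical: the caller states
the kind of access it wants and gmem refuses when the range's
shareability disagrees.

enum gmem_access {
        GMEM_ACCESS_SHARED,     /* e.g. a non-secure IOMMU mapping */
        GMEM_ACCESS_PRIVATE,    /* e.g. a secure IOMMU / KVM private mapping */
};

/* Hypothetical: both helpers called below are made up for the sketch. */
int gmem_get_pfn(struct file *gmem_file, pgoff_t index,
                 enum gmem_access access, unsigned long *pfn,
                 struct page **page, int *max_order)
{
        bool is_private = gmem_range_is_private(gmem_file, index);

        /* Reject a mismatched access before handing out the pfn. */
        if (is_private != (access == GMEM_ACCESS_PRIVATE))
                return -EACCES;

        return gmem_fetch_pfn(gmem_file, index, pfn, page, max_order);
}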

Right now I have cut-and-pasted kvm_gmem_get_folio() (which is essentially filemap_lock_folio()/filemap_alloc_folio()/__filemap_add_folio()) to avoid new links between iommufd.ko and kvm.ko. Such a link is probably unavoidable though.
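
For reference, the duplicated sequence roughly has this shape (a
reconstruction for illustration, not the actual code; the real
kvm_gmem_get_folio() does more):

static struct folio *gmem_get_folio(struct inode *inode, pgoff_t index)
{
        struct address_space *mapping = inode->i_mapping;
        struct folio *folio;
        int err;

        /* Fast path: the folio is already in the page cache. */
        folio = filemap_lock_folio(mapping, index);
        if (!IS_ERR(folio))
                return folio;

        /* Slow path: allocate a fresh order-0 folio and insert it. */
        folio = filemap_alloc_folio(mapping_gfp_mask(mapping), 0);
        if (!folio)
                return ERR_PTR(-ENOMEM);

        /* __filemap_add_folio() expects the folio locked. */
        __folio_set_locked(folio);
        err = __filemap_add_folio(mapping, folio, index,
                                  mapping_gfp_mask(mapping), NULL);
        if (err) {
                __folio_clear_locked(folio);
                folio_put(folio);
                return ERR_PTR(err);
        }
        return folio;
}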

I don't think that's the way to avoid links between iommufd.ko and
kvm.ko. Cleaner way probably is to have gmem logic built-in and allow
runtime registration of invalidation callbacks from KVM/IOMMU
backends. Need to think about this more.

Yeah, otherwise iommufd.ko will have to install a hook in guest_memfd (== kvm.ko) at run time, which means more of the beloved symbol_get() :)
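
One possible shape for that runtime registration, assuming the gmem
core is built-in so that kvm.ko and iommufd.ko can register against it
without symbol_get(); every name here is made up:

/* Hypothetical notifier registration for a built-in gmem core. */
struct gmem_invalidate_ops {
        void (*invalidate)(void *priv, pgoff_t start, pgoff_t end);
};

struct gmem_notifier {
        const struct gmem_invalidate_ops *ops;
        void *priv;
        struct list_head node;  /* linked into the gmem file's notifier list */
};

int gmem_register_notifier(struct file *gmem_file, struct gmem_notifier *n);
void gmem_unregister_notifier(struct file *gmem_file, struct gmem_notifier *n);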

* IOMMU stack exposes an invalidation callback that can be invoked by
guest_memfd.

Private to Shared conversion via kvm_gmem_convert_range() -
      1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
on each bound memslot overlapping with the range.
      2) guest_memfd invokes kvm_gmem_convert_should_proceed(), which
actually unmaps the KVM SEPT/NPT entries.
            -> guest_memfd invokes the IOMMU invalidation callback to
zap the secure IOMMU entries.
      3) guest_memfd invokes kvm_gmem_execute_work(), which updates the
shareability and then splits the folios if needed.
      4) Userspace invokes an IOMMU map operation to map the ranges in
the non-secure IOMMU.

Shared to Private conversion via kvm_gmem_convert_range() -
      1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
on each bound memslot overlapping with the range.
      2) guest_memfd invokes kvm_gmem_convert_should_proceed(), which
actually unmaps the host mappings, which in turn unmaps the KVM
non-secure EPT/NPT entries.
            -> guest_memfd invokes the IOMMU invalidation callback to
zap the non-secure IOMMU entries.
      3) guest_memfd invokes kvm_gmem_execute_work(), which updates the
shareability and then merges the folios if needed.
      4) Userspace invokes an IOMMU map operation to map the ranges in
the secure IOMMU.
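
Condensing the two flows above into one control path, a rough sketch;
the function names are taken from the description, while the
signatures, the to_private flag and the invalidate_end pairing are
guesses for illustration only:

static int kvm_gmem_convert_range(struct file *file, pgoff_t start,
                                  pgoff_t end, bool to_private)
{
        int ret;

        /* 1) Notify every bound memslot overlapping the range. */
        kvm_gmem_invalidate_begin(file, start, end);

        /*
         * 2) Unmap: KVM SEPT/NPT for private->shared, host mappings
         *    (and with them the non-secure EPT/NPT) for
         *    shared->private; the IOMMU invalidation callback zaps
         *    the matching (secure or non-secure) IOMMU entries.
         */
        ret = kvm_gmem_convert_should_proceed(file, start, end, to_private);
        if (ret)
                goto out;

        /* 3) Flip the shareability, then split or merge folios. */
        ret = kvm_gmem_execute_work(file, start, end, to_private);

        /* 4) Userspace then maps the range in the other IOMMU. */
out:
        kvm_gmem_invalidate_end(file, start, end);
        return ret;
}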


Alright (although this zap+map is not necessary on the AMD hw).

IMO guest_memfd ideally should not directly interact with or cater to
arch-specific needs; it should implement a mechanism that works for all
archs. KVM/IOMMU implement invalidation callbacks and have all the
architecture-specific knowledge needed to take the right decisions.


Every page conversion will go through:

kvm-amd.ko -1-> guest_memfd (kvm.ko) -2-> iommufd.ko -3-> amd-iommu (built-in).

Which one decides that the IOMMU does not need (un)mapping? It has got to be (1), but then it needs to propagate the decision to amd-iommu (and we do not have (3) at the moment in that path).

If there is a need, guest_memfd can support two different callbacks:
1) Conversion notifier/callback invoked by guest_memfd during
conversion handling.
2) Invalidation notifier/callback invoked by guest_memfd during truncation.

IOMMUFD/KVM can handle the conversion callback/notifier as per the
needs of the underlying architecture, e.g. for TDX Connect do the
unmapping vs. for SEV Trusted IO skip the unmapping.

Invalidation callback/notifier will need to be handled by unmapping page tables.
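
The two-callback split could look roughly like this; the structure and
all names are hypothetical, only the behavioural split (arch-specific
conversion handling vs. mandatory unmap on truncation) comes from the
description above:

/* Hypothetical per-backend ops for the two-callback scheme. */
struct gmem_backend_ops {
        /*
         * Called on shared<->private conversion; arch-specific. TDX
         * Connect would unmap here, SEV Trusted IO could return
         * without touching its secure IOMMU mappings.
         */
        void (*convert)(void *priv, pgoff_t start, pgoff_t end,
                        bool to_private);

        /* Called on truncation: must unmap the page tables. */
        void (*invalidate)(void *priv, pgoff_t start, pgoff_t end);
};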


Or do we just always do unmap+map (and trigger unwanted huge page smashing)? All of this is doable and none of it is particularly horrible; I am trying to see where the consensus is now. Thanks,


I assume that by huge page smashing you mean huge page NPT mappings
getting split.

AFAIR, based on discussion with Michael during guest_memfd calls,
stage2 NPT entries need to be of the same granularity as RMP tables
for AMD SNP guests. i.e. huge page NPT mappings need to be smashed on
the KVM side during conversion. So today guest_memfd sends
invalidation notification to KVM for both conversion and truncation.
Doesn't the same constraint for keeping IOMMU page tables at the same
granularity as RMP tables hold for trusted IO?


Currently I handle this from KVM with a hack that fetches the IOPDE from the AMD IOMMU, so both the 2MB RMP entry and the IOPDE entries are smashed in one go in one of the many firmwares running on EPYC; at the moment this is too hacky to be posted even as an RFC. This likely needs to move to IOMMUFD then (via some callbacks), which could call the AMD IOMMU driver, which in turn would call that firmware (called the "TMPM"; it is not the PSP, which is the "TSM"), probably. Thanks,



--
Alexey