On Fri, Jul 11, 2025 at 2:18 PM Vishal Annapurve <vannapurve@xxxxxxxxxx> wrote:
>
> On Wed, Jul 9, 2025 at 6:30 PM Vishal Annapurve <vannapurve@xxxxxxxxxx> wrote:
> > > > 3) KVM should ideally associate the lifetime of backing
> > > > pagetables/protection tables/RMP tables with the lifetime of the
> > > > binding of memslots with guest_memfd.
> > >
> > > Again, please align your indentation.
> > >
> > > > - Today KVM SNP logic ties RMP table entry lifetimes with how
> > > > long the folios are mapped in guest_memfd, which I think should be
> > > > revisited.
> > >
> > > Why?  Memslots are ephemeral per-"struct kvm" mappings.  RMP entries
> > > and guest_memfd inodes are tied to the Virtual Machine, not to the
> > > "struct kvm" instance.
> >
> > IIUC, guest_memfd can only be accessed through the window of memslots,
> > and if there are no memslots I don't see the reason for memory still
> > being associated with the "virtual machine".  Likely because I am yet
> > to completely wrap my head around 'guest_memfd inodes are tied to the
> > Virtual Machine, not to the "struct kvm" instance', I need to spend
> > more time on this one.
>
> I see the benefits of tying inodes to the virtual machine and different
> guest_memfd files to different KVM instances.  This allows us to
> exercise intra-host migration usecases for TDX/SNP.  But I think this
> model doesn't allow us to reuse guest_memfd files for SNP VMs during
> reboot.
>
> Reboot scenario assuming reuse of the existing guest_memfd inode for
> the next instance:
> 1) Create a VM.
> 2) Create guest_memfd files that pin the KVM instance.
> 3) Create memslots.
> 4) Start the VM.
> 5) For reboot/shutdown, execute VM-specific termination (e.g.
>    KVM_TDX_TERMINATE_VM).
> 6) If allowed, delete the memslots.
> 7) Create a new VM instance.
> 8) Link the existing guest_memfd files to the new VM -> which creates
>    new files for the same inode.
> 9) Close the existing guest_memfd files and the existing VM.
> 10) Jump to step 3.
>
> The difference between SNP and TDX is that TDX memory ownership is
> limited to the duration the pages are mapped in the second-stage
> secure EPT tables, whereas SNP/RMP memory ownership lasts beyond
> memslots and effectively remains until folios are punched out from the
> guest_memfd filemap.  IIUC, CCA might follow suit with SNP in this
> regard, with the pfns populated in GPT entries.
>
> I don't have a sense of how critical this problem could be, but it
> would mean that on every reboot all large memory allocations have to
> be let go and reallocated.  For 1G support, we will be freeing
> guest_memfd pages using a background thread, which may add some delay
> in being able to free up the memory in time.
>
> Instead, if we did this:
> 1) Support creating guest_memfd files for a certain VM type that
>    allows KVM to dictate the behavior of the guest_memfd.
> 2) Tie the lifetime of KVM SNP/TDX memory ownership to guest_memfd and
>    memslot bindings.
>    - Each binding will increase a refcount on both the guest_memfd
>      file and KVM, so neither can go away while the binding exists.

I think if we can ensure that any guest_memfd-initiated interaction
with KVM is only for invalidation, is based on the binding, and happens
under filemap_invalidate_lock, then there is no need to pin KVM on each
binding: binding/unbinding itself is protected by
filemap_invalidate_lock, so KVM can't go away during invalidation.

> 3) For SNP/CCA, pfns are invalidated from RMP/GPT tables during unbind
>    operations, while for TDX, KVM will invalidate secure EPT entries.
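
To make the unbind-time invalidation in 3) and the locking argument
above a bit more concrete, below is a minimal sketch of what an unbind
path could look like under that scheme.  The names (struct
gmem_binding, kvm_gmem_invalidate_range()) are made up for illustration
and do not match the existing guest_memfd code; the point is only that
unbind and guest_memfd-initiated invalidation both serialize on
filemap_invalidate_lock(), so the binding's struct kvm cannot disappear
underneath an invalidation without a per-binding reference.

#include <linux/kvm_host.h>
#include <linux/list.h>
#include <linux/pagemap.h>	/* filemap_invalidate_lock() */
#include <linux/slab.h>

/* Hypothetical per-memslot binding tracked by the guest_memfd inode. */
struct gmem_binding {
	struct list_head list;
	struct kvm *kvm;
	struct kvm_memory_slot *slot;
	pgoff_t pgoff;
};

/*
 * Hypothetical arch hook: clears RMP/GPT entries for SNP/CCA, zaps
 * secure EPT entries for TDX.
 */
void kvm_gmem_invalidate_range(struct kvm *kvm, struct kvm_memory_slot *slot,
			       gfn_t start, gfn_t end);

static void gmem_unbind(struct inode *inode, struct gmem_binding *b)
{
	/*
	 * Both unbind and guest_memfd-initiated invalidation take
	 * filemap_invalidate_lock(), so the binding (and the struct kvm
	 * it points at) can't go away while an invalidation is walking
	 * the bindings -- no per-binding refcount on KVM needed.
	 */
	filemap_invalidate_lock(inode->i_mapping);

	/* Tear down memory ownership state before the binding disappears. */
	kvm_gmem_invalidate_range(b->kvm, b->slot, b->slot->base_gfn,
				  b->slot->base_gfn + b->slot->npages);

	list_del(&b->list);
	filemap_invalidate_unlock(inode->i_mapping);
	kfree(b);
}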
>
> This can allow us to decouple the memory lifecycle from the VM
> lifecycle and match the behavior of non-confidential VMs, where memory
> can outlast the VM.  Though this approach will mean a change in the
> intra-host migration implementation, as we would no longer need to
> differentiate guest_memfd files and inodes.
>
> That being said, I might be missing something here, and I don't have
> any data to back the criticality of this usecase for SNP and possibly
> CCA VMs.
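
For completeness, here is roughly what the reboot flow could look like
from userspace if guest_memfd were decoupled from the VM lifecycle as
proposed.  This is only a sketch of the proposed model: today the
guest_memfd file is created against, and tied to, a specific VM, so
re-binding the same fd to a new VM via KVM_SET_USER_MEMORY_REGION2 as
shown below is not possible as-is, and the confidential VM type and
VM-specific teardown details are elided.

#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Hypothetical reboot path assuming guest_memfd can outlive a VM and
 * be re-bound to the next instance purely via memslots.  Error
 * handling, the confidential VM type for KVM_CREATE_VM, and
 * VM-specific teardown (e.g. KVM_TDX_TERMINATE_VM) are omitted or
 * simplified.
 */
static int reboot_reusing_gmem(int kvm_fd, int old_vm_fd, int gmem_fd,
			       __u64 size)
{
	struct kvm_userspace_memory_region2 region = {
		.slot = 0,
		.flags = KVM_MEM_GUEST_MEMFD,
		.guest_phys_addr = 0,
		.memory_size = size,
		/* .userspace_addr for shared pages omitted for brevity. */
		.guest_memfd = gmem_fd,
		.guest_memfd_offset = 0,
	};
	int new_vm_fd;

	/*
	 * Tear down the old instance; the guest_memfd fd stays open, so
	 * its (potentially huge) allocations are not freed and refilled.
	 */
	close(old_vm_fd);

	/* Create the next instance and rebind the same guest_memfd. */
	new_vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0 /* VM type */);
	ioctl(new_vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);

	return new_vm_fd;
}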