Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()

Ira Weiny <ira.weiny@xxxxxxxxx> · Thu, 10 Jul 2025 20:41:08 -0500

Sean Christopherson wrote:
> On Wed, Jul 09, 2025, Michael Roth wrote:
> > On Thu, Jul 03, 2025 at 02:26:41PM +0800, Yan Zhao wrote:
> > > Rather than invoking kvm_gmem_populate(), allow tdx_vcpu_init_mem_region()
> > > to use open code to populate the initial memory region into the mirror page
> > > table, and add the region to S-EPT.
> > > 
> > > Background
> > > ===
> > > Sean initially suggested TDX to populate initial memory region in a 4-step
> > > way [1]. Paolo refactored guest_memfd and introduced kvm_gmem_populate()
> > > interface [2] to help TDX populate init memory region.
> 
> I wouldn't give my suggestion too much weight; I did qualify it with "Crazy idea."
> after all :-)
> 
> > > tdx_vcpu_init_mem_region
> > >     guard(mutex)(&kvm->slots_lock)
> > >     kvm_gmem_populate
> > >         filemap_invalidate_lock(file->f_mapping)
> > >             __kvm_gmem_get_pfn      //1. get private PFN
> > >             post_populate           //tdx_gmem_post_populate
> > >                 get_user_pages_fast //2. get source page
> > >                 kvm_tdp_map_page    //3. map private PFN to mirror root
> > >                 tdh_mem_page_add    //4. add private PFN to S-EPT and copy
> > >                                          source page to it.
> > > 
> > > kvm_gmem_populate() helps TDX to "get private PFN" in step 1. Its file
> > > invalidate lock also helps ensure the private PFN remains valid when
> > > tdh_mem_page_add() is invoked in TDX's post_populate hook.
> > > 
> > > Though TDX does not need the folio prepration code, kvm_gmem_populate()
> > > helps on sharing common code between SEV-SNP and TDX.
> > > 
> > > Problem
> > > ===
> > > (1)
> > > In Michael's series "KVM: gmem: 2MB THP support and preparedness tracking
> > > changes" [4], kvm_gmem_get_pfn() was modified to rely on the filemap
> > > invalidation lock for protecting its preparedness tracking. Similarly, the
> > > in-place conversion version of guest_memfd series by Ackerly also requires
> > > kvm_gmem_get_pfn() to acquire filemap invalidation lock [5].
> > > 
> > > kvm_gmem_get_pfn
> > >     filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
> > > 
> > > However, since kvm_gmem_get_pfn() is called by kvm_tdp_map_page(), which is
> > > in turn invoked within kvm_gmem_populate() in TDX, a deadlock occurs on the
> > > filemap invalidation lock.
> > 
> > Bringing the prior discussion over to here: it seems wrong that
> > kvm_gmem_get_pfn() is getting called within the kvm_gmem_populate()
> > chain, because:
> > 
> > 1) kvm_gmem_populate() is specifically passing the gmem PFN down to
> >    tdx_gmem_post_populate(), but we are throwing it away to grab it
> >    again kvm_gmem_get_pfn(), which is then creating these locking issues
> >    that we are trying to work around. If we could simply pass that PFN down
> >    to kvm_tdp_map_page() (or some variant), then we would not trigger any
> >    deadlocks in the first place.
> 
> Yes, doing kvm_mmu_faultin_pfn() in tdx_gmem_post_populate() is a major flaw.
> 
> > 2) kvm_gmem_populate() is intended for pre-boot population of guest
> >    memory, and allows the post_populate callback to handle setting
> >    up the architecture-specific preparation, whereas kvm_gmem_get_pfn()
> >    calls kvm_arch_gmem_prepare(), which is intended to handle post-boot
> >    setup of private memory. Having kvm_gmem_get_pfn() called as part of
> >    kvm_gmem_populate() chain brings things 2 things in conflict with
> >    each other, and TDX seems to be relying on that fact that it doesn't
> >    implement a handler for kvm_arch_gmem_prepare(). 
> > 
> > I don't think this hurts anything in the current code, and I don't
> > personally see any issue with open-coding the population path if it doesn't
> > fit TDX very well, but there was some effort put into making
> > kvm_gmem_populate() usable for both TDX/SNP, and if the real issue isn't the
> > design of the interface itself, but instead just some inflexibility on the
> > KVM MMU mapping side, then it seems more robust to address the latter if
> > possible.
> > 
> > Would something like the below be reasonable? 
> 
> No, polluting the page fault paths is a non-starter for me.  TDX really shouldn't
> be synthesizing a page fault when it has the PFN in hand.  And some of the behavior
> that's desirable for pre-faults looks flat out wrong for TDX.  E.g. returning '0'
> on RET_PF_WRITE_PROTECTED and RET_PF_SPURIOUS (though maybe spurious is fine?).
> 
> I would much rather special case this path, because it absolutely is a special
> snowflake.  This even eliminates several exports of low level helpers that frankly
> have no business being used by TDX, e.g. kvm_mmu_reload().

I'm not quite following what the code below is for.  Is it an addition to
Yan's patch to eliminate the use of kvm_gmem_populate() from TDX?
I don't see how this code helps with the lock invalidation so I think we
still need Yan's patch, correct?