Sean Christopherson wrote:
> On Wed, Jul 09, 2025, Michael Roth wrote:
> > On Thu, Jul 03, 2025 at 02:26:41PM +0800, Yan Zhao wrote:
> > > Rather than invoking kvm_gmem_populate(), allow tdx_vcpu_init_mem_region()
> > > to use open code to populate the initial memory region into the mirror page
> > > table, and add the region to S-EPT.
> > >
> > > Background
> > > ===
> > > Sean initially suggested TDX to populate initial memory region in a 4-step
> > > way [1]. Paolo refactored guest_memfd and introduced kvm_gmem_populate()
> > > interface [2] to help TDX populate init memory region.
>
> I wouldn't give my suggestion too much weight; I did qualify it with "Crazy idea."
> after all :-)
>
> > > tdx_vcpu_init_mem_region
> > >     guard(mutex)(&kvm->slots_lock)
> > >     kvm_gmem_populate
> > >         filemap_invalidate_lock(file->f_mapping)
> > >             __kvm_gmem_get_pfn      //1. get private PFN
> > >             post_populate           //tdx_gmem_post_populate
> > >                 get_user_pages_fast //2. get source page
> > >                 kvm_tdp_map_page    //3. map private PFN to mirror root
> > >                 tdh_mem_page_add    //4. add private PFN to S-EPT and copy
> > >                                     //   source page to it.
> > >
> > > kvm_gmem_populate() helps TDX to "get private PFN" in step 1. Its file
> > > invalidate lock also helps ensure the private PFN remains valid when
> > > tdh_mem_page_add() is invoked in TDX's post_populate hook.
> > >
> > > Though TDX does not need the folio preparation code, kvm_gmem_populate()
> > > helps on sharing common code between SEV-SNP and TDX.
> > >
> > > Problem
> > > ===
> > > (1)
> > > In Michael's series "KVM: gmem: 2MB THP support and preparedness tracking
> > > changes" [4], kvm_gmem_get_pfn() was modified to rely on the filemap
> > > invalidation lock for protecting its preparedness tracking. Similarly, the
> > > in-place conversion version of guest_memfd series by Ackerley also requires
> > > kvm_gmem_get_pfn() to acquire filemap invalidation lock [5].
> > >
> > > kvm_gmem_get_pfn
> > >     filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
> > >
> > > However, since kvm_gmem_get_pfn() is called by kvm_tdp_map_page(), which is
> > > in turn invoked within kvm_gmem_populate() in TDX, a deadlock occurs on the
> > > filemap invalidation lock.
> >
> > Bringing the prior discussion over to here: it seems wrong that
> > kvm_gmem_get_pfn() is getting called within the kvm_gmem_populate()
> > chain, because:
> >
> > 1) kvm_gmem_populate() is specifically passing the gmem PFN down to
> >    tdx_gmem_post_populate(), but we are throwing it away to grab it
> >    again via kvm_gmem_get_pfn(), which is then creating these locking issues
> >    that we are trying to work around. If we could simply pass that PFN down
> >    to kvm_tdp_map_page() (or some variant), then we would not trigger any
> >    deadlocks in the first place.
>
> Yes, doing kvm_mmu_faultin_pfn() in tdx_gmem_post_populate() is a major flaw.
>
> > 2) kvm_gmem_populate() is intended for pre-boot population of guest
> >    memory, and allows the post_populate callback to handle setting
> >    up the architecture-specific preparation, whereas kvm_gmem_get_pfn()
> >    calls kvm_arch_gmem_prepare(), which is intended to handle post-boot
> >    setup of private memory. Having kvm_gmem_get_pfn() called as part of
> >    kvm_gmem_populate() chain brings these 2 things into conflict with
> >    each other, and TDX seems to be relying on the fact that it doesn't
> >    implement a handler for kvm_arch_gmem_prepare().
> >
> > I don't think this hurts anything in the current code, and I don't
> > personally see any issue with open-coding the population path if it doesn't
> > fit TDX very well, but there was some effort put into making
> > kvm_gmem_populate() usable for both TDX/SNP, and if the real issue isn't the
> > design of the interface itself, but instead just some inflexibility on the
> > KVM MMU mapping side, then it seems more robust to address the latter if
> > possible.
> >
> > Would something like the below be reasonable?
>
> No, polluting the page fault paths is a non-starter for me. TDX really shouldn't
> be synthesizing a page fault when it has the PFN in hand. And some of the behavior
> that's desirable for pre-faults looks flat out wrong for TDX. E.g. returning '0'
> on RET_PF_WRITE_PROTECTED and RET_PF_SPURIOUS (though maybe spurious is fine?).
>
> I would much rather special case this path, because it absolutely is a special
> snowflake. This even eliminates several exports of low level helpers that frankly
> have no business being used by TDX, e.g. kvm_mmu_reload().

I'm not quite following what the code below is for. Is it an addition to
Yan's patch to eliminate the use of kvm_gmem_populate() from TDX?

I don't see how this code helps with the lock invalidation so I think we
still need Yan's patch, correct?