On Fri, Jul 11, 2025 at 12:36:24PM +0800, Yan Zhao wrote:
> On Thu, Jul 10, 2025 at 09:24:13AM -0700, Sean Christopherson wrote:
> > On Wed, Jul 09, 2025, Michael Roth wrote:
> > > On Thu, Jul 03, 2025 at 02:26:41PM +0800, Yan Zhao wrote:
> > > > Rather than invoking kvm_gmem_populate(), allow tdx_vcpu_init_mem_region()
> > > > to use open code to populate the initial memory region into the mirror page
> > > > table, and add the region to S-EPT.
> > > >
> > > > Background
> > > > ===
> > > > Sean initially suggested TDX to populate initial memory region in a 4-step
> > > > way [1]. Paolo refactored guest_memfd and introduced kvm_gmem_populate()
> > > > interface [2] to help TDX populate init memory region.
> > I wouldn't give my suggestion too much weight; I did qualify it with "Crazy idea."
> > after all :-)
> >
> > > > tdx_vcpu_init_mem_region
> > > >   guard(mutex)(&kvm->slots_lock)
> > > >   kvm_gmem_populate
> > > >     filemap_invalidate_lock(file->f_mapping)
> > > >     __kvm_gmem_get_pfn      //1. get private PFN
> > > >     post_populate           //tdx_gmem_post_populate
> > > >       get_user_pages_fast   //2. get source page
> > > >       kvm_tdp_map_page      //3. map private PFN to mirror root
> > > >       tdh_mem_page_add      //4. add private PFN to S-EPT and copy
> > > >                                  source page to it.
> > > >
> > > > kvm_gmem_populate() helps TDX to "get private PFN" in step 1. Its file
> > > > invalidate lock also helps ensure the private PFN remains valid when
> > > > tdh_mem_page_add() is invoked in TDX's post_populate hook.
> > > >
> > > > Though TDX does not need the folio prepration code, kvm_gmem_populate()
> > > > helps on sharing common code between SEV-SNP and TDX.
> > > >
> > > > Problem
> > > > ===
> > > > (1)
> > > > In Michael's series "KVM: gmem: 2MB THP support and preparedness tracking
> > > > changes" [4], kvm_gmem_get_pfn() was modified to rely on the filemap
> > > > invalidation lock for protecting its preparedness tracking. Similarly, the
> > > > in-place conversion version of guest_memfd series by Ackerly also requires
> > > > kvm_gmem_get_pfn() to acquire filemap invalidation lock [5].
> > > >
> > > > kvm_gmem_get_pfn
> > > >   filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
> > > >
> > > > However, since kvm_gmem_get_pfn() is called by kvm_tdp_map_page(), which is
> > > > in turn invoked within kvm_gmem_populate() in TDX, a deadlock occurs on the
> > > > filemap invalidation lock.
> > > Bringing the prior discussion over to here: it seems wrong that
> > > kvm_gmem_get_pfn() is getting called within the kvm_gmem_populate()
> > > chain, because:
> > >
> > > 1) kvm_gmem_populate() is specifically passing the gmem PFN down to
> > >    tdx_gmem_post_populate(), but we are throwing it away to grab it
> > >    again kvm_gmem_get_pfn(), which is then creating these locking issues
> > >    that we are trying to work around. If we could simply pass that PFN down
> > >    to kvm_tdp_map_page() (or some variant), then we would not trigger any
> > >    deadlocks in the first place.
> >
> > Yes, doing kvm_mmu_faultin_pfn() in tdx_gmem_post_populate() is a major flaw.
> >
> > > 2) kvm_gmem_populate() is intended for pre-boot population of guest
> > >    memory, and allows the post_populate callback to handle setting
> > >    up the architecture-specific preparation, whereas kvm_gmem_get_pfn()
> > >    calls kvm_arch_gmem_prepare(), which is intended to handle post-boot
> > >    setup of private memory. Having kvm_gmem_get_pfn() called as part of
> > >    kvm_gmem_populate() chain brings things 2 things in conflict with
> > >    each other, and TDX seems to be relying on that fact that it doesn't
> > >    implement a handler for kvm_arch_gmem_prepare().
> > >
> > > I don't think this hurts anything in the current code, and I don't
> > > personally see any issue with open-coding the population path if it doesn't
> > > fit TDX very well, but there was some effort put into making
> > > kvm_gmem_populate() usable for both TDX/SNP, and if the real issue isn't the
> > > design of the interface itself, but instead just some inflexibility on the
> > > KVM MMU mapping side, then it seems more robust to address the latter if
> > > possible.
> > >
> > > Would something like the below be reasonable?
> >
> > No, polluting the page fault paths is a non-starter for me. TDX really shouldn't
> > be synthesizing a page fault when it has the PFN in hand. And some of the behavior
> > that's desirable for pre-faults looks flat out wrong for TDX. E.g. returning '0'
> > on RET_PF_WRITE_PROTECTED and RET_PF_SPURIOUS (though maybe spurious is fine?).
> >
> > I would much rather special case this path, because it absolutely is a special
> > snowflake. This even eliminates several exports of low level helpers that frankly
> > have no business being used by TDX, e.g. kvm_mmu_reload().
> >
> > ---
> >  arch/x86/kvm/mmu.h         |  2 +-
> >  arch/x86/kvm/mmu/mmu.c     | 78 ++++++++++++++++++++++++++++++++++++--
> >  arch/x86/kvm/mmu/tdp_mmu.c |  1 -
> >  arch/x86/kvm/vmx/tdx.c     | 24 ++----------
> >  4 files changed, 78 insertions(+), 27 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index b4b6860ab971..9cd7a34333af 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -258,7 +258,7 @@ extern bool tdp_mmu_enabled;
> >  #endif
> >
> >  bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa);
> > -int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);
> > +int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn);
> >
> >  static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
> >  {
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 6e838cb6c9e1..bc937f8ed5a0 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4900,7 +4900,8 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >          return direct_page_fault(vcpu, fault);
> >  }
> >
> > -int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
> > +static int kvm_tdp_prefault_page(struct kvm_vcpu *vcpu, gpa_t gpa,
> > +                                 u64 error_code, u8 *level)
> >  {
> >          int r;
> >
> > @@ -4942,7 +4943,6 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
> >                  return -EIO;
> >          }
> >  }
> > -EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
> >
> >  long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> >                                      struct kvm_pre_fault_memory *range)
> > @@ -4978,7 +4978,7 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> >           * Shadow paging uses GVA for kvm page fault, so restrict to
> >           * two-dimensional paging.
> >           */
> > -        r = kvm_tdp_map_page(vcpu, range->gpa | direct_bits, error_code, &level);
> > +        r = kvm_tdp_prefault_page(vcpu, range->gpa | direct_bits, error_code, &level);
> >          if (r < 0)
> >                  return r;
> >
> > @@ -4990,6 +4990,77 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> >          return min(range->size, end - range->gpa);
> >  }
> >
> > +int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
> > +{
> > +        struct kvm_page_fault fault = {
> > +                .addr = gfn_to_gpa(gfn),
> > +                .error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS,
> > +                .prefetch = true,
> > +                .is_tdp = true,
> > +                .nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(vcpu->kvm),
> > +
> > +                .max_level = KVM_MAX_HUGEPAGE_LEVEL,
> > +                .req_level = PG_LEVEL_4K,
> kvm_mmu_hugepage_adjust() will replace the PG_LEVEL_4K here to PG_LEVEL_2M,
> because the private_max_mapping_level hook is only invoked in
> kvm_mmu_faultin_pfn_gmem().
>
> Updating lpage_info can fix it though.
>
> > +                .goal_level = PG_LEVEL_4K,
> > +                .is_private = true,
> > +
> > +                .gfn = gfn,
> > +                .slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn),
> > +                .pfn = pfn,
> > +                .map_writable = true,
> > +        };
> > +        struct kvm *kvm = vcpu->kvm;
> > +        int r;
> > +
> > +        lockdep_assert_held(&kvm->slots_lock);
> > +
> > +        if (KVM_BUG_ON(!tdp_mmu_enabled, kvm))
> > +                return -EIO;
> > +
> > +        if (kvm_gfn_is_write_tracked(kvm, fault.slot, fault.gfn))
> > +                return -EPERM;
> > +
> > +        r = kvm_mmu_reload(vcpu);
> > +        if (r)
> > +                return r;
> > +
> > +        r = mmu_topup_memory_caches(vcpu, false);
> > +        if (r)
> > +                return r;
> > +
> > +        do {
> > +                if (signal_pending(current))
> > +                        return -EINTR;
> > +
> > +                if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu))
> > +                        return -EIO;
> > +
> > +                cond_resched();
> > +
> > +                guard(read_lock)(&kvm->mmu_lock);
> > +
> > +                r = kvm_tdp_mmu_map(vcpu, &fault);
> > +        } while (r == RET_PF_RETRY);
> > +
> > +        if (r != RET_PF_FIXED)
> > +                return -EIO;
> > +
> > +        /*
> > +         * The caller is responsible for ensuring that no MMU invalidations can
> > +         * occur. Sanity check that the mapping hasn't been zapped.
> > +         */
> > +        if (IS_ENABLED(CONFIG_KVM_PROVE_MMU)) {
> > +                cond_resched();
> > +
> > +                scoped_guard(read_lock, &kvm->mmu_lock) {
> > +                        if (KVM_BUG_ON(!kvm_tdp_mmu_gpa_is_mapped(vcpu, fault.addr), kvm))
> > +                                return -EIO;
> > +                }
> > +        }
> > +        return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_tdp_mmu_map_private_pfn);
>
> Besides, it can't address the 2nd AB-BA lock issue as mentioned in the patch
> log:
>
> Problem
> ===
> ...
> (2)
> Moreover, in step 2, get_user_pages_fast() may acquire mm->mmap_lock,
> resulting in the following lock sequence in tdx_vcpu_init_mem_region():
> - filemap invalidation lock --> mm->mmap_lock
>
> However, in future code, the shared filemap invalidation lock will be held
> in kvm_gmem_fault_shared() (see [6]), leading to the lock sequence:
> - mm->mmap_lock --> filemap invalidation lock

I wouldn't expect kvm_gmem_fault_shared() to trigger for the
KVM_MEMSLOT_SUPPORTS_GMEM_SHARED case (or whatever we end up naming it).
There was some discussion during a previous guest_memfd upstream call
(May/June?) about whether to continue using kvm_gmem_populate() (or the
callback you hand it) to handle initializing memory contents before
in-place encryption, versus just expecting that userspace will initialize
the contents directly via mmap() prior to issuing any calls that trigger
kvm_gmem_populate().
I was planning on enforcing that the 'src' parameter to kvm_gmem_populate()
must be NULL for cases where KVM_MEMSLOT_SUPPORTS_GMEM_SHARED is set, and
to return -EINVAL if it isn't (rough sketch below), because:

  1) it avoids this awkward path you mentioned where kvm_gmem_fault_shared()
     triggers during kvm_gmem_populate()

  2) it makes no sense to have to copy anything from 'src' when we now
     support in-place update

For the SNP side, that will require a small API update for SNP_LAUNCH_UPDATE
that mandates that the corresponding 'uaddr' argument is ignored/disallowed
in favor of in-place initialization from userspace via mmap(). Not sure if
TDX would need a similar API update.

Would that work on the TDX side as well?

Thanks,

Mike
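Roughly the check I have in mind, as an untested sketch;
kvm_gmem_memslot_supports_shared() is just a placeholder for however the
in-place/shared support ends up being exposed to kvm_gmem_populate():

long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src,
                       long npages, kvm_gmem_populate_cb post_populate,
                       void *opaque)
{
        ...
        slot = gfn_to_memslot(kvm, start_gfn);
        ...
        /*
         * With in-place initialization, userspace writes the contents via
         * mmap() before calling in here, so there is nothing to copy from
         * 'src'. Rejecting a non-NULL 'src' up front also means we never
         * reach get_user_pages_fast() -> kvm_gmem_fault_shared() while
         * holding the filemap invalidation lock.
         */
        if (src && kvm_gmem_memslot_supports_shared(slot))
                return -EINVAL;

        filemap_invalidate_lock(file->f_mapping);
        ...
}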