On Tue, May 20, 2025 at 09:13:25AM -0700, Sean Christopherson wrote:
> On Tue, May 20, 2025, Yan Zhao wrote:
> > On Mon, May 19, 2025 at 10:06:22AM -0700, Sean Christopherson wrote:
> > > On Mon, May 19, 2025, Rick P Edgecombe wrote:
> > > > On Mon, 2025-05-19 at 06:33 -0700, Sean Christopherson wrote:
> > > > > Was this hit by a real VMM?  If so, why is a TDX VMM removing a memslot without
> > > > > kicking vCPUs out of KVM?
> > > > >
> > > > > Regardless, I would prefer not to add a new RET_PF_* flag for this.  At a glance,
> > > > > KVM can simply drop and reacquire SRCU in the relevant paths.
> > > >
> > > > During the initial debugging and kicking around stage, this is the first
> > > > direction we looked. But kvm_gmem_populate() doesn't have SRCU locked, so then
> > > > kvm_tdp_map_page() tries to unlock without it being held (although that version
> > > > didn't check r == RET_PF_RETRY like you had). Yan had the following concerns and
> > > > came up with the version in this series, which we held review on for the list:
> > >
> > > Ah, I missed the kvm_gmem_populate() => kvm_tdp_map_page() chain.
> > >
> > > > > However, upon further consideration, I am reluctant to implement this fix for
> > >
> > > Which fix?
> > >
> > > > > the following reasons:
> > > > > - kvm_gmem_populate() already holds the kvm->slots_lock.
> > > > > - While retrying with SRCU unlock and lock can work around the
> > > > >   KVM_MEMSLOT_INVALID deadlock, it results in each kvm_vcpu_pre_fault_memory()
> > > > >   and tdx_handle_ept_violation() faulting with different memslot layouts.
> > >
> > > This behavior has existed since pretty much the beginning of KVM time.  TDX is the
> > > oddball that doesn't re-enter the guest.  All other flavors re-enter the guest on
> > > RET_PF_RETRY, which means dropping and reacquiring SRCU.  Which is why I don't like
> > > RET_PF_RETRY_INVALID_SLOT; it's simply handling the case we know about.
> > >
> > > Arguably, _TDX_ is buggy by not providing this behavior.
> > >
> > > > I'm not sure why the second one is really a problem. For the first one I think
> > > > that path could just take the SRCU lock in the proper order with kvm-
> > > > >slots_lock?
> > >
> > > Acquiring SRCU inside slots_lock should be fine.  The reverse order would be
> > > problematic, as KVM synchronizes SRCU while holding slots_lock.
> > >
> > > That said, I don't love the idea of grabbing SRCU, because it's so obviously a
> > > hack.  What about something like this?
> >
> > So you want to avoid acquiring SRCU in the kvm_gmem_populate() path?
>
> Yes, ideally.  Acquiring SRCU wouldn't be the end of the world, but I don't love
> the idea of taking a lock just so that the lock can be conditionally dropped in
> a common flow.  It's not a deal breaker (I wouldn't be surprised if there's at
> least one path in KVM that acquires SRCU purely because of such behavior), but
> separating kvm_tdp_prefault_page() from kvm_tdp_map_page()
>
> > Generally I think it's good, except that it missed a kvm_mmu_reload() (please
> > refer to my comment below) and the kvm_vcpu_srcu_read_{un}lock() pair in the
> > tdx_handle_ept_violation() path (so Reinette reported that it failed the TDX
> > stress tests at [1]).
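To make the kvm_mmu_reload() point concrete, here's an untested sketch of what
the retry loop in your kvm_tdp_prefault_page() (quoted below) would need on top
of dropping SRCU, assuming kvm_mmu_reload() also takes care of freeing obsolete
roots (see the commit reference further down):

        /*
         * Untested sketch only: the retry loop from the quoted
         * kvm_tdp_prefault_page() below, with an MMU reload added so that
         * is_page_fault_stale() doesn't fail every retry once the memslot
         * update bumps the generation and invalidates the old root.
         */
        for (;;) {
                r = kvm_tdp_map_page(vcpu, gpa, error_code, level);
                if (r != -EAGAIN)
                        break;

                /* Drop SRCU so the pending memslot update can complete. */
                kvm_vcpu_srcu_read_unlock(vcpu);
                kvm_vcpu_srcu_read_lock(vcpu);

                /* Pick up a fresh root if the old one was invalidated. */
                r = kvm_mmu_reload(vcpu);
                if (r)
                        break;
        }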
> > > @@ -4891,6 +4884,28 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
> > >  }
> > >  EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
> > >
> > > +int kvm_tdp_prefault_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
> > > +{
> > > +	int r;
> > > +
> > > +	/*
> > > +	 * Restrict to TDP page fault, since that's the only case where the MMU
> > > +	 * is indexed by GPA.
> > > +	 */
> > > +	if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
> > > +		return -EOPNOTSUPP;
> > > +
> > > +	for (;;) {
> > > +		r = kvm_tdp_map_page(vcpu, gpa, error_code, level);
> > > +		if (r != -EAGAIN)
> > > +			break;
> > > +
> > > +		/* Comment goes here. */
> > > +		kvm_vcpu_srcu_read_unlock(vcpu);
> > > +		kvm_vcpu_srcu_read_lock(vcpu);
> >
> > For the hang in the pre_fault_memory_test reported by Reinette [1], it's because
> > the memslot removal succeeds after releasing the SRCU, and the old root is then
> > stale. So kvm_mmu_reload() is required here to prevent is_page_fault_stale()
> > from always being true.
>
> That wouldn't suffice, KVM would also need to process KVM_REQ_MMU_FREE_OBSOLETE_ROOTS,
> otherwise kvm_mmu_reload() will do nothing.

In commit 20a6cff3b283 ("KVM: x86/mmu: Check and free obsolete roots in
kvm_mmu_reload()"), KVM_REQ_MMU_FREE_OBSOLETE_ROOTS is processed in
kvm_mmu_reload().

> Thinking about this scenario more, I don't mind punting this problem to userspace
> for KVM_PRE_FAULT_MEMORY because there's no existing behavior/ABI to uphold, and
> because the complexity vs. ABI tradeoffs are heavily weighted in favor of punting
> to userspace.  Whereas for KVM_RUN, KVM can't change existing behavior without
> breaking userspace, should provide consistent behavior regardless of VM type, and
> KVM needs the "complex" code irrespective of this particular scenario.
>
> I especially like punting to userspace if KVM returns -EAGAIN, not -ENOENT,
> because then KVM is effectively providing the same overall behavior as KVM_RUN,
> just with slightly different roles and responsibilities between KVM and
> userspace.  And -ENOENT is also flat out wrong for the case where a memslot is
> being moved, but the new base+size still contains the to-be-faulted GPA.
>
> I still don't like RET_PF_RETRY_INVALID_SLOT, because that bleeds gory MMU details
> into the rest of KVM, but KVM can simply return -EAGAIN if an invalid memslot is
> encountered during prefault (as identified by fault->prefetch).
>
> For TDX though, tdx_handle_ept_violation() needs to play nice with the scenario,
> i.e. punting to userspace is not a viable option.  But that path also has options
> that aren't available to prefaulting.  E.g. it could (and probably should) break
> early if a request is pending instead of special casing KVM_REQ_VM_DEAD, which

Hmm, for TDX, there's no KVM_REQ_MMU_FREE_OBSOLETE_ROOTS request on slot removal
(see commit aa8d1f48d353 ("KVM: x86/mmu: Introduce a quirk to control memslot
zap behavior")).

> would take care of the KVM_REQ_MMU_FREE_OBSOLETE_ROOTS scenario.  And as Rick
> called out, the zero-step mess really needs to be solved in a more robust fashion.
>
> So this?

Looks good to me for the non-TDX side. For TDX, could we apply the fix below on
top of your change? For a private fault, -EFAULT will be returned to userspace
after the retry anyway once the slot is completely removed, unlike non-private
faults, which go down the emulation path after the retry.
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4602,6 +4602,11 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
 		if (fault->prefetch)
 			return -EAGAIN;
 
+		if (fault->is_private) {
+			kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+			return -EFAULT;
+		}
+
 		return RET_PF_RETRY;
 	}

And would you mind if I included your patch in my next version? I can update
the related selftests as well.

Thanks
Yan

> ---
> From: Sean Christopherson <seanjc@xxxxxxxxxx>
> Date: Tue, 20 May 2025 07:55:32 -0700
> Subject: [PATCH] KVM: x86/mmu: Return -EAGAIN if userspace deletes/moves
>  memslot during prefault
>
> Return -EAGAIN if userspace attempts to delete or move a memslot while also
> prefaulting memory for that same memslot, i.e. force userspace to retry
> instead of trying to handle the scenario entirely within KVM.  Unlike
> KVM_RUN, which needs to handle the scenario entirely within KVM because
> userspace has come to depend on such behavior, KVM_PRE_FAULT_MEMORY can
> return -EAGAIN without breaking userspace as this scenario can't have ever
> worked (and there's no sane use case for prefaulting to a memslot that's
> being deleted/moved).
>
> And also unlike KVM_RUN, the prefault path doesn't naturally guarantee
> forward progress.  E.g. to handle such a scenario, KVM would need to drop
> and reacquire SRCU to break the deadlock between the memslot update
> (synchronizes SRCU) and the prefault (waits for the memslot update to
> complete).
>
> However, dropping SRCU creates more problems, as completing the memslot
> update will bump the memslot generation, which in turn will invalidate the
> MMU root.  To handle that, prefaulting would need to handle pending
> KVM_REQ_MMU_FREE_OBSOLETE_ROOTS requests and do kvm_mmu_reload() prior to
> mapping each individual page.
>
> I.e. to fully handle this scenario, prefaulting would eventually need to
> look a lot like vcpu_enter_guest().  Given that there's no reasonable use
> case and practically zero risk of breaking userspace, punt the problem to
> userspace and avoid adding unnecessary complexity to the prefault path.
>
> Note, TDX's guest_memfd post-populate path is unaffected as slots_lock is
> held for the entire duration of populate(), i.e. any memslot modifications
> will be fully serialized against TDX's flavor of prefaulting.
>
> Reported-by: Reinette Chatre <reinette.chatre@xxxxxxxxx>
> Closes: https://lore.kernel.org/all/20250519023737.30360-1-yan.y.zhao@xxxxxxxxx
> Debugged-by: Yan Zhao <yan.y.zhao@xxxxxxxxx>
> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> ---
>  arch/x86/kvm/mmu/mmu.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index a284dce227a0..7ae56a3c7607 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4595,10 +4595,16 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
>  	/*
>  	 * Retry the page fault if the gfn hit a memslot that is being deleted
>  	 * or moved.  This ensures any existing SPTEs for the old memslot will
> -	 * be zapped before KVM inserts a new MMIO SPTE for the gfn.
> +	 * be zapped before KVM inserts a new MMIO SPTE for the gfn.  Punt the
> +	 * error to userspace if this is a prefault, as KVM's prefaulting ABI
> +	 * doesn't need to provide the same forward progress guarantees as KVM_RUN.
>  	 */
> -	if (slot->flags & KVM_MEMSLOT_INVALID)
> +	if (slot->flags & KVM_MEMSLOT_INVALID) {
> +		if (fault->prefetch)
> +			return -EAGAIN;
> +
>  		return RET_PF_RETRY;
> +	}
> 
>  	if (slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT) {
>  		/*
>
> base-commit: 45eb29140e68ffe8e93a5471006858a018480a45
> --
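For reference, a rough sketch (untested, just the two diffs combined) of how the
KVM_MEMSLOT_INVALID handling in kvm_mmu_faultin_pfn() would read with your patch
plus the TDX fix above applied:

        /*
         * Sketch only: the combined result of the two diffs above, i.e. how an
         * invalid (being deleted/moved) memslot would be handled per caller.
         */
        if (slot->flags & KVM_MEMSLOT_INVALID) {
                /* KVM_PRE_FAULT_MEMORY: punt the retry to userspace. */
                if (fault->prefetch)
                        return -EAGAIN;

                /*
                 * Private (TDX) fault: exit to userspace with a memory fault
                 * instead of retrying against a slot that is going away.
                 */
                if (fault->is_private) {
                        kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
                        return -EFAULT;
                }

                /* KVM_RUN: retry; re-entering the guest drops SRCU. */
                return RET_PF_RETRY;
        }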