On Tue, May 20, 2025 at 09:13:25AM -0700, Sean Christopherson wrote:
> On Tue, May 20, 2025, Yan Zhao wrote:
> > On Mon, May 19, 2025 at 10:06:22AM -0700, Sean Christopherson wrote:
> > > On Mon, May 19, 2025, Rick P Edgecombe wrote:
> > > > On Mon, 2025-05-19 at 06:33 -0700, Sean Christopherson wrote:
> > > > > Was this hit by a real VMM?  If so, why is a TDX VMM removing a memslot without
> > > > > kicking vCPUs out of KVM?
> > > > >
> > > > > Regardless, I would prefer not to add a new RET_PF_* flag for this.  At a glance,
> > > > > KVM can simply drop and reacquire SRCU in the relevant paths.
> > > >
> > > > During the initial debugging and kicking around stage, this is the first
> > > > direction we looked. But kvm_gmem_populate() doesn't have SRCU locked, so then
> > > > kvm_tdp_map_page() tries to unlock without it being held (although that version
> > > > didn't check r == RET_PF_RETRY like you had). Yan had the following concerns and
> > > > came up with the version in this series, which we held review on for the list:
> > >
> > > Ah, I missed the kvm_gmem_populate() => kvm_tdp_map_page() chain.
> > >
> > > > > However, upon further consideration, I am reluctant to implement this fix for
> > >
> > > Which fix?
> > >
> > > > > the following reasons:
> > > > > - kvm_gmem_populate() already holds the kvm->slots_lock.
> > > > > - While retrying with SRCU unlock and lock can work around the
> > > > >   KVM_MEMSLOT_INVALID deadlock, it results in each kvm_vcpu_pre_fault_memory()
> > > > >   and tdx_handle_ept_violation() faulting with different memslot layouts.
> > >
> > > This behavior has existed since pretty much the beginning of KVM time.  TDX is the
> > > oddball that doesn't re-enter the guest.  All other flavors re-enter the guest on
> > > RET_PF_RETRY, which means dropping and reacquiring SRCU.  Which is why I don't like
> > > RET_PF_RETRY_INVALID_SLOT; it's simply handling the case we know about.
> > >
> > > Arguably, _TDX_ is buggy by not providing this behavior.
> > >
> > > > I'm not sure why the second one is really a problem. For the first one I think
> > > > that path could just take the SRCU lock in the proper order with kvm-
> > > > >slots_lock?
> > >
> > > Acquiring SRCU inside slots_lock should be fine.  The reverse order would be
> > > problematic, as KVM synchronizes SRCU while holding slots_lock.
> > >
> > > That said, I don't love the idea of grabbing SRCU, because it's so obviously a
> > > hack.  What about something like this?
> >
> > So you want to avoid acquiring SRCU in the kvm_gmem_populate() path?
>
> Yes, ideally.  Acquiring SRCU wouldn't be the end of the world, but I don't love
> the idea of taking a lock just so that the lock can be conditionally dropped in
> a common flow.  It's not a deal breaker (I wouldn't be surprised if there's at
> least one path in KVM that acquires SRCU purely because of such behavior), but
> separating kvm_tdp_prefault_page() from kvm_tdp_map_page()
>
> > Generally I think it's good, except that it missed a kvm_mmu_reload() (please
> > refer to my comment below) and the kvm_vcpu_srcu_read_{un}lock() pair in the
> > tdx_handle_ept_violation() path (so Reinette reported that it failed the TDX
> > stress tests at [1]).
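To make the kvm_mmu_reload() point concrete, here's an untested sketch of what
the retry loop in your kvm_tdp_prefault_page() (quoted below) would need on top
of dropping SRCU, assuming kvm_mmu_reload() also takes care of freeing obsolete
roots (see the commit reference further down):

        /*
         * Untested sketch only: the retry loop from the quoted
         * kvm_tdp_prefault_page() below, with an MMU reload added so that
         * is_page_fault_stale() doesn't fail every retry once the memslot
         * update bumps the generation and invalidates the old root.
         */
        for (;;) {
                r = kvm_tdp_map_page(vcpu, gpa, error_code, level);
                if (r != -EAGAIN)
                        break;

                /* Drop SRCU so the pending memslot update can complete. */
                kvm_vcpu_srcu_read_unlock(vcpu);
                kvm_vcpu_srcu_read_lock(vcpu);

                /* Pick up a fresh root if the old one was invalidated. */
                r = kvm_mmu_reload(vcpu);
                if (r)
                        break;
        }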
> > > @@ -4891,6 +4884,28 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
> > >  }
> > >  EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
> > >
> > > +int kvm_tdp_prefault_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
> > > +{
> > > +	int r;
> > > +
> > > +	/*
> > > +	 * Restrict to TDP page fault, since that's the only case where the MMU
> > > +	 * is indexed by GPA.
> > > +	 */
> > > +	if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
> > > +		return -EOPNOTSUPP;
> > > +
> > > +	for (;;) {
> > > +		r = kvm_tdp_map_page(vcpu, gpa, error_code, level);
> > > +		if (r != -EAGAIN)
> > > +			break;
> > > +
> > > +		/* Comment goes here. */
> > > +		kvm_vcpu_srcu_read_unlock(vcpu);
> > > +		kvm_vcpu_srcu_read_lock(vcpu);
> >
> > For the hang in the pre_fault_memory_test reported by Reinette [1], it's because
> > the memslot removal succeeds after releasing the SRCU, and the old root is then
> > stale. So kvm_mmu_reload() is required here to prevent is_page_fault_stale()
> > from always being true.
>
> That wouldn't suffice, KVM would also need to process KVM_REQ_MMU_FREE_OBSOLETE_ROOTS,
> otherwise kvm_mmu_reload() will do nothing.

In commit 20a6cff3b283 ("KVM: x86/mmu: Check and free obsolete roots in
kvm_mmu_reload()"), KVM_REQ_MMU_FREE_OBSOLETE_ROOTS is processed in
kvm_mmu_reload().

> Thinking about this scenario more, I don't mind punting this problem to userspace
> for KVM_PRE_FAULT_MEMORY because there's no existing behavior/ABI to uphold, and
> because the complexity vs. ABI tradeoffs are heavily weighted in favor of punting
> to userspace.  Whereas for KVM_RUN, KVM can't change existing behavior without
> breaking userspace, should provide consistent behavior regardless of VM type, and
> KVM needs the "complex" code irrespective of this particular scenario.
>
> I especially like punting to userspace if KVM returns -EAGAIN, not -ENOENT,
> because then KVM is effectively providing the same overall behavior as KVM_RUN,
> just with slightly different roles and responsibilities between KVM and
> userspace.  And -ENOENT is also flat out wrong for the case where a memslot is
> being moved, but the new base+size still contains the to-be-faulted GPA.
>
> I still don't like RET_PF_RETRY_INVALID_SLOT, because that bleeds gory MMU details
> into the rest of KVM, but KVM can simply return -EAGAIN if an invalid memslot is
> encountered during prefault (as identified by fault->prefetch).
>
> For TDX though, tdx_handle_ept_violation() needs to play nice with the scenario,
> i.e. punting to userspace is not a viable option.  But that path also has options
> that aren't available to prefaulting.  E.g. it could (and probably should) break
> early if a request is pending instead of special casing KVM_REQ_VM_DEAD, which

Hmm, for TDX, there's no KVM_REQ_MMU_FREE_OBSOLETE_ROOTS request on slot removal
(see commit aa8d1f48d353 ("KVM: x86/mmu: Introduce a quirk to control memslot
zap behavior")).

> would take care of the KVM_REQ_MMU_FREE_OBSOLETE_ROOTS scenario.  And as Rick
> called out, the zero-step mess really needs to be solved in a more robust fashion.
>
> So this?

Looks good to me for the non-TDX side. For TDX, could we apply the fix below on
top of your change? For a private fault, -EFAULT will be returned to userspace
after the retry anyway once the slot is completely removed, unlike non-private
faults, which go down the emulation path after the retry.
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4602,6 +4602,11 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
 		if (fault->prefetch)
 			return -EAGAIN;
 
+		if (fault->is_private) {
+			kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+			return -EFAULT;
+		}
+
 		return RET_PF_RETRY;
 	}

And would you mind if I included your patch in my next version? I can update
the related selftests as well.

Thanks
Yan

> ---
> From: Sean Christopherson <seanjc@xxxxxxxxxx>
> Date: Tue, 20 May 2025 07:55:32 -0700
> Subject: [PATCH] KVM: x86/mmu: Return -EAGAIN if userspace deletes/moves
>  memslot during prefault
>
> Return -EAGAIN if userspace attempts to delete or move a memslot while also
> prefaulting memory for that same memslot, i.e. force userspace to retry
> instead of trying to handle the scenario entirely within KVM.  Unlike
> KVM_RUN, which needs to handle the scenario entirely within KVM because
> userspace has come to depend on such behavior, KVM_PRE_FAULT_MEMORY can
> return -EAGAIN without breaking userspace as this scenario can't have ever
> worked (and there's no sane use case for prefaulting to a memslot that's
> being deleted/moved).
>
> And also unlike KVM_RUN, the prefault path doesn't naturally guarantee
> forward progress.  E.g. to handle such a scenario, KVM would need to drop
> and reacquire SRCU to break the deadlock between the memslot update
> (synchronizes SRCU) and the prefault (waits for the memslot update to
> complete).
>
> However, dropping SRCU creates more problems, as completing the memslot
> update will bump the memslot generation, which in turn will invalidate the
> MMU root.  To handle that, prefaulting would need to handle pending
> KVM_REQ_MMU_FREE_OBSOLETE_ROOTS requests and do kvm_mmu_reload() prior to
> mapping each individual page.
>
> I.e. to fully handle this scenario, prefaulting would eventually need to
> look a lot like vcpu_enter_guest().  Given that there's no reasonable use
> case and practically zero risk of breaking userspace, punt the problem to
> userspace and avoid adding unnecessary complexity to the prefault path.
>
> Note, TDX's guest_memfd post-populate path is unaffected as slots_lock is
> held for the entire duration of populate(), i.e. any memslot modifications
> will be fully serialized against TDX's flavor of prefaulting.
>
> Reported-by: Reinette Chatre <reinette.chatre@xxxxxxxxx>
> Closes: https://lore.kernel.org/all/20250519023737.30360-1-yan.y.zhao@xxxxxxxxx
> Debugged-by: Yan Zhao <yan.y.zhao@xxxxxxxxx>
> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> ---
>  arch/x86/kvm/mmu/mmu.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index a284dce227a0..7ae56a3c7607 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4595,10 +4595,16 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
>  	/*
>  	 * Retry the page fault if the gfn hit a memslot that is being deleted
>  	 * or moved.  This ensures any existing SPTEs for the old memslot will
> -	 * be zapped before KVM inserts a new MMIO SPTE for the gfn.
> +	 * be zapped before KVM inserts a new MMIO SPTE for the gfn.  Punt the
> +	 * error to userspace if this is a prefault, as KVM's prefaulting ABI
> +	 * doesn't need to provide the same forward progress guarantees as KVM_RUN.
>  	 */
> -	if (slot->flags & KVM_MEMSLOT_INVALID)
> +	if (slot->flags & KVM_MEMSLOT_INVALID) {
> +		if (fault->prefetch)
> +			return -EAGAIN;
> +
>  		return RET_PF_RETRY;
> +	}
> 
>  	if (slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT) {
>  		/*
>
> base-commit: 45eb29140e68ffe8e93a5471006858a018480a45
> --
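For reference, a rough sketch (untested, just the two diffs combined) of how the
KVM_MEMSLOT_INVALID handling in kvm_mmu_faultin_pfn() would read with your patch
plus the TDX fix above applied:

        /*
         * Sketch only: the combined result of the two diffs above, i.e. how an
         * invalid (being deleted/moved) memslot would be handled per caller.
         */
        if (slot->flags & KVM_MEMSLOT_INVALID) {
                /* KVM_PRE_FAULT_MEMORY: punt the retry to userspace. */
                if (fault->prefetch)
                        return -EAGAIN;

                /*
                 * Private (TDX) fault: exit to userspace with a memory fault
                 * instead of retrying against a slot that is going away.
                 */
                if (fault->is_private) {
                        kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
                        return -EFAULT;
                }

                /* KVM_RUN: retry; re-entering the guest drops SRCU. */
                return RET_PF_RETRY;
        }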