On Mon, 2025-05-19 at 10:06 -0700, Sean Christopherson wrote:
> On Mon, May 19, 2025, Rick P Edgecombe wrote:
> > On Mon, 2025-05-19 at 06:33 -0700, Sean Christopherson wrote:
> > > Was this hit by a real VMM? If so, why is a TDX VMM removing a memslot
> > > without kicking vCPUs out of KVM?
> > >
> > > Regardless, I would prefer not to add a new RET_PF_* flag for this. At a
> > > glance, KVM can simply drop and reacquire SRCU in the relevant paths.
> >
> > During the initial debugging and kicking-around stage, this is the first
> > direction we looked. But kvm_gmem_populate() doesn't have SRCU locked, so
> > kvm_tdp_map_page() then tries to unlock without it being held (although
> > that version didn't check r == RET_PF_RETRY like you had). Yan had the
> > following concerns and came up with the version in this series, which we
> > held back for review on the list:
>
> Ah, I missed the kvm_gmem_populate() => kvm_tdp_map_page() chain.
>
> > > However, upon further consideration, I am reluctant to implement this
> > > fix for
>
> Which fix?

The one considered when debugging the issue. It was pretty much exactly what
you first suggested, but with the SRCU lock taken in tdx_gmem_post_populate().
Since it was so close, I just shared Yan's comment on it. In case it looks
like internal code review, here is some more history:

1. Reinette reports a bug from an internal test, wondering if it's valid
   userspace behavior
2. I suggest the SRCU root cause
3. Yan provides a specific diff (pretty much what you suggested) for Reinette
   to test, who finds the post_populate case generates a warning
4. Yan looks at fixing up the post_populate case, but decides she doesn't
   like it (the quoted blurb) and develops the alternative in this series

> > > the following reasons:
> > > - kvm_gmem_populate() already holds the kvm->slots_lock.
> > > - While retrying with SRCU unlock and lock can work around the
> > >   KVM_MEMSLOT_INVALID deadlock, it results in each
> > >   kvm_vcpu_pre_fault_memory() and tdx_handle_ept_violation() faulting
> > >   with different memslot layouts.
>
> This behavior has existed since pretty much the beginning of KVM time.

The non-TDX issues today are related to the pre-fault memory stuff, which
doesn't enter the guest for a different reason.

> TDX is the oddball that doesn't re-enter the guest. All other flavors
> re-enter the guest on RET_PF_RETRY, which means dropping and reacquiring
> SRCU. Which is why I don't like RET_PF_RETRY_INVALID_SLOT; it's simply
> handling the case we know about.
>
> Arguably, _TDX_ is buggy by not providing this behavior.

Long term, the zero-step issue needs to be resolved. So yea, we should avoid
building around it for anything non-critical (like this). But I don't see why
prefault is not in the same category of oddness.

> > I'm not sure why the second one is really a problem. For the first one I
> > think that path could just take the SRCU lock in the proper order with
> > kvm->slots_lock?
>
> Acquiring SRCU inside slots_lock should be fine. The reverse order would be
> problematic, as KVM synchronizes SRCU while holding slots_lock.
>
> That said, I don't love the idea of grabbing SRCU, because it's so obviously
> a hack. What about something like this?

Like Reinette noticed, it doesn't pass the included test. It looks like
synchronize_srcu_expedited() is no longer stuck waiting, but there is some
other RET_PF_RETRY loop not caused by KVM_MEMSLOT_INVALID. Must be some side
effect. Will debug further.

Two rough sketches of the locking shapes discussed above are appended below
for reference.
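First, the "drop and reacquire SRCU on RET_PF_RETRY" shape that re-entering
the guest provides implicitly. Untested and illustrative only; do_map_fault()
is a made-up stand-in for whatever actually resolves the fault, not a real
helper:

static int map_with_retry_sketch(struct kvm_vcpu *vcpu, gpa_t gpa)
{
	int r;

	for (;;) {
		r = do_map_fault(vcpu, gpa);	/* made-up helper */
		if (r != RET_PF_RETRY)
			return r;

		if (signal_pending(current))
			return -EINTR;

		/*
		 * What re-entering the guest does implicitly: leave the SRCU
		 * read-side critical section so a memslot deletion blocked in
		 * synchronize_srcu() can finish, then retry against the
		 * updated memslot layout instead of spinning forever on the
		 * KVM_MEMSLOT_INVALID slot.
		 */
		kvm_vcpu_srcu_read_unlock(vcpu);
		cond_resched();
		kvm_vcpu_srcu_read_lock(vcpu);
	}
}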
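Second, the nesting from the fix we kicked around internally: take the SRCU
read lock inside the slots_lock that kvm_gmem_populate() already holds.
Again untested, body elided; only the lock ordering is the point:

static int post_populate_srcu_sketch(struct kvm *kvm)
{
	int idx, r = 0;

	/* kvm_gmem_populate() already holds slots_lock. */
	lockdep_assert_held(&kvm->slots_lock);

	/*
	 * SRCU inside slots_lock is fine: the SRCU read side never blocks on
	 * slots_lock. The reverse nesting can deadlock, because memslot
	 * updates call synchronize_srcu(&kvm->srcu) while holding
	 * slots_lock.
	 */
	idx = srcu_read_lock(&kvm->srcu);

	/* ... kvm_tdp_map_page() and the rest of post-populate ... */

	srcu_read_unlock(&kvm->srcu, idx);
	return r;
}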