Hi Sean,

On 5/19/25 10:06 AM, Sean Christopherson wrote:
> On Mon, May 19, 2025, Rick P Edgecombe wrote:
>> On Mon, 2025-05-19 at 06:33 -0700, Sean Christopherson wrote:
>>> Was this hit by a real VMM? If so, why is a TDX VMM removing a memslot without
>>> kicking vCPUs out of KVM?
>>>
>>> Regardless, I would prefer not to add a new RET_PF_* flag for this. At a glance,
>>> KVM can simply drop and reacquire SRCU in the relevant paths.
>>
>> During the initial debugging and kicking around stage, this is the first
>> direction we looked. But kvm_gmem_populate() doesn't have srcu locked, so then
>> kvm_tdp_map_page() tries to unlock without it being held. (although that version
>> didn't check r == RET_PF_RETRY like you had). Yan had the following concerns and
>> came up with the version in this series, which we held review on for the list:
>
> Ah, I missed the kvm_gmem_populate() => kvm_tdp_map_page() chain.
>
>>> However, upon further consideration, I am reluctant to implement this fix for
>
> Which fix?
>
>>> the following reasons:
>>> - kvm_gmem_populate() already holds the kvm->slots_lock.
>>> - While retrying with srcu unlock and lock can work around the
>>>   KVM_MEMSLOT_INVALID deadlock, it results in each kvm_vcpu_pre_fault_memory()
>>>   and tdx_handle_ept_violation() faulting with different memslot layouts.
>
> This behavior has existed since pretty much the beginning of KVM time. TDX is the
> oddball that doesn't re-enter the guest. All other flavors re-enter the guest on
> RET_PF_RETRY, which means dropping and reacquiring SRCU. Which is why I don't like
> RET_PF_RETRY_INVALID_SLOT; it's simply handling the case we know about.
>
> Arguably, _TDX_ is buggy by not providing this behavior.
>
>> I'm not sure why the second one is really a problem. For the first one I think
>> that path could just take the srcu lock in the proper order with
>> kvm->slots_lock?
>
> Acquiring SRCU inside slots_lock should be fine. The reverse order would be
> problematic, as KVM synchronizes SRCU while holding slots_lock.
>
> That said, I don't love the idea of grabbing SRCU, because it's so obviously a
> hack. What about something like this?
>
> ---
>  arch/x86/kvm/mmu.h     |  2 ++
>  arch/x86/kvm/mmu/mmu.c | 49 +++++++++++++++++++++++++++---------------
>  arch/x86/kvm/vmx/tdx.c |  7 ++++--
>  virt/kvm/kvm_main.c    |  5 ++---
>  4 files changed, 41 insertions(+), 22 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index b4b6860ab971..0fc68f0fe80e 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -259,6 +259,8 @@ extern bool tdp_mmu_enabled;
>
>  bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa);
>  int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);
> +int kvm_tdp_prefault_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
> +                          u8 *level);
>
>  static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
>  {
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index cbc84c6abc2e..4f16fe95173c 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4851,24 +4851,15 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
>  {
>          int r;
>
> -        /*
> -         * Restrict to TDP page fault, since that's the only case where the MMU
> -         * is indexed by GPA.
> -         */
> -        if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
> -                return -EOPNOTSUPP;
> +        if (signal_pending(current))
> +                return -EINTR;
>
> -        do {
> -                if (signal_pending(current))
> -                        return -EINTR;
> +        if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu))
> +                return -EIO;
>
> -                if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu))
> -                        return -EIO;
> -
> -                cond_resched();
> -                r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
> -        } while (r == RET_PF_RETRY);
> +        cond_resched();
>
> +        r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
>          if (r < 0)
>                  return r;
>
> @@ -4878,10 +4869,12 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
>          case RET_PF_WRITE_PROTECTED:
>                  return 0;
>
> +        case RET_PF_RETRY:
> +                return -EAGAIN;
> +
>          case RET_PF_EMULATE:
>                  return -ENOENT;
>
> -        case RET_PF_RETRY:
>          case RET_PF_CONTINUE:
>          case RET_PF_INVALID:
>          default:
> @@ -4891,6 +4884,28 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
>  }
>  EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
>
> +int kvm_tdp_prefault_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
> +{
> +        int r;
> +
> +        /*
> +         * Restrict to TDP page fault, since that's the only case where the MMU
> +         * is indexed by GPA.
> +         */
> +        if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
> +                return -EOPNOTSUPP;
> +
> +        for (;;) {
> +                r = kvm_tdp_map_page(vcpu, gpa, error_code, level);
> +                if (r != -EAGAIN)
> +                        break;
> +
> +                /* Comment goes here. */
> +                kvm_vcpu_srcu_read_unlock(vcpu);
> +                kvm_vcpu_srcu_read_lock(vcpu);
> +        }

added "return r" here.

> +}
> +
>  long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
>                                      struct kvm_pre_fault_memory *range)
>  {
> @@ -4918,7 +4933,7 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
>           * Shadow paging uses GVA for kvm page fault, so restrict to
>           * two-dimensional paging.
>           */
> -        r = kvm_tdp_map_page(vcpu, range->gpa, error_code, &level);
> +        r = kvm_tdp_prefault_page(vcpu, range->gpa, error_code, &level);
>          if (r < 0)
>                  return r;
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index b952bc673271..1a232562080d 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -3075,8 +3075,11 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>          if (ret != 1)
>                  return -ENOMEM;
>
> -        ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
> -        if (ret < 0)
> +        do {
> +                ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
> +        while (ret == -EAGAIN);

added closing '}' here

> +
> +        if (ret)
>                  goto out;
>
>          /*
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index b24db92e98f3..21a3fa7476dd 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4266,7 +4266,6 @@ static int kvm_vcpu_ioctl_get_stats_fd(struct kvm_vcpu *vcpu)
>  static int kvm_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
>                                       struct kvm_pre_fault_memory *range)
>  {
> -        int idx;
>          long r;
>          u64 full_size;
>
> @@ -4279,7 +4278,7 @@ static int kvm_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
>                  return -EINVAL;
>
>          vcpu_load(vcpu);
> -        idx = srcu_read_lock(&vcpu->kvm->srcu);
> +        kvm_vcpu_srcu_read_lock(vcpu);
>
>          full_size = range->size;
>          do {
> @@ -4300,7 +4299,7 @@ static int kvm_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
>                  cond_resched();
>          } while (range->size);
>
> -        srcu_read_unlock(&vcpu->kvm->srcu, idx);
> +        kvm_vcpu_srcu_read_unlock(vcpu);
>          vcpu_put(vcpu);
>
>          /* Return success if at least one page was mapped successfully.  */
>
> base-commit: 12ca5c63556bbfcd77fe890fcdd1cd1adfb31fdd
> --

Just reporting on the testing side ...

I made two small changes to the patch, as indicated inline above (and sketched
at the end of this mail). With these changes, pre_fault_memory_test (with the
changes from patch #2) hangs, although KVM still responds to signals. fwiw,
pre_fault_memory_test also hangs with [1].

I also tried out the TDX stress tests that you do not (yet) have access to.
The stress tests pass with [1] but hang with this change.

Reinette

[1] https://lore.kernel.org/lkml/aCsy-m_esVjy8Pey@xxxxxxxxxx/
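For completeness, this is roughly how the two fixed-up loops read in the tree
I tested; it is just my reading of the proposal above with the missing '}' and
the trailing "return r;" added, nothing else, and the comments are mine rather
than Sean's:

        /*
         * In kvm_tdp_prefault_page(): retry the map on -EAGAIN, cycling the
         * vCPU's SRCU read lock so a memslot update can complete, and hand
         * the final result back to the caller (the added "return r;").
         */
        for (;;) {
                r = kvm_tdp_map_page(vcpu, gpa, error_code, level);
                if (r != -EAGAIN)
                        break;

                kvm_vcpu_srcu_read_unlock(vcpu);
                kvm_vcpu_srcu_read_lock(vcpu);
        }
        return r;

        /*
         * In tdx_gmem_post_populate(): same retry on -EAGAIN, but SRCU is not
         * held on this path, so simply loop until kvm_tdp_map_page() stops
         * returning -EAGAIN (the added closing '}').
         */
        do {
                ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
        } while (ret == -EAGAIN);

        if (ret)
                goto out;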