On Mon, May 19, 2025 at 06:33:14AM -0700, Sean Christopherson wrote:
> On Mon, May 19, 2025, Yan Zhao wrote:
> > Introduce a new return value RET_PF_RETRY_INVALID_SLOT to inform callers of
> > kvm_mmu_do_page_fault() that a fault retry is due to an invalid memslot.
> > This helps prevent deadlocks when a memslot is removed during pre-faulting
> > of GPAs in the memslot or during TDX's local retry of faulting private
> > pages.
> >
> > Take pre-faulting as an example.
> >
> > During the ioctl KVM_PRE_FAULT_MEMORY, kvm->srcu is acquired around the
> > pre-faulting of the entire range. For x86, kvm_arch_vcpu_pre_fault_memory()
> > further invokes kvm_tdp_map_page(), which retries kvm_mmu_do_page_fault()
> > if the return value is RET_PF_RETRY.
> >
> > If a memslot is deleted during the ioctl KVM_PRE_FAULT_MEMORY, after
> > kvm_invalidate_memslot() marks the slot as invalid and makes it visible via
> > rcu_assign_pointer() in kvm_swap_active_memslots(), kvm_mmu_do_page_fault()
> > may encounter the invalid slot and return RET_PF_RETRY. Consequently,
> > kvm_tdp_map_page() will retry without releasing the SRCU lock.
> > Meanwhile, synchronize_srcu_expedited() in kvm_swap_active_memslots() is
> > blocked, waiting for kvm_vcpu_pre_fault_memory() to release the SRCU lock,
> > leading to a deadlock.
>
> Probably worth calling out that KVM will respond to signals, i.e. there's
> no risk to the host kernel.
Yes. The user can stop at any time by killing the process. There is no risk
to the host kernel.

> > "slot deleting" thread            "prefault" thread
> > -----------------------------     ----------------------
> >                                   srcu_read_lock();
> > (A)
> > invalid_slot->flags |= KVM_MEMSLOT_INVALID;
> > rcu_assign_pointer();
> >
> >                                   kvm_tdp_map_page();
> > (B)                               do {
> >                                     r = kvm_mmu_do_page_fault();
> >
> > (C) synchronize_srcu_expedited();
> >
> >                                   } while (r == RET_PF_RETRY);
> >
> >                                   (D) srcu_read_unlock();
> >
> > As shown in the diagram, (C) is waiting for (D). However, (B) continuously
> > finds an invalid slot before (C) completes, causing (B) to retry and
> > preventing (D) from being invoked.
> >
> > The local retry code in TDX's EPT violation handler faces a similar issue,
> > where a deadlock can occur when faulting a private GFN in a slot that is
> > concurrently being removed.
> >
> > To resolve the deadlock, introduce a new return value
> > RET_PF_RETRY_INVALID_SLOT and modify kvm_mmu_do_page_fault() to return
> > RET_PF_RETRY_INVALID_SLOT instead of RET_PF_RETRY when encountering an
> > invalid memslot. This prevents endless retries in kvm_tdp_map_page() or
> > tdx_handle_ept_violation(), allowing the SRCU lock to be released and
> > slot removal to proceed.
> >
> > As all callers of kvm_tdp_map_page(), i.e.,
> > kvm_arch_vcpu_pre_fault_memory() and tdx_gmem_post_populate(), are in the
> > pre-fault path, treat RET_PF_RETRY_INVALID_SLOT the same as RET_PF_EMULATE
> > and return -ENOENT from kvm_tdp_map_page(), so that userspace is made
> > aware of the slot removal.
>
> Userspace should already be "aware" of the slot removal.
Hmm, I meant letting userspace know that there's an error due to faulting
into a slot that is under removal.

> > Returning RET_PF_RETRY_INVALID_SLOT in kvm_mmu_do_page_fault() does not
> > affect kvm_mmu_page_fault() and kvm_arch_async_page_ready(), as their
> > callers either only check whether the return value is > 0 to re-enter the
> > vCPU for retry, or do not check the return value at all.
> >
> > Reported-by: Reinette Chatre <reinette.chatre@xxxxxxxxx>
>
> Was this hit by a real VMM?  If so, why is a TDX VMM removing a memslot
> without kicking vCPUs out of KVM?
Not a real VMM. It's solely for stress testing (or maybe better to call it
negative testing).

For TDX, the only memslot removal case for now is memory hot-unplug, which
notifies the guest in advance.

> Regardless, I would prefer not to add a new RET_PF_* flag for this.  At a
> glance, KVM can simply drop and reacquire SRCU in the relevant paths.
Yes, we can switch to SRCU if you prefer that path.

Previously, I thought it was redundant to acquire SRCU in
kvm_gmem_populate(), as it already holds slots_lock. I also assumed you
would prefer the ioctl KVM_PRE_FAULT_MEMORY to fault under a single memslot
layout, because kvm_vcpu_pre_fault_memory() acquires SRCU around the entire
range.

Otherwise, could we acquire the SRCU lock for each kvm_mmu_do_page_fault()?
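Something along these lines, as an illustrative sketch of the pattern only
(using the raw srcu_read_lock()/srcu_read_unlock() API; the actual diff
below uses the kvm_vcpu_srcu_read_lock()/unlock() helpers):

	/*
	 * Sketch: enter/exit the SRCU read-side critical section around
	 * each individual fault attempt instead of around the whole retry
	 * loop, so that synchronize_srcu_expedited() in
	 * kvm_swap_active_memslots() can complete between two retries and
	 * the next attempt observes the updated memslots rather than
	 * spinning on the invalid slot.
	 */
	do {
		int idx = srcu_read_lock(&vcpu->kvm->srcu);

		r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
		srcu_read_unlock(&vcpu->kvm->srcu, idx);
	} while (r == RET_PF_RETRY);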
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index cbc84c6abc2e..15e98202868a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4859,6 +4859,14 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
 		return -EOPNOTSUPP;
 
 	do {
+		/*
+		 * reload is efficient when called repeatedly, so we can do it on
+		 * every iteration.
+		 */
+		r = kvm_mmu_reload(vcpu);
+		if (unlikely(r))
+			return r;
+
 		if (signal_pending(current))
 			return -EINTR;
 
@@ -4866,7 +4874,10 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
 			return -EIO;
 
 		cond_resched();
+
+		kvm_vcpu_srcu_read_lock(vcpu);
 		r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
+		kvm_vcpu_srcu_read_unlock(vcpu);
 	} while (r == RET_PF_RETRY);
 
 	if (r < 0)
@@ -4902,14 +4913,6 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
 	if (!vcpu->kvm->arch.pre_fault_allowed)
 		return -EOPNOTSUPP;
 
-	/*
-	 * reload is efficient when called repeatedly, so we can do it on
-	 * every iteration.
-	 */
-	r = kvm_mmu_reload(vcpu);
-	if (r)
-		return r;
-
 	if (kvm_arch_has_private_mem(vcpu->kvm) &&
 	    kvm_mem_is_private(vcpu->kvm, gpa_to_gfn(range->gpa)))
 		error_code |= PFERR_PRIVATE_ACCESS;
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b952bc673271..b1f20c7fd17d 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1920,7 +1920,9 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
 			break;
 		}
 
+		kvm_vcpu_srcu_read_unlock(vcpu);
 		cond_resched();
+		kvm_vcpu_srcu_read_lock(vcpu);
 	}
 	return ret;
 }
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b24db92e98f3..c502105905af 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4266,7 +4266,6 @@ static int kvm_vcpu_ioctl_get_stats_fd(struct kvm_vcpu *vcpu)
 static int kvm_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
 				     struct kvm_pre_fault_memory *range)
 {
-	int idx;
 	long r;
 	u64 full_size;
 
@@ -4279,7 +4278,6 @@ static int kvm_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
 		return -EINVAL;
 
 	vcpu_load(vcpu);
-	idx = srcu_read_lock(&vcpu->kvm->srcu);
 
 	full_size = range->size;
 	do {
@@ -4300,7 +4298,6 @@ static int kvm_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
 		cond_resched();
 	} while (range->size);
 
-	srcu_read_unlock(&vcpu->kvm->srcu, idx);
 	vcpu_put(vcpu);
 
 	/* Return success if at least one page was mapped successfully. */
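
For reference, with the diff applied, the retry loop in kvm_tdp_map_page()
would read roughly as follows (reconstructed from the hunks above; the -EIO
check between the first two hunks is elided):

	do {
		/*
		 * reload is efficient when called repeatedly, so we can do it on
		 * every iteration.
		 */
		r = kvm_mmu_reload(vcpu);
		if (unlikely(r))
			return r;

		if (signal_pending(current))
			return -EINTR;

		/* ... -EIO check elided ... */

		cond_resched();

		kvm_vcpu_srcu_read_lock(vcpu);
		r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
		kvm_vcpu_srcu_read_unlock(vcpu);
	} while (r == RET_PF_RETRY);

Each kvm_mmu_do_page_fault() then runs in its own SRCU read-side critical
section, and kvm_mmu_reload() moves from kvm_arch_vcpu_pre_fault_memory()
into the loop, presumably so that a root invalidated by a concurrent
memslot update is reloaded before the next attempt.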