Introduce a new return value RET_PF_RETRY_INVALID_SLOT to inform callers
of kvm_mmu_do_page_fault() that a fault retry is due to an invalid
memslot. This prevents deadlocks when a memslot is removed while its
GPAs are being pre-faulted, or while TDX locally retries faults on
private pages.

Take pre-faulting as an example.

During the KVM_PRE_FAULT_MEMORY ioctl, kvm->srcu is held across the
pre-faulting of the entire range. For x86, kvm_arch_vcpu_pre_fault_memory()
further invokes kvm_tdp_map_page(), which retries kvm_mmu_do_page_fault()
as long as the return value is RET_PF_RETRY.

If a memslot is deleted while the KVM_PRE_FAULT_MEMORY ioctl is running,
then after kvm_invalidate_memslot() marks the slot as invalid and makes
it visible via rcu_assign_pointer() in kvm_swap_active_memslots(),
kvm_mmu_do_page_fault() may encounter the invalid slot and return
RET_PF_RETRY. kvm_tdp_map_page() then retries without releasing the SRCU
lock. Meanwhile, synchronize_srcu_expedited() in kvm_swap_active_memslots()
is blocked waiting for kvm_vcpu_pre_fault_memory() to release that SRCU
lock, resulting in a deadlock:

  "slot deleting" thread                       "prefault" thread
  -----------------------------                ----------------------
                                               srcu_read_lock();
  (A)
  invalid_slot->flags |= KVM_MEMSLOT_INVALID;
  rcu_assign_pointer();
                                               kvm_tdp_map_page();
                                               (B)
                                               do {
                                                   r = kvm_mmu_do_page_fault();
  (C)
  synchronize_srcu_expedited();
                                               } while (r == RET_PF_RETRY);
                                               (D)
                                               srcu_read_unlock();

As shown in the diagram, (C) waits for (D). However, (B) keeps finding
the invalid slot before (C) completes, so (B) retries forever and (D) is
never invoked.

The local retry code in TDX's EPT violation handler faces a similar
issue: a deadlock can occur when faulting in a private GFN in a slot
that is concurrently being removed.

To resolve the deadlock, make kvm_mmu_do_page_fault() return the new
value RET_PF_RETRY_INVALID_SLOT instead of RET_PF_RETRY when it
encounters an invalid memslot. This prevents endless retries in
kvm_tdp_map_page() and tdx_handle_ept_violation(), allowing the SRCU
lock to be released so that slot removal can proceed.

Since all callers of kvm_tdp_map_page(), i.e.
kvm_arch_vcpu_pre_fault_memory() and tdx_gmem_post_populate(), are on
the pre-fault path, treat RET_PF_RETRY_INVALID_SLOT the same as
RET_PF_EMULATE and return -ENOENT from kvm_tdp_map_page(), making
userspace aware of the slot removal.

Returning RET_PF_RETRY_INVALID_SLOT from kvm_mmu_do_page_fault() does
not affect kvm_mmu_page_fault() and kvm_arch_async_page_ready(), as
their callers either only check whether the return value is > 0 before
re-entering the vCPU for retry, or do not check the return value at all.

Reported-by: Reinette Chatre <reinette.chatre@xxxxxxxxx>
Signed-off-by: Yan Zhao <yan.y.zhao@xxxxxxxxx>
---
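Note for reviewers: below is a condensed sketch of the retry loop in
kvm_tdp_map_page() that previously spun forever. It is paraphrased with
some checks elided, not a verbatim excerpt; see arch/x86/kvm/mmu/mmu.c
for the real code.

        do {
                /* Bail out on pending signals so userspace stays responsive. */
                if (signal_pending(current))
                        return -EINTR;

                cond_resched();
                r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true,
                                          NULL, level);
        } while (r == RET_PF_RETRY);

While the slot stays invalid, every iteration returned RET_PF_RETRY and
kvm->srcu was never released, so synchronize_srcu_expedited() in the
slot-deleting thread could not complete. With this patch, the loop exits
with RET_PF_RETRY_INVALID_SLOT and kvm_tdp_map_page() returns -ENOENT.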
 arch/x86/kvm/mmu/mmu.c          | 3 ++-
 arch/x86/kvm/mmu/mmu_internal.h | 3 +++
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index cbc84c6abc2e..3331e1e1aa69 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4599,7 +4599,7 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
 	 * be zapped before KVM inserts a new MMIO SPTE for the gfn.
 	 */
 	if (slot->flags & KVM_MEMSLOT_INVALID)
-		return RET_PF_RETRY;
+		return RET_PF_RETRY_INVALID_SLOT;
 
 	if (slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT) {
 		/*
@@ -4879,6 +4879,7 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
 		return 0;
 
 	case RET_PF_EMULATE:
+	case RET_PF_RETRY_INVALID_SLOT:
 		return -ENOENT;
 
 	case RET_PF_RETRY:
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index db8f33e4de62..1aa14a32225e 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -311,6 +311,8 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
  * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
  * RET_PF_FIXED: The faulting entry has been fixed.
  * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
+ * RET_PF_RETRY_INVALID_SLOT: Let CPU fault again on the address due to the
+ *                            slot being flagged KVM_MEMSLOT_INVALID.
  *
  * Any names added to this enum should be exported to userspace for use in
  * tracepoints via TRACE_DEFINE_ENUM() in mmutrace.h
@@ -326,6 +328,7 @@ enum {
 	RET_PF_INVALID,
 	RET_PF_FIXED,
 	RET_PF_SPURIOUS,
+	RET_PF_RETRY_INVALID_SLOT,
 };
 
 /*
-- 
2.43.2