Introduce a new return value RET_PF_RETRY_INVALID_SLOT to inform callers of kvm_mmu_do_page_fault() that a fault retry is caused by an invalid memslot. This helps prevent deadlock when a memslot is removed during pre-faulting or local retry of faulting private pages in TDX. Take pre-faulting as an example. Removing a memslot during the ioctl KVM_PRE_FAULT_MEMORY on x86 can lead to a kernel deadlock. "slot deleting" thread "pre-fault" thread ----------------------------- ---------------------- srcu_read_lock(); (A) invalid_slot->flags |= KVM_MEMSLOT_INVALID rcu_assign_pointer(); kvm_tdp_map_page(); (B) do { r = kvm_mmu_do_page_fault(); (C) synchronize_srcu_expedited(); } while (r == RET_PF_RETRY); (D) srcu_read_unlock(); The deadlock occurs because (C) is waiting for (D) to complete, but (B) repeatedly encounters an invalid slot and retries before (C) can finish, thus preventing (D) from being executed. (If the srcu_read_lock() in the pre-fault thread is invoked after (A) and before (C), deadlock can also occur). The local retry code in TDX's EPT violation handler faces a similar issue, where deadlock can occur if an invalid slot is found when faulting a private GFN. To resolve the deadlock, modify kvm_mmu_do_page_fault() to return RET_PF_RETRY_INVALID_SLOT instead of RET_PF_RETRY when encountering an invalid memslot. This change enables the pre-fault thread or TDX's EPT violation handler to exit the loop and release the srcu lock. There're some alternative solutions to address the deadlock. 1) Return -EAGAIN for an invalid slot. Cons: This approach involves an uAPI change, requiring userspace to ignore error -EAGAIN and re-run the vCPU. While QEMU already ignores -EAGAIN, selftests may need updates. 2) Call kvm_mmu_prepare_memory_fault_exit() and return -EFAULT for an invalid slot. Cons: - kvm_mmu_prepare_memory_fault_exit() is not recognized by userspace for normal VMs. - Returning -EFAULT when a memslot is still invalid (i.e., not completely removed) may confuse userspace. 3) Release the srcu lock before cond_resched() and re-acquire it afterward in kvm_tdp_map_page() and tdx_handle_ept_violation(): Cons: - It allows each kvm_vcpu_pre_fault_memory() to pre-fault memory with different memslot layouts. - It requires tdx_gmem_post_populate() to acquire the srcu lock before invoking kvm_tdp_map_page(), which is redundant since tdx_gmem_post_populate() already holds the kvm->slots_lock. Patch 1 opts to introduce RET_PF_RETRY_INVALID_SLOT, which prevents endless retries in kvm_tdp_map_page() and tdx_handle_ept_violation() while avoiding exiting to userspace. Patch 2 provides a selftest to reproduce the deadlock when patch 1 is absent and verifies the pre-fault can correctly completes when the faulting range is in a removing memslot. Note: The selftest pre_fault_memory_test relies on commit 20a6cff3b283 ("KVM: x86/mmu: Check and free obsolete roots in kvm_mmu_reload()") to run the second ioctl KVM_PRE_FAULT_MEMORY successfully after a memory slot is removed during the first ioctl KVM_PRE_FAULT_MEMORY. Base commit: kvm-x86/next or kvm/next + commit 20a6cff3b283 ("KVM: x86/mmu: Check and free obsolete roots in kvm_mmu_reload()") Yan Zhao (2): KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault retry on invalid slot KVM: selftests: Test prefault memory with concurrent memslot removal arch/x86/kvm/mmu/mmu.c | 3 +- arch/x86/kvm/mmu/mmu_internal.h | 3 + .../selftests/kvm/pre_fault_memory_test.c | 82 +++++++++++++++---- 3 files changed, 72 insertions(+), 16 deletions(-) base-commit: e59ae112a7c2e0f58d9f5905de299fe206f84af0 -- 2.43.2