Hi Sean,

On 5/19/25 6:33 AM, Sean Christopherson wrote:
> On Mon, May 19, 2025, Yan Zhao wrote:
>> Introduce a new return value RET_PF_RETRY_INVALID_SLOT to inform callers of
>> kvm_mmu_do_page_fault() that a fault retry is due to an invalid memslot.
>> This helps prevent deadlocks when a memslot is removed during pre-faulting
>> of GPAs in the memslot, or during TDX's local retry of faulting private
>> pages.
>>
>> Take pre-faulting as an example.
>>
>> During the ioctl KVM_PRE_FAULT_MEMORY, kvm->srcu is acquired around the
>> pre-faulting of the entire range. For x86, kvm_arch_vcpu_pre_fault_memory()
>> further invokes kvm_tdp_map_page(), which retries kvm_mmu_do_page_fault()
>> if the return value is RET_PF_RETRY.
>>
>> If a memslot is deleted during the ioctl KVM_PRE_FAULT_MEMORY, after
>> kvm_invalidate_memslot() marks a slot as invalid and makes it visible via
>> rcu_assign_pointer() in kvm_swap_active_memslots(), kvm_mmu_do_page_fault()
>> may encounter an invalid slot and return RET_PF_RETRY. Consequently,
>> kvm_tdp_map_page() will retry without releasing the srcu lock. Meanwhile,
>> synchronize_srcu_expedited() in kvm_swap_active_memslots() is blocked,
>> waiting for kvm_vcpu_pre_fault_memory() to release the srcu lock, leading
>> to a deadlock.
>
> Probably worth calling out that KVM will respond to signals, i.e. there's
> no risk to the host kernel.
>
>> "slot deleting" thread                      "prefault" thread
>> -----------------------------               ----------------------
>>                                             srcu_read_lock();
>> (A)
>> invalid_slot->flags |= KVM_MEMSLOT_INVALID;
>> rcu_assign_pointer();
>>                                             kvm_tdp_map_page();
>>                                             (B)
>>                                             do {
>>                                                 r = kvm_mmu_do_page_fault();
>>
>> (C) synchronize_srcu_expedited();
>>
>>                                             } while (r == RET_PF_RETRY);
>>
>>                                             (D) srcu_read_unlock();
>>
>> As shown in the diagram, (C) is waiting for (D). However, (B) continuously
>> finds an invalid slot before (C) completes, causing (B) to retry and
>> preventing (D) from being invoked.
>>
>> The local retry code in TDX's EPT violation handler faces a similar issue,
>> where a deadlock can occur when faulting a private GFN in a slot that is
>> concurrently being removed.
>>
>> To resolve the deadlock, introduce a new return value
>> RET_PF_RETRY_INVALID_SLOT and modify kvm_mmu_do_page_fault() to return
>> RET_PF_RETRY_INVALID_SLOT instead of RET_PF_RETRY when encountering an
>> invalid memslot. This prevents endless retries in kvm_tdp_map_page() or
>> tdx_handle_ept_violation(), allowing the srcu lock to be released and
>> slot removal to proceed.
>>
>> As all callers of kvm_tdp_map_page(), i.e.,
>> kvm_arch_vcpu_pre_fault_memory() and tdx_gmem_post_populate(), are in the
>> pre-fault path, treat RET_PF_RETRY_INVALID_SLOT the same as RET_PF_EMULATE
>> and return -ENOENT from kvm_tdp_map_page(), so that userspace is made
>> aware of the slot removal.
>
> Userspace should already be "aware" of the slot removal.
>
>> Returning RET_PF_RETRY_INVALID_SLOT from kvm_mmu_do_page_fault() does not
>> affect kvm_mmu_page_fault() and kvm_arch_async_page_ready(), as their
>> callers either only check whether the return value is > 0 to re-enter the
>> vCPU for retry, or do not check the return value at all.
>>
>> Reported-by: Reinette Chatre <reinette.chatre@xxxxxxxxx>
>
> Was this hit by a real VMM? If so, why is a TDX VMM removing a memslot
> without kicking vCPUs out of KVM?

No, this was not hit by a real VMM. It was hit by a TDX MMU stress test
(built on top of [1]) that is still under development.

Reinette

[1] https://lore.kernel.org/lkml/20250414214801.2693294-1-sagis@xxxxxxxxxx/
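
P.S. For anyone skimming the thread, here is a minimal sketch of how I read
the proposed kvm_tdp_map_page() behavior from the commit message. This is a
paraphrase, not the actual hunks: RET_PF_RETRY_INVALID_SLOT is the name from
the patch, while the surrounding loop and switch follow the existing upstream
code as I understand it and may differ in detail:

	do {
		if (signal_pending(current))
			return -EINTR;	/* KVM bails on signals, per Sean's note */

		cond_resched();
		r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true,
					  NULL, level);
		/* An invalid slot no longer folds into RET_PF_RETRY... */
	} while (r == RET_PF_RETRY);

	switch (r) {
	case RET_PF_FIXED:
	case RET_PF_SPURIOUS:
		return 0;
	case RET_PF_EMULATE:
	case RET_PF_RETRY_INVALID_SLOT:	/* ...so the loop exits here, */
		return -ENOENT;		/* the caller drops srcu, and */
	default:			/* (C) can make progress.     */
		return -EIO;
	}

That is, with the patch the prefault loop can no longer spin on an invalid
slot while holding srcu, which is exactly what the (B)/(C) deadlock in the
diagram required.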