Hi Sean,

On 5/19/25 6:33 AM, Sean Christopherson wrote:
> On Mon, May 19, 2025, Yan Zhao wrote:
>> Introduce a new return value RET_PF_RETRY_INVALID_SLOT to inform callers of
>> kvm_mmu_do_page_fault() that a fault retry is due to an invalid memslot.
>> This helps prevent deadlocks when a memslot is removed during pre-faulting
>> of GPAs in the memslot, or during TDX's local retry of faulting private
>> pages.
>>
>> Take pre-faulting as an example.
>>
>> During the ioctl KVM_PRE_FAULT_MEMORY, kvm->srcu is acquired around the
>> pre-faulting of the entire range. For x86, kvm_arch_vcpu_pre_fault_memory()
>> further invokes kvm_tdp_map_page(), which retries kvm_mmu_do_page_fault()
>> if the return value is RET_PF_RETRY.
>>
>> If a memslot is deleted during the ioctl KVM_PRE_FAULT_MEMORY, after
>> kvm_invalidate_memslot() marks a slot as invalid and makes it visible via
>> rcu_assign_pointer() in kvm_swap_active_memslots(), kvm_mmu_do_page_fault()
>> may encounter an invalid slot and return RET_PF_RETRY. Consequently,
>> kvm_tdp_map_page() will retry without releasing the srcu lock. Meanwhile,
>> synchronize_srcu_expedited() in kvm_swap_active_memslots() is blocked,
>> waiting for kvm_vcpu_pre_fault_memory() to release the srcu lock, leading
>> to a deadlock.
>
> Probably worth calling out that KVM will respond to signals, i.e. there's
> no risk to the host kernel.
>
>> "slot deleting" thread                      "prefault" thread
>> -----------------------------               ----------------------
>>                                             srcu_read_lock();
>> (A)
>> invalid_slot->flags |= KVM_MEMSLOT_INVALID;
>> rcu_assign_pointer();
>>                                             kvm_tdp_map_page();
>>                                             (B)
>>                                             do {
>>                                                 r = kvm_mmu_do_page_fault();
>>
>> (C) synchronize_srcu_expedited();
>>
>>                                             } while (r == RET_PF_RETRY);
>>
>>                                             (D) srcu_read_unlock();
>>
>> As shown in the diagram, (C) is waiting for (D). However, (B) continuously
>> finds an invalid slot before (C) completes, causing (B) to retry and
>> preventing (D) from being invoked.
>>
>> The local retry code in TDX's EPT violation handler faces a similar issue,
>> where a deadlock can occur when faulting a private GFN in a slot that is
>> concurrently being removed.
>>
>> To resolve the deadlock, introduce a new return value
>> RET_PF_RETRY_INVALID_SLOT and modify kvm_mmu_do_page_fault() to return
>> RET_PF_RETRY_INVALID_SLOT instead of RET_PF_RETRY when encountering an
>> invalid memslot. This prevents endless retries in kvm_tdp_map_page() or
>> tdx_handle_ept_violation(), allowing the srcu lock to be released and
>> slot removal to proceed.
>>
>> As all callers of kvm_tdp_map_page(), i.e.,
>> kvm_arch_vcpu_pre_fault_memory() and tdx_gmem_post_populate(), are in the
>> pre-fault path, treat RET_PF_RETRY_INVALID_SLOT the same as RET_PF_EMULATE
>> and return -ENOENT from kvm_tdp_map_page(), so that userspace is made
>> aware of the slot removal.
>
> Userspace should already be "aware" of the slot removal.
>
>> Returning RET_PF_RETRY_INVALID_SLOT from kvm_mmu_do_page_fault() does not
>> affect kvm_mmu_page_fault() and kvm_arch_async_page_ready(), as their
>> callers either only check whether the return value is > 0 to re-enter the
>> vCPU for retry, or do not check the return value at all.
>>
>> Reported-by: Reinette Chatre <reinette.chatre@xxxxxxxxx>
>
> Was this hit by a real VMM? If so, why is a TDX VMM removing a memslot
> without kicking vCPUs out of KVM?

No, this was not hit by a real VMM. It was hit by a TDX MMU stress test
(built on top of [1]) that is still under development.

Reinette

[1] https://lore.kernel.org/lkml/20250414214801.2693294-1-sagis@xxxxxxxxxx/
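
P.S. For anyone skimming the thread, here is a minimal sketch of how I read
the proposed kvm_tdp_map_page() behavior from the commit message. This is a
paraphrase, not the actual hunks: RET_PF_RETRY_INVALID_SLOT is the name from
the patch, while the surrounding loop and switch follow the existing upstream
code as I understand it and may differ in detail:

	do {
		if (signal_pending(current))
			return -EINTR;	/* KVM bails on signals, per Sean's note */

		cond_resched();
		r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true,
					  NULL, level);
		/* An invalid slot no longer folds into RET_PF_RETRY... */
	} while (r == RET_PF_RETRY);

	switch (r) {
	case RET_PF_FIXED:
	case RET_PF_SPURIOUS:
		return 0;
	case RET_PF_EMULATE:
	case RET_PF_RETRY_INVALID_SLOT:	/* ...so the loop exits here, */
		return -ENOENT;		/* the caller drops srcu, and */
	default:			/* (C) can make progress.     */
		return -EIO;
	}

That is, with the patch the prefault loop can no longer spin on an invalid
slot while holding srcu, which is exactly what the (B)/(C) deadlock in the
diagram required.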