On Wed, Sep 03, 2025 at 08:18:10AM +0800, Edgecombe, Rick P wrote:
> On Tue, 2025-09-02 at 10:04 -0700, Sean Christopherson wrote:
> > On Tue, Sep 02, 2025, Yan Zhao wrote:
> > > But while writing another concurrency test, I found some sad news:
> > >
> > > SEAMCALL TDH_VP_INIT requires holding an exclusive lock on the TDR resource
> > > when its leaf_opcode.version > 0. So, when I use v1 (which is the current
> > > value in upstream, for x2apic?) to test executing the KVM_TDX_INIT_VCPU ioctl
> > > on different vCPUs concurrently, the TDX_BUG_ON() following tdh_vp_init()
> > > prints the error "SEAMCALL TDH_VP_INIT failed: 0x8000020000000080".
> > >
> > > If I switch to using the v0 version of TDH_VP_INIT, the contention is gone.
> >
> > Uh, so that's exactly the type of breaking ABI change that isn't acceptable.  If
> > it's really truly necessary, then we can probably handle the change in KVM
> > since TDX is so new, but generally speaking such changes simply must not happen.
> >
> > > Note: this acquisition of the exclusive lock was not previously present in the
> > > public repo https://github.com/intel/tdx-module.git, branch tdx_1.5.
> > > (The branch has been force-updated to the new implementation now.)
> >
> > Lovely.
>
> Hmm, this is exactly the kind of TDX module change we were just discussing
> reporting as a bug. Not clear on the timing of the change as far as landing
> upstream. We could investigate whether we could fix it in the TDX module. This
> probably falls into the category of not actually regressing any userspace. But
> it does trigger a kernel warning, so it warrants a fix, hmm.
>
> > > > Acquire kvm->lock to prevent VM-wide things from happening, slots_lock to
> > > > prevent kvm_mmu_zap_all_fast(), and _all_ vCPU mutexes to prevent vCPUs
> > > > from interfering.
> > > Nit: we need not worry about kvm_mmu_zap_all_fast(), since it only zaps
> > > !mirror roots. The slots_lock should be for slots deletion.
> >
> > Oof, I missed that.  We should have required nx_huge_pages=never for tdx=1.
> > Probably too late for that now though :-/
> >
> > > > Doing that for a vCPU ioctl is a bit awkward, but not awful.  E.g. we can abuse
> > > > kvm_arch_vcpu_async_ioctl().  In hindsight, a more clever approach would have
> > > > been to make KVM_TDX_INIT_MEM_REGION a VM-scoped ioctl that takes a vCPU fd.
> > > > Oh well.
> > > >
> > > > Anyways, I think we need to avoid the "synchronous" ioctl path anyways, because
> > > > taking kvm->slots_lock inside vcpu->mutex is gross.  AFAICT it's not actively
> > > > problematic today, but it feels like a deadlock waiting to happen.
> > > Note: It looks like kvm_inhibit_apic_access_page() also takes kvm->slots_lock
> > > inside vcpu->mutex.
> >
> > Yikes.  As does kvm_alloc_apic_access_page(), which is likely why I thought it
> > was ok to take slots_lock.  But while kvm_alloc_apic_access_page() appears to be
> > called with vCPU scope, it's actually called from VM scope during vCPU creation.
> >
> > I'll chew on this, though if someone has any ideas...
> >
> > > So, do we need to move KVM_TDX_INIT_VCPU to tdx_vcpu_async_ioctl() as well?
> >
> > If it's _just_ INIT_VCPU that can race (assuming the VM-scoped state transitions
> > take all vcpu->mutex locks, as proposed), then a dedicated mutex (spinlock?)
> > would suffice, and probably would be preferable.  If INIT_VCPU needs to take
> > kvm->lock to protect against other races, then I guess the big hammer approach
> > could work?

We need the big hammer approach, as INIT_VCPU may race with vcpu_load() in
other vCPU ioctls.

> A duplicate TDR lock inside KVM, or maybe even on the arch/x86 side, would make
> the reasoning easier to follow. Like, you don't need to remember "we take
> slots_lock/kvm_lock because of the TDR lock", it's just 1:1. I hate the idea of
> adding more locks, and have argued against it in the past. But are we just
> fooling ourselves though? There are already more locks.
> Another reason to duplicate (some) locks is that it gives the scheduler more
> hints as far as waking up waiters, etc. The TDX module needs these locks to
> protect itself, so those are required. But when we just do retry loops (or let
> userspace do this), then we lose out on all of the locking goodness in the
> kernel.
>
> Anyway, just a strawman. I don't have any great ideas.

Do you think the following fix is good? It moves INIT_VCPU to
tdx_vcpu_async_ioctl() and uses the big hammer to make it impossible to
contend with other SEAMCALLs, just as for tdh_mr_extend(). It passed my local
concurrency test.

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 99381c8b4108..8a6f2feaab41 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -3047,16 +3047,22 @@ static int tdx_vcpu_get_cpuid(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
 
 static int tdx_vcpu_init(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
 {
-	u64 apic_base;
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
+	u64 apic_base;
 	int ret;
 
+	CLASS(tdx_vm_state_guard, guard)(vcpu->kvm);
+	if (IS_ERR(guard))
+		return PTR_ERR(guard);
+
 	if (cmd->flags)
 		return -EINVAL;
 
-	if (tdx->state != VCPU_TD_STATE_UNINITIALIZED)
+	if (!is_hkid_assigned(kvm_tdx) || tdx->state != VCPU_TD_STATE_UNINITIALIZED)
 		return -EINVAL;
 
+	vcpu_load(vcpu);
 	/*
 	 * TDX requires X2APIC, userspace is responsible for configuring guest
 	 * CPUID accordingly.
@@ -3075,6 +3081,7 @@ static int tdx_vcpu_init(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
 		td_vmcs_setbit32(tdx, PIN_BASED_VM_EXEC_CONTROL, PIN_BASED_POSTED_INTR);
 
 	tdx->state = VCPU_TD_STATE_INITIALIZED;
+	vcpu_put(vcpu);
 
 	return 0;
 }
@@ -3228,10 +3235,18 @@ int tdx_vcpu_async_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
 	if (r)
 		return r;
 
-	if (cmd.id != KVM_TDX_INIT_MEM_REGION)
-		return -ENOIOCTLCMD;
-
-	return tdx_vcpu_init_mem_region(vcpu, &cmd);
+	switch (cmd.id) {
+	case KVM_TDX_INIT_MEM_REGION:
+		r = tdx_vcpu_init_mem_region(vcpu, &cmd);
+		break;
+	case KVM_TDX_INIT_VCPU:
+		r = tdx_vcpu_init(vcpu, &cmd);
+		break;
+	default:
+		r = -ENOIOCTLCMD;
+		break;
+	}
+	return r;
 }
 
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
@@ -3248,9 +3263,6 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
 		return ret;
 
 	switch (cmd.id) {
-	case KVM_TDX_INIT_VCPU:
-		ret = tdx_vcpu_init(vcpu, &cmd);
-		break;
 	case KVM_TDX_GET_CPUID:
 		ret = tdx_vcpu_get_cpuid(vcpu, &cmd);
 		break;

Besides, to unblock testing the above code, I fixed a bug related to
vcpu_load() in the current TDX upstream code. The fixup patch is attached
below. Sean, please let me know if you want to include it in this series.
(It still lacks a Fixes tag, as I haven't found which commit is the best fit.)

>From 0d1ba6d60315e34bdb0e54acceb6e8dd0fbdb262 Mon Sep 17 00:00:00 2001
From: Yan Zhao <yan.y.zhao@xxxxxxxxx>
Date: Tue, 2 Sep 2025 18:31:27 -0700
Subject: [PATCH 1/2] KVM: TDX: Fix list_add corruption during vcpu_load()

During vCPU creation, a vCPU may be destroyed immediately after
kvm_arch_vcpu_create() (e.g., due to a vCPU id conflict). However, the
vcpu_load() inside kvm_arch_vcpu_create() may have associated the vCPU with a
pCPU via "list_add(&tdx->cpu_list, &per_cpu(associated_tdvcpus, cpu))" before
tdx_vcpu_free() is invoked.
Though there's no need to invoke tdh_vp_flush() on the vCPU, failing to
dissociate the vCPU from the pCPU (i.e., "list_del(&to_tdx(vcpu)->cpu_list)")
corrupts the per-pCPU list associated_tdvcpus. A later list_add() during
vcpu_load() then detects the list corruption and prints a call trace as shown
below.

Dissociate a vCPU from its associated pCPU in tdx_vcpu_free() for vCPUs
destroyed immediately after creation, which must be in the
VCPU_TD_STATE_UNINITIALIZED state.

  kernel BUG at lib/list_debug.c:29!
  Oops: invalid opcode: 0000 [#2] SMP NOPTI
  RIP: 0010:__list_add_valid_or_report+0x82/0xd0
  Call Trace:
   <TASK>
   tdx_vcpu_load+0xa8/0x120
   vt_vcpu_load+0x25/0x30
   kvm_arch_vcpu_load+0x81/0x300
   vcpu_load+0x55/0x90
   kvm_arch_vcpu_create+0x24f/0x330
   kvm_vm_ioctl_create_vcpu+0x1b1/0x53
   ? trace_lock_release+0x6d/0xb0
   kvm_vm_ioctl+0xc2/0xa60
   ? tty_ldisc_deref+0x16/0x20
   ? debug_smp_processor_id+0x17/0x20
   ? __fget_files+0xc2/0x1b0
   ? debug_smp_processor_id+0x17/0x20
   ? rcu_is_watching+0x13/0x70
   ? __fget_files+0xc2/0x1b0
   ? trace_lock_release+0x6d/0xb0
   ? lock_release+0x14/0xd0
   ? __fget_files+0xcc/0x1b0
   __x64_sys_ioctl+0x9a/0xf0
   ? rcu_is_watching+0x13/0x70
   x64_sys_call+0x10ee/0x20d0
   do_syscall_64+0xc3/0x470
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

Signed-off-by: Yan Zhao <yan.y.zhao@xxxxxxxxx>
---
 arch/x86/kvm/vmx/tdx.c | 42 +++++++++++++++++++++++++++++++++++++-----
 1 file changed, 37 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index e99d07611393..99381c8b4108 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -837,19 +837,51 @@ void tdx_vcpu_put(struct kvm_vcpu *vcpu)
 	tdx_prepare_switch_to_host(vcpu);
 }
 
+/*
+ * Life cycles for a TD and a vCPU:
+ * 1. KVM_CREATE_VM ioctl.
+ *    TD state is TD_STATE_UNINITIALIZED.
+ *    hkid is not assigned at this stage.
+ * 2. KVM_TDX_INIT_VM ioctl.
+ *    TD transitions to TD_STATE_INITIALIZED.
+ *    hkid is assigned after this stage.
+ * 3. KVM_CREATE_VCPU ioctl (only when TD is TD_STATE_INITIALIZED).
+ *    3.1 tdx_vcpu_create() transitions vCPU state to VCPU_TD_STATE_UNINITIALIZED.
+ *    3.2 vcpu_load() and vcpu_put() in kvm_arch_vcpu_create().
+ *    3.3 (conditional) if any error is encountered after kvm_arch_vcpu_create(),
+ *        kvm_arch_vcpu_destroy() --> tdx_vcpu_free().
+ * 4. KVM_TDX_INIT_VCPU ioctl.
+ *    tdx_vcpu_init() transitions vCPU state to VCPU_TD_STATE_INITIALIZED.
+ *    vCPU control structures are allocated at this stage.
+ * 5. kvm_destroy_vm().
+ *    5.1 tdx_mmu_release_hkid(): (1) tdh_vp_flush(), dissociates all vCPUs.
+ *        (2) puts hkid to !assigned state.
+ *    5.2 kvm_destroy_vcpus() --> tdx_vcpu_free():
+ *        transitions vCPU to VCPU_TD_STATE_UNINITIALIZED state.
+ *    5.3 tdx_vm_destroy():
+ *        transitions TD to TD_STATE_UNINITIALIZED state.
+ *
+ * tdx_vcpu_free() can be invoked only at 3.3 or 5.2.
+ * - If at 3.3, hkid is still assigned, but the vCPU must be in
+ *   VCPU_TD_STATE_UNINITIALIZED state.
+ * - If at 5.2, hkid must be !assigned, and all vCPUs must be in
+ *   VCPU_TD_STATE_INITIALIZED state and have been dissociated.
+ */
 void tdx_vcpu_free(struct kvm_vcpu *vcpu)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
 	int i;
 
+	if (vcpu->cpu != -1) {
+		KVM_BUG_ON(tdx->state == VCPU_TD_STATE_INITIALIZED, vcpu->kvm);
+		tdx_disassociate_vp(vcpu);
+		return;
+	}
 	/*
 	 * It is not possible to reclaim pages while hkid is assigned. It might
-	 * be assigned if:
-	 * 1. the TD VM is being destroyed but freeing hkid failed, in which
-	 *    case the pages are leaked
-	 * 2. TD VCPU creation failed and this on the error path, in which case
-	 *    there is nothing to do anyway
+	 * be assigned if the TD VM is being destroyed but freeing hkid failed,
+	 * in which case the pages are leaked.
 	 */
 	if (is_hkid_assigned(kvm_tdx))
 		return;
-- 
2.43.0