On Tue, 2025-09-02 at 10:04 -0700, Sean Christopherson wrote:
> On Tue, Sep 02, 2025, Yan Zhao wrote:
> > But during writing another concurrency test, I found a sad news :
> >
> > SEAMCALL TDH_VP_INIT requires to hold exclusive lock for resource TDR when its
> > leaf_opcode.version > 0. So, when I use v1 (which is the current value in
> > upstream, for x2apic?) to test executing ioctl KVM_TDX_INIT_VCPU on different
> > vCPUs concurrently, the TDX_BUG_ON() following tdh_vp_init() will print error
> > "SEAMCALL TDH_VP_INIT failed: 0x8000020000000080".
> >
> > If I switch to using v0 version of TDH_VP_INIT, the contention will be gone.
>
> Uh, so that's exactly the type of breaking ABI change that isn't acceptable. If
> it's really truly necessary, then we can can probably handle the change in KVM
> since TDX is so new, but generally speaking such changes simply must not happen.
>
> > Note: this acquiring of exclusive lock was not previously present in the public
> > repo https://github.com/intel/tdx-module.git, branch tdx_1.5.
> > (The branch has been force-updated to new implementation now).
>
> Lovely.

Hmm, this is exactly the kind of TDX module change we were just discussing
reporting as a bug. It's not clear how the timing of the change lines up with
what landed upstream. We could investigate whether we could fix it in the TDX
module. This probably falls into the category of not actually regressing any
userspace, but it does trigger a kernel warning, so it warrants a fix. Hmm.

>
> > > Acquire kvm->lock to prevent VM-wide things from happening, slots_lock to prevent
> > > kvm_mmu_zap_all_fast(), and _all_ vCPU mutexes to prevent vCPUs from interefering.
> > Nit: we should have no worry to kvm_mmu_zap_all_fast(), since it only zaps
> > !mirror roots. The slots_lock should be for slots deletion.
>
> Oof, I missed that. We should have required nx_huge_pages=never for tdx=1.
> Probably too late for that now though :-/
>
> > > Doing that for a vCPU ioctl is a bit awkward, but not awful. E.g. we can abuse
> > > kvm_arch_vcpu_async_ioctl(). In hindsight, a more clever approach would have
> > > been to make KVM_TDX_INIT_MEM_REGION a VM-scoped ioctl that takes a vCPU fd. Oh
> > > well.
> > >
> > > Anyways, I think we need to avoid the "synchronous" ioctl path anyways, because
> > > taking kvm->slots_lock inside vcpu->mutex is gross. AFAICT it's not actively
> > > problematic today, but it feels like a deadlock waiting to happen.
> > Note: Looks kvm_inhibit_apic_access_page() also takes kvm->slots_lock inside
> > vcpu->mutex.
>
> Yikes. As does kvm_alloc_apic_access_page(), which is likely why I thought it
> was ok to take slots_lock. But while kvm_alloc_apic_access_page() appears to be
> called with vCPU scope, it's actually called from VM scope during vCPU creation.
>
> I'll chew on this, though if someone has any ideas...
>
> > So, do we need to move KVM_TDX_INIT_VCPU to tdx_vcpu_async_ioctl() as well?
>
> If it's _just_ INIT_VCPU that can race (assuming the VM-scoped state transtitions
> take all vcpu->mutex locks, as proposed), then a dedicated mutex (spinlock?) would
> suffice, and probably would be preferable. If INIT_VCPU needs to take kvm->lock
> to protect against other races, then I guess the big hammer approach could work?

A duplicate TDR lock inside KVM, or maybe even on the arch/x86 side, would make
the reasoning easier to follow. You don't need to remember "we take
slots_lock/kvm->lock because of the TDR lock"; it's just 1:1.
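
Something like the below is what I have in mind. Completely untested, and the
names (kvm_tdx::tdr_lock, tdx_vcpu_init_locked(), the exact tdh_vp_init()
arguments) are just placeholders for whatever the real code wants, but it
shows the 1:1 idea: KVM holds a mutex that shadows the TDX module's exclusive
TDR lock across the SEAMCALL, so concurrent KVM_TDX_INIT_VCPU callers
serialize in the kernel instead of one of them eating TDX_OPERAND_BUSY.

struct kvm_tdx {
	struct kvm kvm;
	/* ... existing per-TD state ... */

	/* Shadows the TDX module's exclusive lock on the TDR page. */
	struct mutex tdr_lock;
};

static int tdx_vcpu_init_locked(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
{
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
	u64 err;

	/*
	 * TDH.VP.INIT with version > 0 takes the TDR lock exclusively
	 * inside the TDX module, so serialize on the shadow mutex rather
	 * than letting a second vCPU trip the TDX_BUG_ON() on BUSY.
	 */
	mutex_lock(&kvm_tdx->tdr_lock);
	err = tdh_vp_init(&to_tdx(vcpu)->vp, vcpu_rcx, vcpu->vcpu_id);
	mutex_unlock(&kvm_tdx->tdr_lock);

	return err ? -EIO : 0;
}

The nice part is that lockdep then sees the TDR dependency directly, instead
of it hiding behind whatever slots_lock/kvm->lock we happen to hold.
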
I hate the idea of adding more locks, and have argued against it in the past.
But are we just fooling ourselves? There already are more locks; they just
live inside the TDX module. Another reason to duplicate (some of) them is
that it gives the scheduler more hints for waking up waiters, etc. The TDX
module needs its locks to protect itself, so those are required regardless.
But when we just do retry loops (or let userspace do them), we lose out on
all of the locking goodness in the kernel. Anyway, just a strawman. I don't
have any great ideas.
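
To make the "locking goodness" point concrete, the retry-loop flavor ends up
looking roughly like the below (again, the helper/constant names such as
tdx_operand_busy() and seamcall_ret() are stand-ins, not necessarily what's
in the tree): the loser spins until the TDX module's internal lock happens to
free up, so there's no sleeping waiter for the scheduler to wake, no
fairness, and nothing for lockdep to validate against
kvm->lock/slots_lock/vcpu->mutex.

static u64 tdx_seamcall_no_lock(struct tdx_module_args *args)
{
	u64 err;

	/*
	 * Contention on the TDX module's TDR lock surfaces as
	 * TDX_OPERAND_BUSY and all we can do is retry: no sleeping,
	 * no wakeups, no lockdep coverage.
	 */
	do {
		err = seamcall_ret(TDH_VP_INIT, args);
	} while (tdx_operand_busy(err));

	return err;
}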