On 01/08/2025 19:44, Sean Christopherson wrote:
> +Chao
> 
> On Fri, Aug 01, 2025, Adrian Hunter wrote:
>> On 29/07/2025 22:33, Sean Christopherson wrote:
>>> +static int tdx_terminate_vm(struct kvm *kvm)
>>> +{
>>> +	if (kvm_trylock_all_vcpus(kvm))
>>> +		return -EBUSY;
>>> +
>>> +	kvm_vm_dead(kvm);
>>> +	to_kvm_tdx(kvm)->vm_terminated = true;
>>> +
>>> +	kvm_unlock_all_vcpus(kvm);
>>> +
>>> +	tdx_mmu_release_hkid(kvm);
>>> +
>>> +	return 0;
>>> +}
>>
>> As I think I mentioned when removing vm_dead first came up, I think
>> we need more checks.  I spent some time going through the code and
>> came up with what is below.
>>
>> First, we need to prevent TDX vCPU sub-IOCTLs from racing with
>> tdx_mmu_release_hkid().  But having any TDX sub-IOCTL run after
>> KVM_TDX_TERMINATE_VM raises questions of what might happen, so it is
>> much simpler to understand if that is not possible at all.  There
>> are 3 options:
>>
>> 1. Require that KVM_TDX_TERMINATE_VM is valid only if
>>    kvm_tdx->state == TD_STATE_RUNNABLE.  Since currently all the TDX
>>    sub-IOCTLs are for initialization, that would block the
>>    opportunity for any to run after KVM_TDX_TERMINATE_VM.
>>
>> 2. Check vm_terminated in tdx_vm_ioctl() and tdx_vcpu_ioctl()
>>
>> 3. Test KVM_REQ_VM_DEAD in tdx_vm_ioctl() and tdx_vcpu_ioctl()
>>
>> [ Note: cannot check is_hkid_assigned() because that is racy ]
>>
>> Secondly, I suggest we avoid SEAMCALLs that will fail and result in
>> KVM_BUG_ON() if the HKID has been released.
>>
>> There are 2 groups of those: MMU-related and TDVPS_ACCESSORS.
>>
>> For the MMU-related, the following 2 functions should return an
>> error immediately if vm_terminated:
>>
>> 	tdx_sept_link_private_spt()
>> 	tdx_sept_set_private_spte()
>>
>> For that not to be racy, extra synchronization is needed so that
>> vm_terminated can be reliably checked while holding mmu_lock, i.e.
>>
>> static int tdx_terminate_vm(struct kvm *kvm)
>> {
>> 	if (kvm_trylock_all_vcpus(kvm))
>> 		return -EBUSY;
>>
>> 	kvm_vm_dead(kvm);
>> +
>> +	write_lock(&kvm->mmu_lock);
>> 	to_kvm_tdx(kvm)->vm_terminated = true;
>> +	write_unlock(&kvm->mmu_lock);
>>
>> 	kvm_unlock_all_vcpus(kvm);
>>
>> 	tdx_mmu_release_hkid(kvm);
>>
>> 	return 0;
>> }
>>
>> Finally, there are 2 TDVPS_ACCESSORS that need avoiding:
>>
>> 	tdx_load_mmu_pgd()
>> 		skip td_vmcs_write64() if vm_terminated
>>
>> 	tdx_protected_apic_has_interrupt()
>> 		skip td_state_non_arch_read64() if vm_terminated
>
> Oof.  And as Chao pointed out[*], removing the vm_dead check would
> allow creating and running vCPUs in a dead VM, which is most
> definitely not desirable.  Squashing the vCPU creation case is easy
> enough if we keep vm_dead but still generally allow ioctls, and it's
> probably worth doing that no matter what (to plug the hole where
> pending vCPU creations could succeed):
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d477a7fda0ae..941d2c32b7dc 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4207,6 +4207,11 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, unsigned long id)
>  
>  	mutex_lock(&kvm->lock);
>  
> +	if (kvm->vm_dead) {
> +		r = -EIO;
> +		goto unlock_vcpu_destroy;
> +	}
> +
>  	if (kvm_get_vcpu_by_id(kvm, id)) {
>  		r = -EEXIST;
>  		goto unlock_vcpu_destroy;
>
> And then to ensure vCPUs can't do anything, check KVM_REQ_VM_DEAD
> after acquiring vcpu->mutex.
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 6c07dd423458..883077eee4ce 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4433,6 +4433,12 @@ static long kvm_vcpu_ioctl(struct file *filp,
>  
>  	if (mutex_lock_killable(&vcpu->mutex))
>  		return -EINTR;
> +
> +	if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu)) {
> +		r = -EIO;
> +		goto out;
> +	}
> +
>  	switch (ioctl) {
>  	case KVM_RUN: {
>  		struct pid *oldpid;
>
> That should address all TDVPS paths (I hope), and I _think_ would
> address all MMU-related paths as well?  E.g. prefault requires a
> vCPU.
>
> Disallowing (most) vCPU ioctls but not all VM ioctls on vm_dead isn't
> great ABI (understatement), but I think we need/want the above
> changes even if we keep the general vm_dead restriction.  And given
> the extremely ad hoc behavior of taking kvm->lock for VM ioctls,
> trying to enforce vm_dead for "all" VM ioctls seems like a fool's
> errand.
>
> So I'm leaning toward keeping "KVM: Reject ioctls only if the VM is
> bugged, not simply marked dead" (with a different shortlog+changelog),
> but keeping vm_dead (and not introducing kvm_tdx.vm_terminated).
>
> Thoughts?

That covers the cases I listed, so it is fine by me.