On 01/08/2025 19:44, Sean Christopherson wrote:
> +Chao
> 
> On Fri, Aug 01, 2025, Adrian Hunter wrote:
>> On 29/07/2025 22:33, Sean Christopherson wrote:
>>> +static int tdx_terminate_vm(struct kvm *kvm)
>>> +{
>>> +	if (kvm_trylock_all_vcpus(kvm))
>>> +		return -EBUSY;
>>> +
>>> +	kvm_vm_dead(kvm);
>>> +	to_kvm_tdx(kvm)->vm_terminated = true;
>>> +
>>> +	kvm_unlock_all_vcpus(kvm);
>>> +
>>> +	tdx_mmu_release_hkid(kvm);
>>> +
>>> +	return 0;
>>> +}
>>
>> As I think I mentioned when removing vm_dead first came up, I think
>> we need more checks.  I spent some time going through the code and
>> came up with what is below.
>>
>> First, we need to prevent TDX vCPU sub-IOCTLs from racing with
>> tdx_mmu_release_hkid().  But having any TDX sub-IOCTL run after
>> KVM_TDX_TERMINATE_VM raises questions of what might happen, so it is
>> much simpler to understand if that is not possible at all.  There
>> are 3 options:
>>
>> 1. Require that KVM_TDX_TERMINATE_VM is valid only if
>>    kvm_tdx->state == TD_STATE_RUNNABLE.  Since currently all the TDX
>>    sub-IOCTLs are for initialization, that would block the
>>    opportunity for any to run after KVM_TDX_TERMINATE_VM.
>>
>> 2. Check vm_terminated in tdx_vm_ioctl() and tdx_vcpu_ioctl()
>>
>> 3. Test KVM_REQ_VM_DEAD in tdx_vm_ioctl() and tdx_vcpu_ioctl()
>>
>> [ Note: cannot check is_hkid_assigned() because that is racy ]
>>
>> Secondly, I suggest we avoid SEAMCALLs that will fail and result in
>> KVM_BUG_ON() if the HKID has been released.
>>
>> There are 2 groups of those: MMU-related and TDVPS_ACCESSORS.
>>
>> For the MMU-related, the following 2 functions should return an
>> error immediately if vm_terminated:
>>
>> 	tdx_sept_link_private_spt()
>> 	tdx_sept_set_private_spte()
>>
>> For that not to be racy, extra synchronization is needed so that
>> vm_terminated can be reliably checked while holding mmu_lock, i.e.
>>
>> static int tdx_terminate_vm(struct kvm *kvm)
>> {
>> 	if (kvm_trylock_all_vcpus(kvm))
>> 		return -EBUSY;
>>
>> 	kvm_vm_dead(kvm);
>> +
>> +	write_lock(&kvm->mmu_lock);
>> 	to_kvm_tdx(kvm)->vm_terminated = true;
>> +	write_unlock(&kvm->mmu_lock);
>>
>> 	kvm_unlock_all_vcpus(kvm);
>>
>> 	tdx_mmu_release_hkid(kvm);
>>
>> 	return 0;
>> }
>>
>> Finally, there are 2 TDVPS_ACCESSORS that need avoiding:
>>
>> 	tdx_load_mmu_pgd()
>> 		skip td_vmcs_write64() if vm_terminated
>>
>> 	tdx_protected_apic_has_interrupt()
>> 		skip td_state_non_arch_read64() if vm_terminated
>
> Oof.  And as Chao pointed out[*], removing the vm_dead check would
> allow creating and running vCPUs in a dead VM, which is most
> definitely not desirable.  Squashing the vCPU creation case is easy
> enough if we keep vm_dead but still generally allow ioctls, and it's
> probably worth doing that no matter what (to plug the hole where
> pending vCPU creations could succeed):
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d477a7fda0ae..941d2c32b7dc 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4207,6 +4207,11 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, unsigned long id)
>  
>  	mutex_lock(&kvm->lock);
>  
> +	if (kvm->vm_dead) {
> +		r = -EIO;
> +		goto unlock_vcpu_destroy;
> +	}
> +
>  	if (kvm_get_vcpu_by_id(kvm, id)) {
>  		r = -EEXIST;
>  		goto unlock_vcpu_destroy;
>
> And then to ensure vCPUs can't do anything, check KVM_REQ_VM_DEAD
> after acquiring vcpu->mutex.
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 6c07dd423458..883077eee4ce 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4433,6 +4433,12 @@ static long kvm_vcpu_ioctl(struct file *filp,
>  
>  	if (mutex_lock_killable(&vcpu->mutex))
>  		return -EINTR;
> +
> +	if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu)) {
> +		r = -EIO;
> +		goto out;
> +	}
> +
>  	switch (ioctl) {
>  	case KVM_RUN: {
>  		struct pid *oldpid;
>
> That should address all TDVPS paths (I hope), and I _think_ would
> address all MMU-related paths as well?  E.g. prefault requires a
> vCPU.
>
> Disallowing (most) vCPU ioctls but not all VM ioctls on vm_dead isn't
> great ABI (understatement), but I think we need/want the above
> changes even if we keep the general vm_dead restriction.  And given
> the extremely ad hoc behavior of taking kvm->lock for VM ioctls,
> trying to enforce vm_dead for "all" VM ioctls seems like a fool's
> errand.
>
> So I'm leaning toward keeping "KVM: Reject ioctls only if the VM is
> bugged, not simply marked dead" (with a different shortlog+changelog),
> but keeping vm_dead (and not introducing kvm_tdx.vm_terminated).
>
> Thoughts?

That covers the cases I listed, so it is fine by me.