On Wed, Jun 18, 2025, Adrian Hunter wrote:
> On 18/06/2025 09:00, Vishal Annapurve wrote:
> > On Tue, Jun 17, 2025 at 10:50 PM Adrian Hunter <adrian.hunter@xxxxxxxxx> wrote:
> >>> Ability to clean up memslots from userspace without closing
> >>> VM/guest_memfd handles is useful to keep reusing the same guest_memfds
> >>> for the next boot iteration of the VM in case of reboot.
> >>
> >> TD lifecycle does not include reboot.  In other words, reboot is
> >> done by shutting down the TD and then starting again with a new TD.
> >>
> >> AFAIK it is not currently possible to shut down without closing
> >> guest_memfds since the guest_memfd holds a reference (users_count)
> >> to struct kvm, and destruction begins when users_count hits zero.
> >
> > gmem link support[1] allows associating existing guest_memfds with new
> > VM instances.
> >
> > Breakdown of the userspace VMM flow:
> > 1) Create a new VM instance before closing guest_memfd files.
> > 2) Link existing guest_memfd files with the new VM instance. -> This
> > creates a new set of files backed by the same inode but associated with
> > the new VM instance.
>
> So what about:
>
>   2.5) Call KVM_TDX_TERMINATE_VM IOCTL
>
> Memory reclaimed after KVM_TDX_TERMINATE_VM will be done efficiently,
> so avoid causing it to be reclaimed earlier.

The problem is that setting kvm->vm_dead will prevent (3) from
succeeding.  If kvm->vm_dead is set, KVM will reject all vCPU, VM, and
device (not /dev/kvm the device, but rather devices bound to the VM)
ioctls.  I intended that behavior, e.g. to guard against userspace
blowing up KVM because the hkid was released, I just didn't consider
the memslots angle.

The other thing I didn't consider at the time is that vm_dead doesn't
fully protect against ioctls that are already in flight.  E.g. see
commit 17c80d19df6f ("KVM: SVM: Reject SEV{-ES} intra host migration if
vCPU creation is in-flight").  Though without a failure/splat of some
kind, it's impossible to know if this is actually a problem.  I.e. I
don't think we should *entirely* scrap blocking ioctls just because it
*might* not be perfect (we can always make KVM better).

I can think of a few options:

 1. Skip the vm_dead check if userspace is deleting a memslot.
 2. Provide a way for userspace to delete all memslots, and have that
    bypass vm_dead.
 3. Delete all memslots on KVM_TDX_TERMINATE_VM.
 4. Remove vm_dead and instead reject ioctls based on vm_bugged, and
    simply rely on KVM_REQ_VM_DEAD to prevent running the guest.  I.e.
    tweak kvm_vm_dead() so that it only prevents running the VM, and
    have kvm_vm_bugged() be the "something is badly broken, try to
    limit the damage" case.

I'm heavily leaning toward #4.  #1 is doable, but painful.  #2 is
basically #1, but with new uAPI.  #3 is all kinds of gross, e.g.
userspace might want to simply kill the VM and move on.

With #4, KVM would still block ioctls, but only if a bug was detected.
And the few use cases where KVM just wants to prevent entering the
guest won't prevent gracefully tearing down the VM.

Hah!  And there's actually a TDX bug fix here, because "checking" for
KVM_REQ_VM_DEAD in kvm_tdp_map_page() and tdx_handle_ept_violation()
will *clear* the request, which isn't what we want, e.g. a vCPU could
actually re-enter the guest at that point.

This is what I'm thinking.  If I don't hate it come Friday (or Monday),
I'll turn this patch into a mini-series and post v5.

Adrian, does that work for you?
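For reference, here's the kvm_check_request() vs. kvm_test_request()
distinction the conversions below rely on, as a minimal sketch
(simplified from include/linux/kvm_host.h; the real helpers also have
memory barriers):

	/* Only observes the bit; the request stays pending. */
	static inline bool kvm_test_request(int req, struct kvm_vcpu *vcpu)
	{
		return test_bit(req & KVM_REQUEST_MASK, (void *)&vcpu->requests);
	}

	/*
	 * Test-and-clear: once "checked", the request is gone, and
	 * nothing stops the vCPU from re-entering the guest.
	 */
	static inline bool kvm_check_request(int req, struct kvm_vcpu *vcpu)
	{
		if (!kvm_test_request(req, vcpu))
			return false;

		kvm_clear_request(req, vcpu);
		return true;
	}

I.e. "checking" is destructive, so a request that must stay sticky,
like KVM_REQ_VM_DEAD, needs to be "tested" instead.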
---
 arch/arm64/kvm/arm.c            |  2 +-
 arch/arm64/kvm/vgic/vgic-init.c |  2 +-
 arch/x86/kvm/x86.c              |  2 +-
 include/linux/kvm_host.h        |  2 --
 virt/kvm/kvm_main.c             | 10 +++++-----
 5 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index de2b4e9c9f9f..18bd80388b59 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1017,7 +1017,7 @@ static int kvm_vcpu_suspend(struct kvm_vcpu *vcpu)
 static int check_vcpu_requests(struct kvm_vcpu *vcpu)
 {
 	if (kvm_request_pending(vcpu)) {
-		if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu))
+		if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu))
 			return -EIO;
 
 		if (kvm_check_request(KVM_REQ_SLEEP, vcpu))
diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c
index eb1205654ac8..c2033bae73b2 100644
--- a/arch/arm64/kvm/vgic/vgic-init.c
+++ b/arch/arm64/kvm/vgic/vgic-init.c
@@ -612,7 +612,7 @@ int kvm_vgic_map_resources(struct kvm *kvm)
 	mutex_unlock(&kvm->arch.config_lock);
 out_slots:
 	if (ret)
-		kvm_vm_dead(kvm);
+		kvm_vm_bugged(kvm);
 
 	mutex_unlock(&kvm->slots_lock);
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b58a74c1722d..37f835d77b65 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10783,7 +10783,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	bool req_immediate_exit = false;
 
 	if (kvm_request_pending(vcpu)) {
-		if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu)) {
+		if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu)) {
 			r = -EIO;
 			goto out;
 		}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3bde4fb5c6aa..56898e4ab524 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -853,7 +853,6 @@ struct kvm {
 	u32 dirty_ring_size;
 	bool dirty_ring_with_bitmap;
 	bool vm_bugged;
-	bool vm_dead;
 
 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
 	struct notifier_block pm_notifier;
@@ -893,7 +892,6 @@ struct kvm {
 
 static inline void kvm_vm_dead(struct kvm *kvm)
 {
-	kvm->vm_dead = true;
 	kvm_make_all_cpus_request(kvm, KVM_REQ_VM_DEAD);
 }
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index eec82775c5bf..4220579a9a74 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4403,7 +4403,7 @@ static long kvm_vcpu_ioctl(struct file *filp,
 	struct kvm_fpu *fpu = NULL;
 	struct kvm_sregs *kvm_sregs = NULL;
 
-	if (vcpu->kvm->mm != current->mm || vcpu->kvm->vm_dead)
+	if (vcpu->kvm->mm != current->mm || vcpu->kvm->vm_bugged)
 		return -EIO;
 
 	if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
@@ -4646,7 +4646,7 @@ static long kvm_vcpu_compat_ioctl(struct file *filp,
 	void __user *argp = compat_ptr(arg);
 	int r;
 
-	if (vcpu->kvm->mm != current->mm || vcpu->kvm->vm_dead)
+	if (vcpu->kvm->mm != current->mm || vcpu->kvm->vm_bugged)
 		return -EIO;
 
 	switch (ioctl) {
@@ -4712,7 +4712,7 @@ static long kvm_device_ioctl(struct file *filp, unsigned int ioctl,
 {
 	struct kvm_device *dev = filp->private_data;
 
-	if (dev->kvm->mm != current->mm || dev->kvm->vm_dead)
+	if (dev->kvm->mm != current->mm || dev->kvm->vm_bugged)
 		return -EIO;
 
 	switch (ioctl) {
@@ -5131,7 +5131,7 @@ static long kvm_vm_ioctl(struct file *filp,
 	void __user *argp = (void __user *)arg;
 	int r;
 
-	if (kvm->mm != current->mm || kvm->vm_dead)
+	if (kvm->mm != current->mm || kvm->vm_bugged)
 		return -EIO;
 	switch (ioctl) {
 	case KVM_CREATE_VCPU:
@@ -5395,7 +5395,7 @@ static long kvm_vm_compat_ioctl(struct file *filp,
 	struct kvm *kvm = filp->private_data;
 	int r;
 
-	if (kvm->mm != current->mm || kvm->vm_dead)
+	if (kvm->mm != current->mm || kvm->vm_bugged)
 		return -EIO;
 
 	r = kvm_arch_vm_compat_ioctl(filp, ioctl, arg);

base-commit: 8046d29dde17002523f94d3e6e0ebe486ce52166
--