On 20/03/25 02:42, Vishal Annapurve wrote:
> On Thu, Mar 13, 2025 at 11:17 AM Adrian Hunter <adrian.hunter@xxxxxxxxx> wrote:
>>
>> Improve TDX shutdown performance by adding a more efficient shutdown
>> operation at the cost of adding separate branches for the TDX MMU
>> operations for normal runtime and shutdown. This more efficient method was
>> previously used in earlier versions of the TDX patches, but was removed to
>> simplify the initial upstreaming. This is an RFC and still needs a proper
>> upstream commit log. It is intended to be an eventual follow-up to base
>> support.
>>
>> == Background ==
>>
>> TDX has 2 methods for the host to reclaim guest private memory, depending
>> on whether the TD (TDX VM) is in a runnable state or not. These are
>> called, respectively:
>>   1. Dynamic Page Removal
>>   2. Reclaiming TD Pages in TD_TEARDOWN State
>>
>> Dynamic Page Removal is much slower. Reclaiming a 4K page in TD_TEARDOWN
>> state can be 5 times faster, although that is just one part of TD shutdown.
>>
>> == Relevant TDX Architecture ==
>>
>> Dynamic Page Removal is slow because it has to potentially deal with a
>> running TD, and so involves a number of steps:
>>   - Block further address translation
>>   - Exit each VCPU
>>   - Clear Secure EPT entry
>>   - Flush/write-back/invalidate relevant caches
>>
>> Reclaiming TD Pages in TD_TEARDOWN State is fast because the shutdown
>> procedure (refer tdx_mmu_release_hkid()) has already performed the relevant
>> flushing. For details, see TDX Module Base Spec October 2024 sections:
>>
>>   7.5.   TD Teardown Sequence
>>   5.6.3. TD Keys Reclamation, TLB and Cache Flush
>>
>> Essentially, all that remains then is to take each page away from the
>> TDX Module and return it to the kernel.
>>
>> == Problem ==
>>
>> Currently, Dynamic Page Removal is being used when the TD is being
>> shut down, for the sake of having simpler initial code.
>>
>> This happens when guest_memfds are closed, refer kvm_gmem_release().
>> guest_memfds hold a reference to struct kvm, so that VM destruction cannot
>> happen until after they are released, refer kvm_gmem_release().
>>
>> Reclaiming TD Pages in TD_TEARDOWN State was seen to decrease the total
>> reclaim time. For example:
>>
>>   VCPUs  Size (GB)  Before (secs)  After (secs)
>>     4       18           72             24
>>    32      107          517            134
>>
>> Note, the V19 patch set:
>>
>>   https://lore.kernel.org/all/cover.1708933498.git.isaku.yamahata@xxxxxxxxx/
>>
>> did not have this issue because the HKID was released early, something that
>> Sean effectively NAK'ed:
>>
>>   "No, the right answer is to not release the HKID until the VM is
>>   destroyed."
>>
>>   https://lore.kernel.org/all/ZN+1QHGa6ltpQxZn@xxxxxxxxxx/
>>
>> That was taken on board in the "TDX MMU Part 2" patch set. Refer
>> "Moving of tdx_mmu_release_hkid()" in:
>>
>>   https://lore.kernel.org/kvm/20240904030751.117579-1-rick.p.edgecombe@xxxxxxxxx/
>>
>> == Options ==
>>
>>   1. Start TD teardown earlier so that when pages are removed,
>>      they can be reclaimed faster.
>>   2. Defer page removal until after TD teardown has started.
>>   3. A combination of 1 and 2.
>>
>> Option 1 is problematic because it means putting the TD into a non-runnable
>> state while it is potentially still active. Also, as mentioned above, Sean
>> effectively NAK'ed it.
>>
>> Option 2 is possible because the lifetime of guest memory pages is separate
>> from guest_memfd (struct kvm_gmem) lifetime.
>>
>> A reference is taken to pages when they are faulted in, refer
>> kvm_gmem_get_pfn(). That reference is not released until the pages are
>> removed from the mirror SEPT, refer tdx_unpin().
>
> Note that this logic to pin guest memory pages should go away to
> support in-place conversion for huge pages [1], though you can still
> pin inodes IIUC.
>
> [1] https://lore.kernel.org/all/CAGtprH8akKUF=8+RkX_QMjp35C0bU1zxGi4v1Zm5AWCw=8V8AQ@xxxxxxxxxxxxxx/

Then the virt code should handle the pinning, since it becomes essential.
inode->i_mapping->i_private_list can be used because we only need to pin
inodes that have no gmem references, although that means using it here as a
list entry while elsewhere it is used as a list head. So perhaps like this:

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 0d445a317f61..90e7354d6f40 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -96,6 +96,7 @@ config KVM_INTEL
 	depends on KVM && IA32_FEAT_CTL
 	select KVM_GENERIC_PRIVATE_MEM if INTEL_TDX_HOST
 	select KVM_GENERIC_MEMORY_ATTRIBUTES if INTEL_TDX_HOST
+	select HAVE_KVM_GMEM_DEFER_REMOVAL if INTEL_TDX_HOST
 	help
 	  Provides support for KVM on processors equipped with Intel's VT
 	  extensions, a.k.a. Virtual Machine Extensions (VMX).
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 1f354b3dbc28..acb022291aec 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -5,6 +5,7 @@
 #include <asm/fpu/xcr.h>
 #include <linux/misc_cgroup.h>
 #include <linux/mmu_context.h>
+#include <linux/fs.h>
 #include <asm/tdx.h>
 #include "capabilities.h"
 #include "mmu.h"
@@ -626,6 +627,7 @@ int tdx_vm_init(struct kvm *kvm)
 	kvm->arch.has_protected_state = true;
 	kvm->arch.has_private_mem = true;
 	kvm->arch.disabled_quirks |= KVM_X86_QUIRK_IGNORE_GUEST_PAT;
+	kvm->gmem_defer_removal = true;
 
 	/*
 	 * Because guest TD is protected, VMM can't parse the instruction in TD.
@@ -1839,19 +1841,28 @@ int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
 	return tdx_reclaim_page(virt_to_page(private_spt));
 }
 
+static int tdx_sept_teardown_private_spte(struct kvm *kvm, enum pg_level level,
+					  struct page *page)
+{
+	int ret;
+
+	if (level != PG_LEVEL_4K)
+		return -EINVAL;
+
+	ret = tdx_reclaim_page(page);
+	if (!ret)
+		tdx_unpin(kvm, page);
+
+	return ret;
+}
+
 int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 				 enum pg_level level, kvm_pfn_t pfn)
 {
 	struct page *page = pfn_to_page(pfn);
 	int ret;
 
-	/*
-	 * HKID is released after all private pages have been removed, and set
-	 * before any might be populated. Warn if zapping is attempted when
-	 * there can't be anything populated in the private EPT.
-	 */
-	if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
-		return -EINVAL;
+	if (!is_hkid_assigned(to_kvm_tdx(kvm)))
+		return tdx_sept_teardown_private_spte(kvm, level, page);
 
 	ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
 	if (ret <= 0)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ed1968f6f841..2630b88b0983 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -860,6 +860,10 @@ struct kvm {
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
 	/* Protected by slots_locks (for writes) and RCU (for reads) */
 	struct xarray mem_attr_array;
+#endif
+#ifdef CONFIG_HAVE_KVM_GMEM_DEFER_REMOVAL
+	struct list_head gmem_pinned_inodes;
+	bool gmem_defer_removal;
 #endif
 	char stats_id[KVM_STATS_NAME_SIZE];
 };
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 54e959e7d68f..f56a7e89116d 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -124,3 +124,7 @@ config HAVE_KVM_ARCH_GMEM_PREPARE
 config HAVE_KVM_ARCH_GMEM_INVALIDATE
 	bool
 	depends on KVM_PRIVATE_MEM
+
+config HAVE_KVM_GMEM_DEFER_REMOVAL
+	bool
+	depends on KVM_PRIVATE_MEM
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index b2aa6bf24d3a..a673da75a499 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -248,6 +248,36 @@ static long kvm_gmem_fallocate(struct file *file, int mode, loff_t offset,
 	return ret;
 }
 
+static bool kvm_gmem_defer_removal(struct kvm *kvm, struct kvm_gmem *gmem,
+				   struct inode *inode)
+{
+#ifdef CONFIG_HAVE_KVM_GMEM_DEFER_REMOVAL
+	if (kvm->gmem_defer_removal) {
+		list_del(&gmem->entry);
+		if (list_empty(&inode->i_mapping->i_private_list)) {
+			list_add_tail(&inode->i_mapping->i_private_list,
+				      &kvm->gmem_pinned_inodes);
+			ihold(inode);
+		}
+		return true;
+	}
+#endif
+	return false;
+}
+
+#ifdef CONFIG_HAVE_KVM_GMEM_DEFER_REMOVAL
+void kvm_gmem_unpin_inodes(struct kvm *kvm)
+{
+	struct address_space *mapping, *n;
+
+	list_for_each_entry_safe(mapping, n, &kvm->gmem_pinned_inodes,
+				 i_private_list) {
+		list_del_init(&mapping->i_private_list);
+		iput(mapping->host);
+	}
+}
+#endif
+
 static int kvm_gmem_release(struct inode *inode, struct file *file)
 {
 	struct kvm_gmem *gmem = file->private_data;
@@ -275,15 +305,18 @@ static int kvm_gmem_release(struct inode *inode, struct file *file)
 	xa_for_each(&gmem->bindings, index, slot)
 		WRITE_ONCE(slot->gmem.file, NULL);
 
-	/*
-	 * All in-flight operations are gone and new bindings can be created.
-	 * Zap all SPTEs pointed at by this file. Do not free the backing
-	 * memory, as its lifetime is associated with the inode, not the file.
-	 */
-	kvm_gmem_invalidate_begin(gmem, 0, -1ul);
-	kvm_gmem_invalidate_end(gmem, 0, -1ul);
+	if (!kvm_gmem_defer_removal(kvm, gmem, inode)) {
+		/*
+		 * All in-flight operations are gone and new bindings can be
+		 * created. Zap all SPTEs pointed at by this file. Do not free
+		 * the backing memory, as its lifetime is associated with the
+		 * inode, not the file.
+		 */
+		kvm_gmem_invalidate_begin(gmem, 0, -1ul);
+		kvm_gmem_invalidate_end(gmem, 0, -1ul);
 
-	list_del(&gmem->entry);
+		list_del(&gmem->entry);
+	}
 
 	filemap_invalidate_unlock(inode->i_mapping);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 08f237bf4107..5c15828b86fb 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1195,6 +1195,10 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
 	preempt_notifier_inc();
 	kvm_init_pm_notifier(kvm);
 
+#ifdef CONFIG_HAVE_KVM_GMEM_DEFER_REMOVAL
+	INIT_LIST_HEAD(&kvm->gmem_pinned_inodes);
+#endif
+
 	return kvm;
 
 out_err_no_debugfs:
@@ -1299,6 +1303,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	cleanup_srcu_struct(&kvm->srcu);
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
 	xa_destroy(&kvm->mem_attr_array);
+#endif
+#ifdef CONFIG_HAVE_KVM_GMEM_DEFER_REMOVAL
+	kvm_gmem_unpin_inodes(kvm);
 #endif
 	kvm_arch_free_vm(kvm);
 	preempt_notifier_dec();
diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
index acef3f5c582a..59562a355488 100644
--- a/virt/kvm/kvm_mm.h
+++ b/virt/kvm/kvm_mm.h
@@ -73,6 +73,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args);
 int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
 		  unsigned int fd, loff_t offset);
 void kvm_gmem_unbind(struct kvm_memory_slot *slot);
+#ifdef CONFIG_HAVE_KVM_GMEM_DEFER_REMOVAL
+void kvm_gmem_unpin_inodes(struct kvm *kvm);
+#endif /* HAVE_KVM_GMEM_DEFER_REMOVAL */
 #else
 static inline void kvm_gmem_init(struct module *module)
 {