On Wed, Apr 23, 2025 at 11:17:54AM -0700, Ackerley Tng wrote:
> Michael Roth <michael.roth@xxxxxxx> writes:
> 
> > On Fri, Apr 18, 2025 at 08:13:17AM -0700, Sean Christopherson wrote:
> >> 
> >> That all looks good to me.  And to ensure we don't go off the rails due to bad
> >> inputs (which are supposed to be fully validated by common KVM), we could add a
> >> WARN to detect a non-exclusive range->end.
> >> 
> >> So this?
> >> 
> >> 	if (WARN_ON_ONCE(range->end <= range->start))
> >> 		return false;
> >> 
> >> 	/*
> >> 	 * If the head and tail pages of the range currently allow a hugepage,
> >> 	 * i.e. reside fully in the slot and don't have mixed attributes, then
> >> 	 * add each corresponding hugepage range to the ongoing invalidation,
> >> 	 * e.g. to prevent KVM from creating a hugepage in response to a fault
> >> 	 * for a gfn whose attributes aren't changing.  Note, only the range
> >> 	 * of gfns whose attributes are being modified needs to be explicitly
> >> 	 * unmapped, as that will unmap any existing hugepages.
> >> 	 */
> >> 	for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
> >> 		gfn_t start = gfn_round_for_level(range->start, level);
> >> 		gfn_t end = gfn_round_for_level(range->end - 1, level);
> >> 		gfn_t nr_pages = KVM_PAGES_PER_HPAGE(level);
> >> 
> >> 		if ((start != range->start || start + nr_pages > range->end) &&
> >> 		    start >= slot->base_gfn &&
> >> 		    start + nr_pages <= slot->base_gfn + slot->npages &&
> >> 		    !hugepage_test_mixed(slot, start, level))
> >> 			kvm_mmu_invalidate_range_add(kvm, start, start + nr_pages);
> >> 
> >> 		if (end == start)
> >> 			continue;
> >> 
> >> 		if ((end + nr_pages) <= (slot->base_gfn + slot->npages) &&
> >> 		    !hugepage_test_mixed(slot, end, level))
> >> 			kvm_mmu_invalidate_range_add(kvm, end, end + nr_pages);
> >> 	}
> > 
> > Looks good! Re-tested with this version of the patch and it seems to address
> > the original issue.
> > 
> > Tested-by: Michael Roth <michael.roth@xxxxxxx>
> > 
> > Thanks,
> > 
> > Mike
> 
> Would you be able to share how you set up a test to trigger this issue?
> Thanks in advance!

The original issue was triggered when running SNP guests using 2MB hugetlbfs
pages for the shared memory, and it would happen when
mem_encrypt_free_decrypted_mem() was called. In this case 20KB of the 2MB
decrypted BSS region is used, and the issue triggered an infinite loop when
the guest tried to re-encrypt the memory as part of freeing it back to the
kernel:

  [   36.521488] Freeing unused decrypted memory: 2028K

I don't think it is specific to hugetlbfs, since the below hack can be used to
trigger the race fairly reliably even without hugetlbfs. I noticed that pretty
much every guest vCPU will fault on the same GPA after the initial
invalidation, and I think it might be kvmclock that's using that GPA. So the
below just expands the race window to make sure one of those other vCPUs
triggers a fault after the shared->private conversion invalidates a 2MB NPT
entry.

I suspect hugetlbfs was originally a factor because it was using pre-allocated
pages, so maybe it was able to re-install the zapped entry more quickly vs.
the normal flow, where the shared page would get hole-punched by QEMU and so
would need to be re-allocated.

-Mike

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index de2c11dae231..27fdf05c79eb 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2457,6 +2457,8 @@ static bool kvm_pre_set_memory_attributes(struct kvm *kvm,
 	return kvm_arch_pre_set_memory_attributes(kvm, range);
 }
 
+#include <linux/delay.h>
+
 /* Set @attributes for the gfn range [@start, @end). */
 static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 				     unsigned long attributes)
@@ -2500,6 +2502,11 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 	}
 
 	kvm_handle_gfn_range(kvm, &pre_set_range);
+	/* page offset 5 corresponds to the first PSC for the BSS section in question. */
+	if ((start & 0x1FF) == 5) {
+		pr_warn("%s: prolonging GFN invalidation start %llx end %llx attributes %lx", __func__, start, end, attributes);
+		msleep(2000);
+	}
 
 	for (i = start; i < end; i++) {
 		r = xa_err(xa_store(&kvm->mem_attr_array, i, entry,