On Wed, Jul 23, 2025 at 1:35 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Mon, Jul 07, 2025, James Houghton wrote:
> > From: Vipin Sharma <vipinsh@xxxxxxxxxx>
> >
> > Use MMU read lock to recover TDP MMU NX huge pages. Iterate
>
> Wrap at ~75 chars.

Oh, I did indeed wrap the text pretty aggressively for this patch. Not
sure why that happened.

>
> > over the huge pages list under tdp_mmu_pages_lock protection and
> > unaccount the page before dropping the lock.
> >
> > We must not zap an SPTE if:
>
> No pronouns!

Right.

>
> > - The SPTE is a root page.
> > - The SPTE does not point at the SP's page table.
> >
> > If the SPTE does not point at the SP's page table, then something else
> > has change the SPTE, so we cannot safely zap it.
> >
> > Warn if zapping SPTE fails and current SPTE is still pointing to same
> > page table. This should never happen.
> >
> > There is always a race between dirty logging, vCPU faults, and NX huge
> > page recovery for backing a gfn by an NX huge page or an executable
> > small page. Unaccounting sooner during the list traversal is increasing
> > the window of that race. Functionally, it is okay, because accounting
> > doesn't protect against iTLB multi-hit bug, it is there purely to
> > prevent KVM from bouncing a gfn between two page sizes. The only
> > downside is that a vCPU will end up doing more work in tearing down all
> > the child SPTEs. This should be a very rare race.
> >
> > Zapping under MMU read lock unblocks vCPUs which are waiting for MMU
> > read lock. This optimizaion is done to solve a guest jitter issue on
> > Windows VM which was observing an increase in network latency.
>
> With slight tweaking:
>
> Use MMU read lock to recover TDP MMU NX huge pages. To prevent
> concurrent modification of the list of potential huge pages, iterate over
> the list under tdp_mmu_pages_lock protection and unaccount the page
> before dropping the lock.
>
> Zapping under MMU read lock unblocks vCPUs which are waiting for MMU
> read lock, which solves a guest jitter issue on Windows VMs which were
> observing an increase in network latency.
>
> Do not zap an SPTE if:
> - The SPTE is a root page.
> - The SPTE does not point at the SP's page table.
>
> If the SPTE does not point at the SP's page table, then something else
> has change the SPTE, so KVM cannot safely zap it.

"has changed" (my mistake)

>
> Warn if zapping SPTE fails and current SPTE is still pointing to same
> page table, as it should be impossible for the CMPXCHG to fail due to all
> other write scenarios being mutually exclusive.
>
> There is always a race between dirty logging, vCPU faults, and NX huge
> page recovery for backing a gfn by an NX huge page or an executable
> small page. Unaccounting sooner during the list traversal increases the
> window of that race, but functionally, it is okay. Accounting doesn't
> protect against iTLB multi-hit bug, it is there purely to prevent KVM
> from bouncing a gfn between two page sizes. The only downside is that a
> vCPU will end up doing more work in tearing down all the child SPTEs.
> This should be a very rare race.

Thanks, this is much better.
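To make the "do not zap" rules concrete, here is a rough sketch of the check
being described for the TDP MMU side. This is only a sketch for discussion,
not the exact code in the patch: the function name is made up, and the
spte_to_child_pt()/tdp_ptep_t usage to express "the SPTE still points at the
SP's page table" is approximate.

static bool tdp_mmu_can_zap_nx_huge_page(struct kvm_mmu_page *sp, u64 old_spte)
{
	/* Root pages have no parent SPTE, so there is nothing to zap. */
	if (!sp->ptep)
		return false;

	/*
	 * The parent SPTE sits one level above the SP.  If it no longer
	 * points at the SP's page table, something else has changed it
	 * (e.g. a vCPU fault installed a different SPTE), so it cannot be
	 * safely zapped here.
	 */
	if (spte_to_child_pt(old_spte, sp->role.level + 1) != (tdp_ptep_t)sp->spt)
		return false;

	/*
	 * The caller zaps with an atomic CMPXCHG against old_spte, and WARNs
	 * if the CMPXCHG fails while the SPTE still points at sp->spt; all
	 * other writers are mutually exclusive, so that should be impossible.
	 */
	return true;
}

The unaccounting itself happens earlier, under tdp_mmu_pages_lock, so the
worst case for losing this race is some extra work tearing down child SPTEs.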
>
> > Suggested-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> > Signed-off-by: Vipin Sharma <vipinsh@xxxxxxxxxx>
> > Co-developed-by: James Houghton <jthoughton@xxxxxxxxxx>
> > Signed-off-by: James Houghton <jthoughton@xxxxxxxxxx>
> > ---
> >  arch/x86/kvm/mmu/mmu.c     | 107 ++++++++++++++++++++++++-------------
> >  arch/x86/kvm/mmu/tdp_mmu.c |  42 ++++++++++++---
> >  2 files changed, 105 insertions(+), 44 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index b074f7bb5cc58..7df1b4ead705b 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -7535,12 +7535,40 @@ static unsigned long nx_huge_pages_to_zap(struct kvm *kvm,
> >  	return ratio ? DIV_ROUND_UP(pages, ratio) : 0;
> >  }
> >
> > +static bool kvm_mmu_sp_dirty_logging_enabled(struct kvm *kvm,
> > +					      struct kvm_mmu_page *sp)
> > +{
> > +	struct kvm_memory_slot *slot = NULL;
> > +
> > +	/*
> > +	 * Since gfn_to_memslot() is relatively expensive, it helps to skip it if
> > +	 * it the test cannot possibly return true. On the other hand, if any
> > +	 * memslot has logging enabled, chances are good that all of them do, in
> > +	 * which case unaccount_nx_huge_page() is much cheaper than zapping the
> > +	 * page.
>
> And largely irrelevant, because KVM should unaccount the NX no matter what. I
> kinda get what you're saying, but honestly it adds a lot of confusion, especially
> since unaccount_nx_huge_page() is in the caller.
>
> > +	 *
> > +	 * If a memslot update is in progress, reading an incorrect value of
> > +	 * kvm->nr_memslots_dirty_logging is not a problem: if it is becoming
> > +	 * zero, gfn_to_memslot() will be done unnecessarily; if it is becoming
> > +	 * nonzero, the page will be zapped unnecessarily. Either way, this only
> > +	 * affects efficiency in racy situations, and not correctness.
> > +	 */
> > +	if (atomic_read(&kvm->nr_memslots_dirty_logging)) {
>
> Short-circuit the function to decrease indentation, and so that "slot" doesn't
> need to be NULL-initialized.
>
> > +		struct kvm_memslots *slots;
> > +
> > +		slots = kvm_memslots_for_spte_role(kvm, sp->role);
> > +		slot = __gfn_to_memslot(slots, sp->gfn);
>
> Then this can be:
>
> 	slot = __gfn_to_memslot(kvm_memslots_for_spte_role(kvm, sp->role), sp->gfn);
>
> without creating a stupid-long line.
>
> > +		WARN_ON_ONCE(!slot);
>
> And then:
>
> 	if (WARN_ON_ONCE(!slot))
> 		return false;
>
> 	return kvm_slot_dirty_track_enabled(slot);
>
> With a comment cleanup:
>
> 	struct kvm_memory_slot *slot;
>
> 	/*
> 	 * Skip the memslot lookup if dirty tracking can't possibly be enabled,
> 	 * as memslot lookups are relatively expensive.
> 	 *
> 	 * If a memslot update is in progress, reading an incorrect value of
> 	 * kvm->nr_memslots_dirty_logging is not a problem: if it is becoming
> 	 * zero, KVM will do an unnecessary memslot lookup; if it is becoming
> 	 * nonzero, the page will be zapped unnecessarily. Either way, this
> 	 * only affects efficiency in racy situations, and not correctness.
> 	 */
> 	if (!atomic_read(&kvm->nr_memslots_dirty_logging))
> 		return false;
>
> 	slot = __gfn_to_memslot(kvm_memslots_for_spte_role(kvm, sp->role), sp->gfn);
> 	if (WARN_ON_ONCE(!slot))
> 		return false;
>
> 	return kvm_slot_dirty_track_enabled(slot);

LGTM, thanks!

> > @@ -7559,8 +7590,17 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
> >  	rcu_read_lock();
> >
> >  	for ( ; to_zap; --to_zap) {
> > -		if (list_empty(nx_huge_pages))
> > +#ifdef CONFIG_X86_64
>
> These #ifdefs still make me sad, but I also still think they're the least awful solution.
> And hopefully we will jettison 32-bit sooner than later :-)

Yeah I couldn't come up with anything better. :(
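One more thing, circling back to kvm_mmu_sp_dirty_logging_enabled() above:
just to double-check that I'm reading your suggestions the way you intended,
I believe the assembled helper ends up as below (signature taken from the
diff, body from your snippets; this is my reading of it, not necessarily the
final code):

static bool kvm_mmu_sp_dirty_logging_enabled(struct kvm *kvm,
					     struct kvm_mmu_page *sp)
{
	struct kvm_memory_slot *slot;

	/*
	 * Skip the memslot lookup if dirty tracking can't possibly be enabled,
	 * as memslot lookups are relatively expensive.
	 *
	 * If a memslot update is in progress, reading an incorrect value of
	 * kvm->nr_memslots_dirty_logging is not a problem: if it is becoming
	 * zero, KVM will do an unnecessary memslot lookup; if it is becoming
	 * nonzero, the page will be zapped unnecessarily. Either way, this
	 * only affects efficiency in racy situations, and not correctness.
	 */
	if (!atomic_read(&kvm->nr_memslots_dirty_logging))
		return false;

	slot = __gfn_to_memslot(kvm_memslots_for_spte_role(kvm, sp->role), sp->gfn);
	if (WARN_ON_ONCE(!slot))
		return false;

	return kvm_slot_dirty_track_enabled(slot);
}

If that matches what you had in mind, I'll fold it into the next version.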