On Wed, Jul 23, 2025 at 1:35 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Mon, Jul 07, 2025, James Houghton wrote:
> > From: Vipin Sharma <vipinsh@xxxxxxxxxx>
> >
> > Use MMU read lock to recover TDP MMU NX huge pages. Iterate
>
> Wrap at ~75 chars.

Oh, I did indeed wrap the text pretty aggressively for this patch. Not
sure why that happened.

>
> > over the huge pages list under tdp_mmu_pages_lock protection and
> > unaccount the page before dropping the lock.
> >
> > We must not zap an SPTE if:
>
> No pronouns!

Right.

>
> > - The SPTE is a root page.
> > - The SPTE does not point at the SP's page table.
> >
> > If the SPTE does not point at the SP's page table, then something else
> > has change the SPTE, so we cannot safely zap it.
> >
> > Warn if zapping SPTE fails and current SPTE is still pointing to same
> > page table. This should never happen.
> >
> > There is always a race between dirty logging, vCPU faults, and NX huge
> > page recovery for backing a gfn by an NX huge page or an executable
> > small page. Unaccounting sooner during the list traversal is increasing
> > the window of that race. Functionally, it is okay, because accounting
> > doesn't protect against iTLB multi-hit bug, it is there purely to
> > prevent KVM from bouncing a gfn between two page sizes. The only
> > downside is that a vCPU will end up doing more work in tearing down all
> > the child SPTEs. This should be a very rare race.
> >
> > Zapping under MMU read lock unblocks vCPUs which are waiting for MMU
> > read lock. This optimizaion is done to solve a guest jitter issue on
> > Windows VM which was observing an increase in network latency.
>
> With slight tweaking:
>
> Use MMU read lock to recover TDP MMU NX huge pages. To prevent
> concurrent modification of the list of potential huge pages, iterate over
> the list under tdp_mmu_pages_lock protection and unaccount the page
> before dropping the lock.
>
> Zapping under MMU read lock unblocks vCPUs which are waiting for MMU
> read lock, which solves a guest jitter issue on Windows VMs which were
> observing an increase in network latency.
>
> Do not zap an SPTE if:
> - The SPTE is a root page.
> - The SPTE does not point at the SP's page table.
>
> If the SPTE does not point at the SP's page table, then something else
> has change the SPTE, so KVM cannot safely zap it.

"has changed" (my mistake)

>
> Warn if zapping SPTE fails and current SPTE is still pointing to same
> page table, as it should be impossible for the CMPXCHG to fail due to all
> other write scenarios being mutually exclusive.
>
> There is always a race between dirty logging, vCPU faults, and NX huge
> page recovery for backing a gfn by an NX huge page or an executable
> small page. Unaccounting sooner during the list traversal increases the
> window of that race, but functionally, it is okay. Accounting doesn't
> protect against iTLB multi-hit bug, it is there purely to prevent KVM
> from bouncing a gfn between two page sizes. The only downside is that a
> vCPU will end up doing more work in tearing down all the child SPTEs.
> This should be a very rare race.

Thanks, this is much better.
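To make the "do not zap" rules concrete, here is a rough sketch of the check
being described for the TDP MMU side. This is only a sketch for discussion,
not the exact code in the patch: the function name is made up, and the
spte_to_child_pt()/tdp_ptep_t usage to express "the SPTE still points at the
SP's page table" is approximate.

static bool tdp_mmu_can_zap_nx_huge_page(struct kvm_mmu_page *sp, u64 old_spte)
{
	/* Root pages have no parent SPTE, so there is nothing to zap. */
	if (!sp->ptep)
		return false;

	/*
	 * The parent SPTE sits one level above the SP.  If it no longer
	 * points at the SP's page table, something else has changed it
	 * (e.g. a vCPU fault installed a different SPTE), so it cannot be
	 * safely zapped here.
	 */
	if (spte_to_child_pt(old_spte, sp->role.level + 1) != (tdp_ptep_t)sp->spt)
		return false;

	/*
	 * The caller zaps with an atomic CMPXCHG against old_spte, and WARNs
	 * if the CMPXCHG fails while the SPTE still points at sp->spt; all
	 * other writers are mutually exclusive, so that should be impossible.
	 */
	return true;
}

The unaccounting itself happens earlier, under tdp_mmu_pages_lock, so the
worst case for losing this race is some extra work tearing down child SPTEs.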
>
> > Suggested-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> > Signed-off-by: Vipin Sharma <vipinsh@xxxxxxxxxx>
> > Co-developed-by: James Houghton <jthoughton@xxxxxxxxxx>
> > Signed-off-by: James Houghton <jthoughton@xxxxxxxxxx>
> > ---
> >  arch/x86/kvm/mmu/mmu.c     | 107 ++++++++++++++++++++++++-------------
> >  arch/x86/kvm/mmu/tdp_mmu.c |  42 ++++++++++++---
> >  2 files changed, 105 insertions(+), 44 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index b074f7bb5cc58..7df1b4ead705b 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -7535,12 +7535,40 @@ static unsigned long nx_huge_pages_to_zap(struct kvm *kvm,
> >  	return ratio ? DIV_ROUND_UP(pages, ratio) : 0;
> >  }
> >
> > +static bool kvm_mmu_sp_dirty_logging_enabled(struct kvm *kvm,
> > +					      struct kvm_mmu_page *sp)
> > +{
> > +	struct kvm_memory_slot *slot = NULL;
> > +
> > +	/*
> > +	 * Since gfn_to_memslot() is relatively expensive, it helps to skip it if
> > +	 * it the test cannot possibly return true. On the other hand, if any
> > +	 * memslot has logging enabled, chances are good that all of them do, in
> > +	 * which case unaccount_nx_huge_page() is much cheaper than zapping the
> > +	 * page.
>
> And largely irrelevant, because KVM should unaccount the NX no matter what. I
> kinda get what you're saying, but honestly it adds a lot of confusion, especially
> since unaccount_nx_huge_page() is in the caller.
>
> > +	 *
> > +	 * If a memslot update is in progress, reading an incorrect value of
> > +	 * kvm->nr_memslots_dirty_logging is not a problem: if it is becoming
> > +	 * zero, gfn_to_memslot() will be done unnecessarily; if it is becoming
> > +	 * nonzero, the page will be zapped unnecessarily. Either way, this only
> > +	 * affects efficiency in racy situations, and not correctness.
> > +	 */
> > +	if (atomic_read(&kvm->nr_memslots_dirty_logging)) {
>
> Short-circuit the function to decrease indentation, and so that "slot" doesn't
> need to be NULL-initialized.
>
> > +		struct kvm_memslots *slots;
> > +
> > +		slots = kvm_memslots_for_spte_role(kvm, sp->role);
> > +		slot = __gfn_to_memslot(slots, sp->gfn);
>
> Then this can be:
>
> 	slot = __gfn_to_memslot(kvm_memslots_for_spte_role(kvm, sp->role), sp->gfn);
>
> without creating a stupid-long line.
>
> > +		WARN_ON_ONCE(!slot);
>
> And then:
>
> 	if (WARN_ON_ONCE(!slot))
> 		return false;
>
> 	return kvm_slot_dirty_track_enabled(slot);
>
> With a comment cleanup:
>
> 	struct kvm_memory_slot *slot;
>
> 	/*
> 	 * Skip the memslot lookup if dirty tracking can't possibly be enabled,
> 	 * as memslot lookups are relatively expensive.
> 	 *
> 	 * If a memslot update is in progress, reading an incorrect value of
> 	 * kvm->nr_memslots_dirty_logging is not a problem: if it is becoming
> 	 * zero, KVM will do an unnecessary memslot lookup; if it is becoming
> 	 * nonzero, the page will be zapped unnecessarily. Either way, this
> 	 * only affects efficiency in racy situations, and not correctness.
> 	 */
> 	if (!atomic_read(&kvm->nr_memslots_dirty_logging))
> 		return false;
>
> 	slot = __gfn_to_memslot(kvm_memslots_for_spte_role(kvm, sp->role), sp->gfn);
> 	if (WARN_ON_ONCE(!slot))
> 		return false;
>
> 	return kvm_slot_dirty_track_enabled(slot);

LGTM, thanks!

> > @@ -7559,8 +7590,17 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
> >  	rcu_read_lock();
> >
> >  	for ( ; to_zap; --to_zap) {
> > -		if (list_empty(nx_huge_pages))
> > +#ifdef CONFIG_X86_64
>
> These #ifdefs still make me sad, but I also still think they're the least awful solution.
> And hopefully we will jettison 32-bit sooner than later :-)

Yeah I couldn't come up with anything better. :(
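One more thing, circling back to kvm_mmu_sp_dirty_logging_enabled() above:
just to double-check that I'm reading your suggestions the way you intended,
I believe the assembled helper ends up as below (signature taken from the
diff, body from your snippets; this is my reading of it, not necessarily the
final code):

static bool kvm_mmu_sp_dirty_logging_enabled(struct kvm *kvm,
					     struct kvm_mmu_page *sp)
{
	struct kvm_memory_slot *slot;

	/*
	 * Skip the memslot lookup if dirty tracking can't possibly be enabled,
	 * as memslot lookups are relatively expensive.
	 *
	 * If a memslot update is in progress, reading an incorrect value of
	 * kvm->nr_memslots_dirty_logging is not a problem: if it is becoming
	 * zero, KVM will do an unnecessary memslot lookup; if it is becoming
	 * nonzero, the page will be zapped unnecessarily. Either way, this
	 * only affects efficiency in racy situations, and not correctness.
	 */
	if (!atomic_read(&kvm->nr_memslots_dirty_logging))
		return false;

	slot = __gfn_to_memslot(kvm_memslots_for_spte_role(kvm, sp->role), sp->gfn);
	if (WARN_ON_ONCE(!slot))
		return false;

	return kvm_slot_dirty_track_enabled(slot);
}

If that matches what you had in mind, I'll fold it into the next version.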