On Wed, 2025-07-02 at 08:12 +0800, Yan Zhao wrote:
> > Then (3) could be skipped in the case of ability to demote under read
> > lock?
> > 
> > I noticed that the other lpage_info updaters took mmu write lock, and I
> > wasn't sure why. We shouldn't take a lock that we don't actually need
> > just for safety margin or to copy other code.
> Use write mmu_lock is of reason.
> 
> In the 3 steps,
> 1. set lpage_info
> 2. demote if needed
> 3. go to fault handler
> 
> Step 2 requires holding write mmu_lock before invoking
> kvm_split_boundary_leafs().
> The write mmu_lock is also possible to get dropped and re-acquired in
> kvm_split_boundary_leafs() for purpose like memory allocation.
> 
> If 1 is with read mmu_lock, the other vCPUs is still possible to fault
> in at 2MB level after the demote in step 2.
> Luckily, current TDX doesn't support promotion now.
> But we can avoid wading into this complex situation by holding write
> mmu_lock in 1.

I don't think the possibility that some code might race in the future is a
good reason to take the write lock.

> > > > For most accept cases, we could fault in the PTE's on the read
> > > > lock. And in the future we could
> > > 
> > > The actual mapping at 4KB level is still with read mmu_lock in
> > > __vmx_handle_ept_violation().
> > > 
> > > > have a demote that could work under read lock, as we talked. So
> > > > kvm_split_boundary_leafs() often or could be unneeded or work under
> > > > read lock when needed.
> > > Could we leave the "demote under read lock" as a future optimization?
> > 
> > We could add it to the list. If we have a TDX module that supports
> > demote with a single SEAMCALL then we don't have the rollback problem.
> > The optimization could utilize that. That said, we should focus on the
> > optimizations that make the biggest difference to real TDs. Your data
> > suggests this might not be the case today.
> Ok.
> 
> > > > What is the problem in hugepage_set_guest_inhibit() that requires
> > > > the write lock?
> > > As above, to avoid the other vCPUs reading stale mapping level and
> > > splitting under read mmu_lock.
> > 
> > We need mmu write lock for demote, but as long as the order is:
> > 1. set lpage_info
> > 2. demote if needed
> > 3. go to fault handler
> > 
> > Then (3) should have what it needs even if another fault races (1).
> See the above comment for why we need to hold write mmu_lock for 1.
> 
> Besides, as we need write mmu_lock anyway for 2 (i.e. hold write
> mmu_lock before walking the SPTEs to check if there's any existing
> mapping), I don't see any performance impact by holding write mmu_lock
> for 1.

It's a maintainability problem too. Someday someone may want to remove it
and scratch their head over what race they are missing.

> > > As guest_inhibit is set one-way, we could test it using
> > > hugepage_test_guest_inhibit() without holding the lock. The chance to
> > > hold write mmu_lock for hugepage_set_guest_inhibit() is then greatly
> > > reduced. (in my testing, 11 during VM boot).
> > > 
> > > > But in any case, it seems like we have *a* solution here. It
> > > > doesn't seem like there are any big downsides. Should we close it?
> > > I think it's good, as long as Sean doesn't disagree :)
> > 
> > He seemed onboard. Let's close it. We can even discuss lpage_info
> > update locking on v2.
> Ok. I'll use write mmu_lock for updating lpage_info in v2 first.

Specifically, why do the other lpage_info updating functions take the mmu
write lock?
Are you sure there is no other reason?
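
For v2 reference, the ordering I have in mind looks roughly like the sketch
below. The wrapper name tdx_guest_inhibit_hugepage() is made up, and the
signatures of hugepage_test_guest_inhibit(), hugepage_set_guest_inhibit()
and kvm_split_boundary_leafs() are assumed from this series, so treat it as
an illustration of the lock ordering rather than actual code:

/*
 * Rough sketch only: tdx_guest_inhibit_hugepage() is a made-up wrapper,
 * and the signatures of hugepage_test_guest_inhibit(),
 * hugepage_set_guest_inhibit() and kvm_split_boundary_leafs() are
 * assumed from this series rather than taken from final code.
 */
static int tdx_guest_inhibit_hugepage(struct kvm *kvm,
				      struct kvm_memory_slot *slot, gfn_t gfn)
{
	int ret;

	/*
	 * guest_inhibit is set one-way, so an unlocked test lets every
	 * fault after the first skip the write-lock path entirely.
	 */
	if (hugepage_test_guest_inhibit(slot, gfn, PG_LEVEL_2M))
		return 0;

	write_lock(&kvm->mmu_lock);

	/* 1. set lpage_info so no new 2MB mapping can be created */
	hugepage_set_guest_inhibit(slot, gfn, PG_LEVEL_2M);

	/*
	 * 2. demote any existing 2MB leaf; this may drop and re-acquire
	 *    mmu_lock internally (e.g. for memory allocation).
	 */
	ret = kvm_split_boundary_leafs(kvm, slot, gfn);

	write_unlock(&kvm->mmu_lock);

	/*
	 * 3. the caller then goes to the normal fault path, which maps at
	 *    4KB level under read mmu_lock via __vmx_handle_ept_violation().
	 */
	return ret;
}

With the one-way unlocked test up front, the write-lock path should only be
hit on the first accept fault for a given 2MB region, which matches the
small counts you measured during boot.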