On Wed, Jul 02, 2025 at 08:18:48AM +0800, Edgecombe, Rick P wrote:
> On Wed, 2025-07-02 at 08:12 +0800, Yan Zhao wrote:
> > > Then (3) could be skipped in the case of ability to demote under read
> > > lock?
> > > 
> > > I noticed that the other lpage_info updaters took mmu write lock, and I
> > > wasn't sure why. We shouldn't take a lock that we don't actually need
> > > just for safety margin or to copy other code.
> > There is a reason to use the write mmu_lock.
> > 
> > In the 3 steps,
> > 1. set lpage_info
> > 2. demote if needed
> > 3. go to fault handler
> > 
> > Step 2 requires holding the write mmu_lock before invoking
> > kvm_split_boundary_leafs().
> > The write mmu_lock may also be dropped and re-acquired in
> > kvm_split_boundary_leafs(), e.g. for memory allocation.
> > 
> > If 1 is done with the read mmu_lock, other vCPUs can still fault in at
> > 2MB level after the demote in step 2.
> > Luckily, TDX doesn't support promotion now.
> > But we can avoid wading into this complex situation by holding the write
> > mmu_lock in 1.
> 
> I don't think "some code might race in the future" is a good reason to
> take the write lock.
I still prefer to hold the write mmu_lock right now.
Otherwise, we at least need to convert disallow_lpage to an atomic variable
and update it atomically, e.g. via cmpxchg.

struct kvm_lpage_info {
	int disallow_lpage;
};

> 
> > > > > For most accept cases, we could fault in the PTE's on the read
> > > > > lock. And in the future we could
> > > > 
> > > > The actual mapping at 4KB level is still done with the read mmu_lock
> > > > in __vmx_handle_ept_violation().
> > > > 
> > > > > have a demote that could work under read lock, as we talked. So
> > > > > kvm_split_boundary_leafs() often or could be unneeded or work under
> > > > > read lock when needed.
> > > > Could we leave the "demote under read lock" as a future optimization?
> > > 
> > > We could add it to the list. If we have a TDX module that supports
> > > demote with a single SEAMCALL then we don't have the rollback problem.
> > > The optimization could utilize that. That said, we should focus on the
> > > optimizations that make the biggest difference to real TDs. Your data
> > > suggests this might not be the case today.
> > Ok.
> > 
> > > > > What is the problem in hugepage_set_guest_inhibit() that requires
> > > > > the write lock?
> > > > As above, to avoid other vCPUs reading a stale mapping level and
> > > > splitting under the read mmu_lock.
> > > 
> > > We need the mmu write lock for demote, but as long as the order is:
> > > 1. set lpage_info
> > > 2. demote if needed
> > > 3. go to fault handler
> > > 
> > > Then (3) should have what it needs even if another fault races (1).
> > See the above comment for why we need to hold the write mmu_lock for 1.
> > 
> > Besides, as we need the write mmu_lock anyway for 2 (i.e. hold the write
> > mmu_lock before walking the SPTEs to check if there's any existing
> > mapping), I don't see any performance impact from holding the write
> > mmu_lock for 1.
> 
> It's a maintainability problem too. Someday someone may want to remove it
> and scratch their head for what race they are missing.
I don't get why holding the write mmu_lock would cause a maintainability
problem.
In contrast, if we want to use the read mmu_lock in the future, we need to
carefully check whether there's any potential risk.

> 
> > > > As guest_inhibit is set one-way, we could test it using
> > > > hugepage_test_guest_inhibit() without holding the lock. The chance of
> > > > holding the write mmu_lock for hugepage_set_guest_inhibit() is then
> > > > greatly reduced (in my testing, 11 times during VM boot).
> > > > 
> > > > > But in any case, it seems like we have *a* solution here. It
> > > > > doesn't seem like there are any big downsides. Should we close it?
> > > > I think it's good, as long as Sean doesn't disagree :)
> > > 
> > > He seemed onboard. Let's close it. We can even discuss lpage_info
> > > update locking on v2.
> > Ok. I'll use the write mmu_lock for updating lpage_info in v2 first.
> 
> Specifically, why do the other lpage_info updating functions take the mmu
> write lock? Are you sure there is no other reason?
1. The read mmu_lock can't prevent other vCPUs from reading stale
   lpage_info.
2. The shadow MMU code in KVM only holds the write mmu_lock, so it updates
   lpage_info under the write mmu_lock.
3. lpage_info is not updated atomically. If two vCPUs updated lpage_info
   concurrently, it could end up holding an invalid value.
4. lpage_info is not updated in performance-critical paths, so there is no
   need to use the read mmu_lock.
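
For illustration only, below is a rough sketch of the atomic alternative
mentioned above, i.e. what it might look like if disallow_lpage were
converted to an atomic_t and updated via cmpxchg so it could be touched
under the read mmu_lock. The struct/helper names and the flag bit here are
made up for the sketch and are not from the actual series:

/*
 * Hypothetical sketch only: assumes disallow_lpage were converted to
 * atomic_t and a spare bit were reserved for the guest-inhibit state.
 * Neither is the case in the current series.
 */
#include <linux/atomic.h>
#include <linux/bits.h>

#define KVM_LPAGE_GUEST_INHIBIT_FLAG_SKETCH	BIT(30)	/* made-up bit */

struct kvm_lpage_info_sketch {
	atomic_t disallow_lpage;
};

/*
 * Setting guest-inhibit is one-way, so a plain atomic OR would be enough
 * even under the read mmu_lock; a lost race only sets the same bit twice.
 */
static void hugepage_set_guest_inhibit_sketch(struct kvm_lpage_info_sketch *linfo)
{
	atomic_or(KVM_LPAGE_GUEST_INHIBIT_FLAG_SKETCH, &linfo->disallow_lpage);
}

/*
 * The count-based updates would then need a cmpxchg loop instead of the
 * current plain "+= count" that is done under the write mmu_lock.
 */
static void update_disallow_lpage_count_sketch(struct kvm_lpage_info_sketch *linfo,
					       int count)
{
	int old = atomic_read(&linfo->disallow_lpage);

	while (!atomic_try_cmpxchg(&linfo->disallow_lpage, &old, old + count))
		;
}

Even with something like the above, point 1 (other vCPUs reading a stale
value under the read lock) would still need its own answer, which is why
I'd rather keep the write mmu_lock for v2.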