On Wed, 2025-07-02 at 08:12 +0800, Yan Zhao wrote:
> > Then (3) could be skipped in the case of ability to demote under read
> > lock?
> > 
> > I noticed that the other lpage_info updaters took mmu write lock, and I
> > wasn't sure why. We shouldn't take a lock that we don't actually need
> > just for safety margin or to copy other code.
> Use write mmu_lock is of reason.
> 
> In the 3 steps,
> 1. set lpage_info
> 2. demote if needed
> 3. go to fault handler
> 
> Step 2 requires holding write mmu_lock before invoking
> kvm_split_boundary_leafs().
> The write mmu_lock is also possible to get dropped and re-acquired in
> kvm_split_boundary_leafs() for purpose like memory allocation.
> 
> If 1 is with read mmu_lock, the other vCPUs is still possible to fault
> in at 2MB level after the demote in step 2.
> Luckily, current TDX doesn't support promotion now.
> But we can avoid wading into this complex situation by holding write
> mmu_lock in 1.

I don't think the possibility that some code might race in the future is a
good reason to take the write lock.

> > > > For most accept cases, we could fault in the PTE's on the read
> > > > lock. And in the future we could
> > > 
> > > The actual mapping at 4KB level is still with read mmu_lock in
> > > __vmx_handle_ept_violation().
> > > 
> > > > have a demote that could work under read lock, as we talked. So
> > > > kvm_split_boundary_leafs() often or could be unneeded or work under
> > > > read lock when needed.
> > > Could we leave the "demote under read lock" as a future optimization?
> > 
> > We could add it to the list. If we have a TDX module that supports
> > demote with a single SEAMCALL then we don't have the rollback problem.
> > The optimization could utilize that. That said, we should focus on the
> > optimizations that make the biggest difference to real TDs. Your data
> > suggests this might not be the case today.
> Ok.
> 
> > > > What is the problem in hugepage_set_guest_inhibit() that requires
> > > > the write lock?
> > > As above, to avoid the other vCPUs reading stale mapping level and
> > > splitting under read mmu_lock.
> > 
> > We need mmu write lock for demote, but as long as the order is:
> > 1. set lpage_info
> > 2. demote if needed
> > 3. go to fault handler
> > 
> > Then (3) should have what it needs even if another fault races (1).
> See the above comment for why we need to hold write mmu_lock for 1.
> 
> Besides, as we need write mmu_lock anyway for 2 (i.e. hold write
> mmu_lock before walking the SPTEs to check if there's any existing
> mapping), I don't see any performance impact by holding write mmu_lock
> for 1.

It's a maintainability problem too. Someday someone may want to remove it
and scratch their head over what race they are missing.

> > > As guest_inhibit is set one-way, we could test it using
> > > hugepage_test_guest_inhibit() without holding the lock. The chance to
> > > hold write mmu_lock for hugepage_set_guest_inhibit() is then greatly
> > > reduced. (in my testing, 11 during VM boot).
> > > 
> > > > But in any case, it seems like we have *a* solution here. It
> > > > doesn't seem like there are any big downsides. Should we close it?
> > > I think it's good, as long as Sean doesn't disagree :)
> > 
> > He seemed onboard. Let's close it. We can even discuss lpage_info
> > update locking on v2.
> Ok. I'll use write mmu_lock for updating lpage_info in v2 first.

Specifically, why do the other lpage_info updating functions take the mmu
write lock?
Are you sure there is no other reason?
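
For v2 reference, the ordering I have in mind looks roughly like the sketch
below. The wrapper name tdx_guest_inhibit_hugepage() is made up, and the
signatures of hugepage_test_guest_inhibit(), hugepage_set_guest_inhibit()
and kvm_split_boundary_leafs() are assumed from this series, so treat it as
an illustration of the lock ordering rather than actual code:

/*
 * Rough sketch only: tdx_guest_inhibit_hugepage() is a made-up wrapper,
 * and the signatures of hugepage_test_guest_inhibit(),
 * hugepage_set_guest_inhibit() and kvm_split_boundary_leafs() are
 * assumed from this series rather than taken from final code.
 */
static int tdx_guest_inhibit_hugepage(struct kvm *kvm,
				      struct kvm_memory_slot *slot, gfn_t gfn)
{
	int ret;

	/*
	 * guest_inhibit is set one-way, so an unlocked test lets every
	 * fault after the first skip the write-lock path entirely.
	 */
	if (hugepage_test_guest_inhibit(slot, gfn, PG_LEVEL_2M))
		return 0;

	write_lock(&kvm->mmu_lock);

	/* 1. set lpage_info so no new 2MB mapping can be created */
	hugepage_set_guest_inhibit(slot, gfn, PG_LEVEL_2M);

	/*
	 * 2. demote any existing 2MB leaf; this may drop and re-acquire
	 *    mmu_lock internally (e.g. for memory allocation).
	 */
	ret = kvm_split_boundary_leafs(kvm, slot, gfn);

	write_unlock(&kvm->mmu_lock);

	/*
	 * 3. the caller then goes to the normal fault path, which maps at
	 *    4KB level under read mmu_lock via __vmx_handle_ept_violation().
	 */
	return ret;
}

With the one-way unlocked test up front, the write-lock path should only be
hit on the first accept fault for a given 2MB region, which matches the
small counts you measured during boot.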