On Wed, Jul 02, 2025 at 08:18:48AM +0800, Edgecombe, Rick P wrote:
> On Wed, 2025-07-02 at 08:12 +0800, Yan Zhao wrote:
> > > Then (3) could be skipped in the case of ability to demote under read
> > > lock?
> > > 
> > > I noticed that the other lpage_info updaters took mmu write lock, and I
> > > wasn't sure why. We shouldn't take a lock that we don't actually need
> > > just for safety margin or to copy other code.
> > There is a reason to use the write mmu_lock.
> > 
> > In the 3 steps,
> > 1. set lpage_info
> > 2. demote if needed
> > 3. go to fault handler
> > 
> > Step 2 requires holding the write mmu_lock before invoking
> > kvm_split_boundary_leafs().
> > The write mmu_lock may also be dropped and re-acquired in
> > kvm_split_boundary_leafs(), e.g. for memory allocation.
> > 
> > If 1 is done with the read mmu_lock, other vCPUs can still fault in at
> > 2MB level after the demote in step 2.
> > Luckily, TDX doesn't support promotion now.
> > But we can avoid wading into this complex situation by holding the write
> > mmu_lock in 1.
> 
> I don't think "some code might race in the future" is a good reason to
> take the write lock.
I still prefer to hold the write mmu_lock right now.
Otherwise, we at least need to convert disallow_lpage to an atomic variable
and update it atomically, e.g. via cmpxchg.

struct kvm_lpage_info {
	int disallow_lpage;
};

> 
> > > > > For most accept cases, we could fault in the PTE's on the read
> > > > > lock. And in the future we could
> > > > 
> > > > The actual mapping at 4KB level is still done with the read mmu_lock
> > > > in __vmx_handle_ept_violation().
> > > > 
> > > > > have a demote that could work under read lock, as we talked. So
> > > > > kvm_split_boundary_leafs() often or could be unneeded or work under
> > > > > read lock when needed.
> > > > Could we leave the "demote under read lock" as a future optimization?
> > > 
> > > We could add it to the list. If we have a TDX module that supports
> > > demote with a single SEAMCALL then we don't have the rollback problem.
> > > The optimization could utilize that. That said, we should focus on the
> > > optimizations that make the biggest difference to real TDs. Your data
> > > suggests this might not be the case today.
> > Ok.
> > 
> > > > > What is the problem in hugepage_set_guest_inhibit() that requires
> > > > > the write lock?
> > > > As above, to avoid other vCPUs reading a stale mapping level and
> > > > splitting under the read mmu_lock.
> > > 
> > > We need the mmu write lock for demote, but as long as the order is:
> > > 1. set lpage_info
> > > 2. demote if needed
> > > 3. go to fault handler
> > > 
> > > Then (3) should have what it needs even if another fault races (1).
> > See the above comment for why we need to hold the write mmu_lock for 1.
> > 
> > Besides, as we need the write mmu_lock anyway for 2 (i.e. hold the write
> > mmu_lock before walking the SPTEs to check if there's any existing
> > mapping), I don't see any performance impact from holding the write
> > mmu_lock for 1.
> 
> It's a maintainability problem too. Someday someone may want to remove it
> and scratch their head for what race they are missing.
I don't get why holding the write mmu_lock would cause a maintainability
problem.
In contrast, if we want to use the read mmu_lock in the future, we need to
carefully check whether there's any potential risk.

> 
> > > > As guest_inhibit is set one-way, we could test it using
> > > > hugepage_test_guest_inhibit() without holding the lock. The chance of
> > > > holding the write mmu_lock for hugepage_set_guest_inhibit() is then
> > > > greatly reduced (in my testing, 11 times during VM boot).
> > > > 
> > > > > But in any case, it seems like we have *a* solution here. It
> > > > > doesn't seem like there are any big downsides. Should we close it?
> > > > I think it's good, as long as Sean doesn't disagree :)
> > > 
> > > He seemed onboard. Let's close it. We can even discuss lpage_info
> > > update locking on v2.
> > Ok. I'll use the write mmu_lock for updating lpage_info in v2 first.
> 
> Specifically, why do the other lpage_info updating functions take the mmu
> write lock? Are you sure there is no other reason?
1. The read mmu_lock can't prevent other vCPUs from reading stale
   lpage_info.
2. The shadow MMU code in KVM only holds the write mmu_lock, so it updates
   lpage_info under the write mmu_lock.
3. lpage_info is not updated atomically. If two vCPUs updated lpage_info
   concurrently, it could end up holding an invalid value.
4. lpage_info is not updated in performance-critical paths, so there is no
   need to use the read mmu_lock.
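
For illustration only, below is a rough sketch of the atomic alternative
mentioned above, i.e. what it might look like if disallow_lpage were
converted to an atomic_t and updated via cmpxchg so it could be touched
under the read mmu_lock. The struct/helper names and the flag bit here are
made up for the sketch and are not from the actual series:

/*
 * Hypothetical sketch only: assumes disallow_lpage were converted to
 * atomic_t and a spare bit were reserved for the guest-inhibit state.
 * Neither is the case in the current series.
 */
#include <linux/atomic.h>
#include <linux/bits.h>

#define KVM_LPAGE_GUEST_INHIBIT_FLAG_SKETCH	BIT(30)	/* made-up bit */

struct kvm_lpage_info_sketch {
	atomic_t disallow_lpage;
};

/*
 * Setting guest-inhibit is one-way, so a plain atomic OR would be enough
 * even under the read mmu_lock; a lost race only sets the same bit twice.
 */
static void hugepage_set_guest_inhibit_sketch(struct kvm_lpage_info_sketch *linfo)
{
	atomic_or(KVM_LPAGE_GUEST_INHIBIT_FLAG_SKETCH, &linfo->disallow_lpage);
}

/*
 * The count-based updates would then need a cmpxchg loop instead of the
 * current plain "+= count" that is done under the write mmu_lock.
 */
static void update_disallow_lpage_count_sketch(struct kvm_lpage_info_sketch *linfo,
					       int count)
{
	int old = atomic_read(&linfo->disallow_lpage);

	while (!atomic_try_cmpxchg(&linfo->disallow_lpage, &old, old + count))
		;
}

Even with something like the above, point 1 (other vCPUs reading a stale
value under the read lock) would still need its own answer, which is why
I'd rather keep the write mmu_lock for v2.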