On Mon, Aug 11, 2025 at 07:02:10PM -0700, Sean Christopherson wrote:
> On Mon, Aug 11, 2025, Rick P Edgecombe wrote:
> > On Mon, 2025-08-11 at 07:31 +0100, kas@xxxxxxxxxx wrote:
> > > > I don't see any other reason for the global spin lock, Kirill was that
> > > > it? Did you consider also adding a lock per 2MB region, like the
> > > > refcount? Or any other granularity of lock besides global? Not saying
> > > > global is definitely the wrong choice, but seems arbitrary if I got the
> > > > above right.
> > >
> > > We have discussed this before[1]. Global locking is problematic when you
> > > actually hit contention. Let's not complicate things until we actually
> > > see it. I failed to demonstrate contention without huge pages. With huge
> > > pages it is even more dubious that we ever see it.
> > >
> > > [1]
> > > https://lore.kernel.org/all/4bb2119a-ff6d-42b6-acf4-86d87b0e9939@xxxxxxxxx/
> >
> > Ah, I see.
> >
> > I just did a test of simultaneously starting 10 VMs with 16GB of ram (non huge
>
> How many vCPUs? And were the VMs actually accepting/faulting all 16GiB?
>
> There's also a noisy neighbor problem lurking. E.g. malicious/buggy VM spams
> private<=>shared conversions and thus interferes with PAMT allocations for other
> VMs.
>
> > pages) and then shutting them down. I saw 701 contentions on startup, and 53
> > more on shutdown. Total wait time 2ms. Not horrible but not theoretical either.
> > But it probably wasn't much of a cacheline bouncing worse case.
>
> Isn't the SEAMCALL done while holding the spinlock? I assume the latency of the
> SEAMCALL is easily the long pole in the flow.
>
> > And I guess this is on my latest changes not this exact v2, but it shouldn't
> > have changed.
> >
> > But hmm, it seems Dave's objection about maintaining the lock allocations would
> > apply to the refcounts too? But the hotplug concerns shouldn't actually be an
> > issue for TDX because they gets rejected if the allocations are not already
> > there. So complexity of a per-2MB lock should be minimal, at least
> > incrementally. The difference seems more about memory use vs performance.

I don't see a jump to per-2MB locking as remotely justified. We can scale
the number of locks gradually with the amount of memory in the system:
keep a power-of-2 set of locks and map each 2MB range to a lock with %.

Note that this is a trivial thing to add later on and doesn't need to be
part of the initial design.
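A minimal sketch of what I have in mind (the names and pool size below are
made up for illustration, not taken from the patches):

	#include <linux/spinlock.h>
	#include <linux/pgtable.h>

	/* Power-of-2 pool of locks; size could instead be derived from memory size. */
	#define PAMT_LOCK_SHIFT		8
	#define PAMT_LOCK_COUNT		(1UL << PAMT_LOCK_SHIFT)

	static spinlock_t pamt_locks[PAMT_LOCK_COUNT];

	/* One-time setup of the pool. */
	static void pamt_locks_init(void)
	{
		unsigned long i;

		for (i = 0; i < PAMT_LOCK_COUNT; i++)
			spin_lock_init(&pamt_locks[i]);
	}

	/* Map the 2MB range containing @pfn to one lock in the pool. */
	static spinlock_t *pamt_lock_for(unsigned long pfn)
	{
		return &pamt_locks[(pfn >> (PMD_SHIFT - PAGE_SHIFT)) %
				   PAMT_LOCK_COUNT];
	}

Callers would take pamt_lock_for(pfn) where they take the global lock today,
and since the pool size is a power of 2, the % is just a mask.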
> > What gives me pause is in the KVM TDX work we have really tried hard to not take
> > exclusive locks in the shared MMU lock path. Admittedly that wasn't backed by
> > hard numbers.
>
> Maybe not for TDX, but we have lots and lots of hard numbers for why taking mmu_lock
> for write is problematic. Even if TDX VMs don't exhibit the same patterns *today*
> as "normal" VMs, i.e. don't suffer the same performance blips, nothing guarantees
> that will always hold true.
>
> > But an enormous amount of work went into lettings KVM faults happen under the
> > shared lock for normal VMs. So on one hand, yes it's premature optimization.
> > But on the other hand, it's a maintainability concern about polluting the
> > existing way things work in KVM with special TDX properties.
> >
> > I think we need to at least call out loudly that the decision was to go with the
> > simplest possible solution, and the impact to KVM. I'm not sure what Sean's
> > opinion is, but I wouldn't want him to first learn of it when he went digging
> > and found a buried global spin lock in the fault path.
>
> Heh, too late, I saw it when this was first posted. And to be honest, my initial
> reaction was very much "absolutely not" (though Rated R, not PG). Now that I've
> had time to think things through, I'm not _totally_ opposed to having a spinlock
> in the page fault path, but my overall sentiment remains the same.
>
> For mmu_lock and related SPTE operations, I was super adamant about not taking
> exclusive locks because based on our experience with the TDP MMU, converting flows
> from exclusive to shared is usually significantly more work than developing code
> for "shared mode" straightaway (and you note above, that wasn't trivial for TDX).
> And importantly, those code paths were largely solved problems. I.e. I didn't
> want to get into a situation where TDX undid the parallelization of the TDP MMU,
> and then had to add it back after the fact.
>
> I think the same holds true here. I'm not completely opposed to introducing a
> spinlock, but I want to either have a very high level of confidence that the lock
> won't introduce jitter/delay (I have low confidence on this front, at least in
> the proposed patches), or have super clear line of sight to making the contention
> irrelevant, without having to rip apart the code.

I think there is a big difference with mmu_lock. mmu_lock is analogous to
mmap_lock in core-mm: it serializes page faults against other MMU operations
and has an inherently vast scope. pamt_lock, on the other hand, sits at the
very bottom of the call chain and has a very limited scope, and it is
trivially scalable by partitioning.

Translating the problems you see with mmu_lock onto pamt_lock seems like an
overreaction.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov