On Mon, Aug 11, 2025 at 07:31:26PM -0700, Vishal Annapurve wrote:
> On Mon, Aug 11, 2025 at 7:02 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> > On Mon, Aug 11, 2025, Rick P Edgecombe wrote:
> > > On Mon, 2025-08-11 at 07:31 +0100, kas@xxxxxxxxxx wrote:
> > > > > I don't see any other reason for the global spin lock, Kirill, was
> > > > > that it? Did you consider also adding a lock per 2MB region, like the
> > > > > refcount? Or any other granularity of lock besides global? Not saying
> > > > > global is definitely the wrong choice, but it seems arbitrary if I got
> > > > > the above right.
> > > >
> > > > We have discussed this before[1]. Global locking is problematic when you
> > > > actually hit contention. Let's not complicate things until we actually
> > > > see it. I failed to demonstrate contention without huge pages. With huge
> > > > pages it is even more dubious that we will ever see it.
> > > >
> > > > [1] https://lore.kernel.org/all/4bb2119a-ff6d-42b6-acf4-86d87b0e9939@xxxxxxxxx/
> > >
> > > Ah, I see.
> > >
> > > I just did a test of simultaneously starting 10 VMs with 16GB of RAM (non huge
> >
> > How many vCPUs? And were the VMs actually accepting/faulting all 16GiB?
> >
> > There's also a noisy neighbor problem lurking. E.g. a malicious/buggy VM spams
> > private<=>shared conversions and thus interferes with PAMT allocations for
> > other VMs.
> >
> > > pages) and then shutting them down. I saw 701 contentions on startup, and 53
> > > more on shutdown. Total wait time 2ms. Not horrible, but not theoretical
> > > either. But it probably wasn't much of a cacheline-bouncing worst case.
> >
> > Isn't the SEAMCALL done while holding the spinlock? I assume the latency of
> > the SEAMCALL is easily the long pole in the flow.
> >
> > > And I guess this is on my latest changes, not this exact v2, but it shouldn't
> > > have changed.
> > >
> > > But hmm, it seems Dave's objection about maintaining the lock allocations
> > > would apply to the refcounts too? But the hotplug concerns shouldn't actually
> > > be an issue for TDX because hotplug gets rejected if the allocations are not
> > > already there. So the complexity of a per-2MB lock should be minimal, at
> > > least incrementally. The difference seems more about memory use vs
> > > performance.
> > >
> > > What gives me pause is that in the KVM TDX work we have really tried hard to
> > > not take exclusive locks in the shared MMU lock path. Admittedly that wasn't
> > > backed by hard numbers.
> >
> > Maybe not for TDX, but we have lots and lots of hard numbers for why taking
> > mmu_lock for write is problematic. Even if TDX VMs don't exhibit the same
> > patterns *today* as "normal" VMs, i.e. don't suffer the same performance
> > blips, nothing guarantees that will always hold true.
> >
> > > But an enormous amount of work went into letting KVM faults happen under the
> > > shared lock for normal VMs. So on one hand, yes, it's premature optimization.
> > > But on the other hand, it's a maintainability concern about polluting the
> > > existing way things work in KVM with special TDX properties.
> > >
> > > I think we need to at least call out loudly that the decision was to go with
> > > the simplest possible solution, and the impact to KVM. I'm not sure what
> > > Sean's opinion is, but I wouldn't want him to first learn of it when he went
> > > digging and found a buried global spin lock in the fault path.
> >
> > Heh, too late, I saw it when this was first posted. And to be honest, my
> > initial reaction was very much "absolutely not" (though Rated R, not PG). Now
> > that I've had time to think things through, I'm not _totally_ opposed to
> > having a spinlock in the page fault path, but my overall sentiment remains the
> > same.
> >
> > For mmu_lock and related SPTE operations, I was super adamant about not taking
> > exclusive locks because, based on our experience with the TDP MMU, converting
> > flows from exclusive to shared is usually significantly more work than
> > developing code for "shared mode" straightaway (and as you note above, that
> > wasn't trivial for TDX). And importantly, those code paths were largely solved
> > problems. I.e. I didn't want to get into a situation where TDX undid the
> > parallelization of the TDP MMU, and then had to add it back after the fact.
> >
> > I think the same holds true here. I'm not completely opposed to introducing a
> > spinlock, but I want to either have a very high level of confidence that the
> > lock won't introduce jitter/delay (I have low confidence on this front, at
> > least in the proposed patches), or have super clear line of sight to making
> > the contention irrelevant, without having to rip apart the code.
> >
> > My biggest question at this point is: why is all of this being done on-demand?
> > IIUC, we swung from "allocate all PAMT_4K pages upfront" to "allocate all
> > PAMT_4K pages at the last possible moment". Neither of those seems ideal.
> >
> > E.g. for things like TDCS pages and to some extent non-leaf S-EPT pages,
> > on-demand PAMT management seems reasonable. But for PAMTs that are used to
> > track guest-assigned memory, which is the vaaast majority of PAMT memory, why
> > not hook guest_memfd?
>
> This seems fine for 4K page backing. But when TDX VMs have huge page backing,
> the vast majority of private memory wouldn't need PAMT allocation at 4K
> granularity.
>
> IIUC, guest_memfd allocation happening at 2M granularity doesn't necessarily
> translate to a 2M mapping in the guest EPT entries. If DPAMT support is to be
> properly utilized for huge page backing, there is value in not tying PAMT
> allocation to guest_memfd allocation.

Right. It would also require special handling in many places in core-mm. Like,
what happens when a THP in guest memfd gets split? Who would allocate PAMT for
it?

Migration will be more complicated too (when we get there).
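
To make the on-demand scheme we are discussing concrete, here is a rough sketch
(illustrative only; names like pamt_refcounts, pamt_lock_for() and
tdx_pamt_add() are placeholders, not necessarily what the series uses): one
refcount per 2M range of physical memory, with 4K PAMT backing added on the
0->1 transition. The lock behind pamt_lock_for() is global as posted, but it
could be swapped for a small hashed per-2M array later without touching any
callers if contention ever shows up in practice:

	/*
	 * Illustrative sketch only -- not the exact code from the series.
	 * One atomic refcount per 2M range of physical memory decides when
	 * the 4K PAMT backing for that range gets added and removed.
	 */
	static atomic_t *pamt_refcounts;	/* one counter per 2M range */
	static DEFINE_SPINLOCK(pamt_global_lock);

	static atomic_t *pamt_refcount_for(unsigned long hpa)
	{
		return &pamt_refcounts[hpa >> PMD_SHIFT];
	}

	static spinlock_t *pamt_lock_for(unsigned long hpa)
	{
		/* Global today; could return a hashed per-2M lock instead. */
		return &pamt_global_lock;
	}

	/* Called before a 4K page in the 2M range is handed to the TDX module. */
	static int tdx_pamt_get(unsigned long hpa)
	{
		atomic_t *ref = pamt_refcount_for(hpa);
		spinlock_t *lock = pamt_lock_for(hpa);
		int ret = 0;

		/* Fast path: the 2M range already has 4K PAMT backing. */
		if (atomic_inc_not_zero(ref))
			return 0;

		/* Real code would allocate the PAMT pages before taking the lock. */
		spin_lock(lock);
		if (atomic_read(ref)) {
			atomic_inc(ref);
		} else {
			/* tdx_pamt_add() stands in for the SEAMCALL that adds backing. */
			ret = tdx_pamt_add(hpa);
			if (!ret)
				atomic_set(ref, 1);
		}
		spin_unlock(lock);

		return ret;
	}

A symmetric tdx_pamt_put() would drop the backing on the 1->0 transition. The
refcount is keyed by the physical 2M range, not by how guest_memfd allocated
the memory, which is why a THP split only changes how often the slow path is
taken rather than requiring a separate hook.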

--
  Kiryl Shutsemau / Kirill A. Shutemov