On Mon, Aug 11, 2025 at 07:31:26PM -0700, Vishal Annapurve wrote:
> On Mon, Aug 11, 2025 at 7:02 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> > On Mon, Aug 11, 2025, Rick P Edgecombe wrote:
> > > On Mon, 2025-08-11 at 07:31 +0100, kas@xxxxxxxxxx wrote:
> > > > > I don't see any other reason for the global spin lock, Kirill, was
> > > > > that it? Did you consider also adding a lock per 2MB region, like the
> > > > > refcount? Or any other granularity of lock besides global? Not saying
> > > > > global is definitely the wrong choice, but it seems arbitrary if I got
> > > > > the above right.
> > > >
> > > > We have discussed this before[1]. Global locking is problematic when you
> > > > actually hit contention. Let's not complicate things until we actually
> > > > see it. I failed to demonstrate contention without huge pages. With huge
> > > > pages it is even more dubious that we will ever see it.
> > > >
> > > > [1] https://lore.kernel.org/all/4bb2119a-ff6d-42b6-acf4-86d87b0e9939@xxxxxxxxx/
> > >
> > > Ah, I see.
> > >
> > > I just did a test of simultaneously starting 10 VMs with 16GB of RAM (non huge
> >
> > How many vCPUs? And were the VMs actually accepting/faulting all 16GiB?
> >
> > There's also a noisy neighbor problem lurking. E.g. a malicious/buggy VM spams
> > private<=>shared conversions and thus interferes with PAMT allocations for
> > other VMs.
> >
> > > pages) and then shutting them down. I saw 701 contentions on startup, and 53
> > > more on shutdown. Total wait time 2ms. Not horrible, but not theoretical
> > > either. But it probably wasn't much of a cacheline-bouncing worst case.
> >
> > Isn't the SEAMCALL done while holding the spinlock? I assume the latency of
> > the SEAMCALL is easily the long pole in the flow.
> >
> > > And I guess this is on my latest changes, not this exact v2, but it shouldn't
> > > have changed.
> > >
> > > But hmm, it seems Dave's objection about maintaining the lock allocations
> > > would apply to the refcounts too? But the hotplug concerns shouldn't actually
> > > be an issue for TDX because hotplug gets rejected if the allocations are not
> > > already there. So the complexity of a per-2MB lock should be minimal, at
> > > least incrementally. The difference seems more about memory use vs
> > > performance.
> > >
> > > What gives me pause is that in the KVM TDX work we have really tried hard to
> > > not take exclusive locks in the shared MMU lock path. Admittedly that wasn't
> > > backed by hard numbers.
> >
> > Maybe not for TDX, but we have lots and lots of hard numbers for why taking
> > mmu_lock for write is problematic. Even if TDX VMs don't exhibit the same
> > patterns *today* as "normal" VMs, i.e. don't suffer the same performance
> > blips, nothing guarantees that will always hold true.
> >
> > > But an enormous amount of work went into letting KVM faults happen under the
> > > shared lock for normal VMs. So on one hand, yes, it's premature optimization.
> > > But on the other hand, it's a maintainability concern about polluting the
> > > existing way things work in KVM with special TDX properties.
> > >
> > > I think we need to at least call out loudly that the decision was to go with
> > > the simplest possible solution, and the impact to KVM. I'm not sure what
> > > Sean's opinion is, but I wouldn't want him to first learn of it when he went
> > > digging and found a buried global spin lock in the fault path.
> >
> > Heh, too late, I saw it when this was first posted. And to be honest, my
> > initial reaction was very much "absolutely not" (though Rated R, not PG). Now
> > that I've had time to think things through, I'm not _totally_ opposed to
> > having a spinlock in the page fault path, but my overall sentiment remains the
> > same.
> >
> > For mmu_lock and related SPTE operations, I was super adamant about not taking
> > exclusive locks because, based on our experience with the TDP MMU, converting
> > flows from exclusive to shared is usually significantly more work than
> > developing code for "shared mode" straightaway (and as you note above, that
> > wasn't trivial for TDX). And importantly, those code paths were largely solved
> > problems. I.e. I didn't want to get into a situation where TDX undid the
> > parallelization of the TDP MMU, and then had to add it back after the fact.
> >
> > I think the same holds true here. I'm not completely opposed to introducing a
> > spinlock, but I want to either have a very high level of confidence that the
> > lock won't introduce jitter/delay (I have low confidence on this front, at
> > least in the proposed patches), or have super clear line of sight to making
> > the contention irrelevant, without having to rip apart the code.
> >
> > My biggest question at this point is: why is all of this being done on-demand?
> > IIUC, we swung from "allocate all PAMT_4K pages upfront" to "allocate all
> > PAMT_4K pages at the last possible moment". Neither of those seems ideal.
> >
> > E.g. for things like TDCS pages and to some extent non-leaf S-EPT pages,
> > on-demand PAMT management seems reasonable. But for PAMTs that are used to
> > track guest-assigned memory, which is the vaaast majority of PAMT memory, why
> > not hook guest_memfd?
>
> This seems fine for 4K page backing. But when TDX VMs have huge page backing,
> the vast majority of private memory wouldn't need PAMT allocation at 4K
> granularity.
>
> IIUC, guest_memfd allocation happening at 2M granularity doesn't necessarily
> translate to a 2M mapping in the guest EPT entries. If DPAMT support is to be
> properly utilized for huge page backing, there is value in not tying PAMT
> allocation to guest_memfd allocation.

Right. It would also require special handling in many places in core-mm. Like,
what happens when a THP in guest memfd gets split? Who would allocate PAMT for
it?

Migration will be more complicated too (when we get there).
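
To make the on-demand scheme we are discussing concrete, here is a rough sketch
(illustrative only; names like pamt_refcounts, pamt_lock_for() and
tdx_pamt_add() are placeholders, not necessarily what the series uses): one
refcount per 2M range of physical memory, with 4K PAMT backing added on the
0->1 transition. The lock behind pamt_lock_for() is global as posted, but it
could be swapped for a small hashed per-2M array later without touching any
callers if contention ever shows up in practice:

	/*
	 * Illustrative sketch only -- not the exact code from the series.
	 * One atomic refcount per 2M range of physical memory decides when
	 * the 4K PAMT backing for that range gets added and removed.
	 */
	static atomic_t *pamt_refcounts;	/* one counter per 2M range */
	static DEFINE_SPINLOCK(pamt_global_lock);

	static atomic_t *pamt_refcount_for(unsigned long hpa)
	{
		return &pamt_refcounts[hpa >> PMD_SHIFT];
	}

	static spinlock_t *pamt_lock_for(unsigned long hpa)
	{
		/* Global today; could return a hashed per-2M lock instead. */
		return &pamt_global_lock;
	}

	/* Called before a 4K page in the 2M range is handed to the TDX module. */
	static int tdx_pamt_get(unsigned long hpa)
	{
		atomic_t *ref = pamt_refcount_for(hpa);
		spinlock_t *lock = pamt_lock_for(hpa);
		int ret = 0;

		/* Fast path: the 2M range already has 4K PAMT backing. */
		if (atomic_inc_not_zero(ref))
			return 0;

		/* Real code would allocate the PAMT pages before taking the lock. */
		spin_lock(lock);
		if (atomic_read(ref)) {
			atomic_inc(ref);
		} else {
			/* tdx_pamt_add() stands in for the SEAMCALL that adds backing. */
			ret = tdx_pamt_add(hpa);
			if (!ret)
				atomic_set(ref, 1);
		}
		spin_unlock(lock);

		return ret;
	}

A symmetric tdx_pamt_put() would drop the backing on the 1->0 transition. The
refcount is keyed by the physical 2M range, not by how guest_memfd allocated
the memory, which is why a THP split only changes how often the slow path is
taken rather than requiring a separate hook.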

--
  Kiryl Shutsemau / Kirill A. Shutemov