Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE

Yan Zhao <yan.y.zhao@xxxxxxxxx> · Tue, 27 May 2025 09:31:36 +0800

On Sat, May 24, 2025 at 07:40:25AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2025-05-22 at 11:52 +0800, Yan Zhao wrote:
> > On Wed, May 21, 2025 at 11:40:15PM +0800, Edgecombe, Rick P wrote:
> > > On Tue, 2025-05-20 at 17:34 +0800, Yan Zhao wrote:
> > > > So, you want to disallow huge pages for non-Linux TDs, then we have no need
> > > > to support splitting in the fault path, right?
> > > > 
> > > > I'm OK if we don't care non-Linux TDs for now.
> > > > This can simplify the splitting code and we can add the support when there's a
> > > > need.
> > > 
> > > We do need to care about non-Linux TDs functioning, but we don't need to
> > > optimize for them at this point. We need to optimize for things that happen
> > > often. Pending-#VE using TDs are rare, and don't need to have huge pages in
> > > order to work.
> > > 
> > > Yesterday Kirill and I were chatting offline about the newly defined
> > > TDG.MEM.PAGE.RELEASE. It is kind of like an unaccept, so another possibility is:
> > > 1. Guest accepts at 2MB
> > > 2. Guest releases at 2MB (no notice to VMM)
> > > 3. Guest accepts at 4k, EPT violation with expectation to demote
> > > 
> > > In that case, KVM won't know to expect it, and that it needs to preemptively map
> > > things at 4k.
> > > 
> > > For full coverage of the issue, can we discuss a little bit about what demote in
> > > the fault path would look like?
> > For demote in the fault path, it will take mmu read lock.
> > 
> > So, the flow in the fault path is
> > 1. zap with mmu read lock.
> >    ret = tdx_sept_zap_private_spte(kvm, gfn, level, page, true);
> >    if (ret <= 0)
> >        return ret;
> > 2. track with mmu read lock
> >    ret = tdx_track(kvm, true);
> >    if (ret)
> >        return ret;
> > 3. demote with mmu read lock
> >    ret = tdx_spte_demote_private_spte(kvm, gfn, level, page, true);
> >    if (ret)
> >        goto err;
> > 4. return success or unzap as error fallback.
> >    tdx_sept_unzap_private_spte(kvm, gfn, level);
> > 
> > Steps 1-3 will return -EBUSY on busy error (which will not be very often as we
> > will introduce kvm_tdx->sept_lock. I can post the full lock analysis if
> > necessary).
> 
> That is true that it would not be taken very often. It's not a performance
> issue, but I think we should not add a lock if we can at all avoid it. It
> creates a special case for TDX for the TDP MMU. People would have to then keep
> in mind that two mmu read lock threads could still still contend.
Hmm, without the kvm_tdx->sept_lock, we can return retry if busy error is
returned from tdh_mem_range_block(). However, we need to ensure the success of
tdh_mem_range_unblock() before completing the split.

Besides, we need the kvm_tdx->track_lock to serialize tdh_mem_track() and
kicking off vCPUs. In the base series, we use write kvm->mmu_lock to achieve
this purpose.

BTW: Looks Kirill's DPAMT series will introduce a pamt_lock [1]. 
[1] https://lore.kernel.org/all/20250502130828.4071412-6-kirill.shutemov@xxxxxxxxxxxxxxx/

> > > The current zapping operation that is involved
> > > depends on mmu write lock. And I remember you had a POC that added essentially a
> > > hidden exclusive lock in TDX code as a substitute. But unlike the other callers,
> > Right, The kvm_tdx->sept_lock is introduced as a rw lock. The write lock is held
> > in a very short period, around tdh_mem_sept_remove(), tdh_mem_range_block(),
> > tdh_mem_range_unblock().
> > 
> > The read/write status of the kvm_tdx->sept_lock corresponds to that in the TDX
> > module.
> > 
> >   Resources          SHARED  users              EXCLUSIVE users 
> > -----------------------------------------------------------------------
> >  secure_ept_lock   tdh_mem_sept_add            tdh_vp_enter
> >                    tdh_mem_page_aug            tdh_mem_sept_remove
> >                    tdh_mem_page_remove         tdh_mem_range_block
> >                    tdh_mem_page_promote        tdh_mem_range_unblock
> >                    tdh_mem_page_demote
> > 
> > > the fault path demote case could actually handle failure. So if we just returned
> > > busy and didn't try to force the retry, we would just run the risk of
> > > interfering with TDX module sept lock? Is that the only issue with a design that
> > > would allows failure of demote in the fault path?
> > The concern to support split in the fault path is mainly to avoid unnecesssary
> > split, e.g., when two vCPUs try to accept at different levels.
> 
> We are just talking about keeping rare TDs functional here, right? Two cases
> are:
>  - TDs using PAGE.RELEASE
This is for future linux TDs, right?

>  - TDs using pending #VEs and accepting memory in strange patterns
> 
> Not maintaining huge pages there seems totally acceptable. How I look at this
> whole thing is that it just an optimization, not a feature. Every aspect has a
> complexity/performance tradeoff that we need to make a sensible decision on.
> Maintaining huge page mappings in every possible case is not the goal.
So, can I interpret your preference as follows?
For now,
- Do not support huge pages on non-linux TDs.
- Do not support page splitting in fault path.

> > 
> > Besides that we need to introduce 3 locks inside TDX:
> > rwlock_t sept_lock, spinlock_t no_vcpu_enter_lock, spinlock_t track_lock.
> 
> Huh?
In the base series, no_vcpu_enter_lock and track_lock are saved by holding the
write kvm->mmu_lock.

> 
> > 
> > To ensure the success of unzap (to restore the state), kicking of vCPUs in the
> > fault path is required, which is not ideal. But with the introduced lock and the
> > proposed TDX modules's change to tdg_mem_page_accept() (as in the next comment),
> > the chance to invoke unzap is very low.
> 
> Yes, it's probably not safe to expect the exact same demote call chain again.
> The fault path could maybe learn to recover from the blocked state?
Do you mean you want to introduce a blocked state in the mirror page table?
I don't like it for its complexity.

Do you think we can try to ask for tdh_mem_page_demote() not to use
tdh_mem_range_block() and tdh_mem_range_unblock(). Looks it's anyway required
for TDX connect.

If that's true, the tdh_mem_range_{un}block()/tdh_mem_track() can be avoided in
the fault path.

> > 
> > > Let's keep in mind that we could ask for TDX module changes to enable this path.
> > We may need TDX module's change to let tdg_mem_page_accept() not to take lock on
> > an non-ACCEPTable entry to avoid contention with guest and the potential error
> > TDX_HOST_PRIORITY_BUSY_TIMEOUT.
> 
> Part of that is already in the works (accepting not-present entries). It seems
> reasonable. But also, what about looking at having the TDX module do the full
> demote operation internally. The track part obviously happens outside of the TDX
> module, but maybe the whole thing could be simplified.
> 
> > 
> > > I think we could probably get away with ignoring TDG.MEM.PAGE.RELEASE if we had
> > > a plan to fix it up with TDX module changes. And if the ultimate root cause of
> > > the complication is avoiding zero-step (sept lock), we should fix that instead
> > > of design around it further.
> > Ok.
> > 
> > > > 
> 
> I'll respond to the error code half of this mail separately.