On Sat, May 24, 2025 at 07:40:25AM +0800, Edgecombe, Rick P wrote: > On Thu, 2025-05-22 at 11:52 +0800, Yan Zhao wrote: > > On Wed, May 21, 2025 at 11:40:15PM +0800, Edgecombe, Rick P wrote: > > > On Tue, 2025-05-20 at 17:34 +0800, Yan Zhao wrote: > > > > So, you want to disallow huge pages for non-Linux TDs, then we have no need > > > > to support splitting in the fault path, right? > > > > > > > > I'm OK if we don't care non-Linux TDs for now. > > > > This can simplify the splitting code and we can add the support when there's a > > > > need. > > > > > > We do need to care about non-Linux TDs functioning, but we don't need to > > > optimize for them at this point. We need to optimize for things that happen > > > often. Pending-#VE using TDs are rare, and don't need to have huge pages in > > > order to work. > > > > > > Yesterday Kirill and I were chatting offline about the newly defined > > > TDG.MEM.PAGE.RELEASE. It is kind of like an unaccept, so another possibility is: > > > 1. Guest accepts at 2MB > > > 2. Guest releases at 2MB (no notice to VMM) > > > 3. Guest accepts at 4k, EPT violation with expectation to demote > > > > > > In that case, KVM won't know to expect it, and that it needs to preemptively map > > > things at 4k. > > > > > > For full coverage of the issue, can we discuss a little bit about what demote in > > > the fault path would look like? > > For demote in the fault path, it will take mmu read lock. > > > > So, the flow in the fault path is > > 1. zap with mmu read lock. > > ret = tdx_sept_zap_private_spte(kvm, gfn, level, page, true); > > if (ret <= 0) > > return ret; > > 2. track with mmu read lock > > ret = tdx_track(kvm, true); > > if (ret) > > return ret; > > 3. demote with mmu read lock > > ret = tdx_spte_demote_private_spte(kvm, gfn, level, page, true); > > if (ret) > > goto err; > > 4. return success or unzap as error fallback. > > tdx_sept_unzap_private_spte(kvm, gfn, level); > > > > Steps 1-3 will return -EBUSY on busy error (which will not be very often as we > > will introduce kvm_tdx->sept_lock. I can post the full lock analysis if > > necessary). > > That is true that it would not be taken very often. It's not a performance > issue, but I think we should not add a lock if we can at all avoid it. It > creates a special case for TDX for the TDP MMU. People would have to then keep > in mind that two mmu read lock threads could still still contend. Hmm, without the kvm_tdx->sept_lock, we can return retry if busy error is returned from tdh_mem_range_block(). However, we need to ensure the success of tdh_mem_range_unblock() before completing the split. Besides, we need the kvm_tdx->track_lock to serialize tdh_mem_track() and kicking off vCPUs. In the base series, we use write kvm->mmu_lock to achieve this purpose. BTW: Looks Kirill's DPAMT series will introduce a pamt_lock [1]. [1] https://lore.kernel.org/all/20250502130828.4071412-6-kirill.shutemov@xxxxxxxxxxxxxxx/ > > > The current zapping operation that is involved > > > depends on mmu write lock. And I remember you had a POC that added essentially a > > > hidden exclusive lock in TDX code as a substitute. But unlike the other callers, > > Right, The kvm_tdx->sept_lock is introduced as a rw lock. The write lock is held > > in a very short period, around tdh_mem_sept_remove(), tdh_mem_range_block(), > > tdh_mem_range_unblock(). > > > > The read/write status of the kvm_tdx->sept_lock corresponds to that in the TDX > > module. > > > > Resources SHARED users EXCLUSIVE users > > ----------------------------------------------------------------------- > > secure_ept_lock tdh_mem_sept_add tdh_vp_enter > > tdh_mem_page_aug tdh_mem_sept_remove > > tdh_mem_page_remove tdh_mem_range_block > > tdh_mem_page_promote tdh_mem_range_unblock > > tdh_mem_page_demote > > > > > the fault path demote case could actually handle failure. So if we just returned > > > busy and didn't try to force the retry, we would just run the risk of > > > interfering with TDX module sept lock? Is that the only issue with a design that > > > would allows failure of demote in the fault path? > > The concern to support split in the fault path is mainly to avoid unnecesssary > > split, e.g., when two vCPUs try to accept at different levels. > > We are just talking about keeping rare TDs functional here, right? Two cases > are: > - TDs using PAGE.RELEASE This is for future linux TDs, right? > - TDs using pending #VEs and accepting memory in strange patterns > > Not maintaining huge pages there seems totally acceptable. How I look at this > whole thing is that it just an optimization, not a feature. Every aspect has a > complexity/performance tradeoff that we need to make a sensible decision on. > Maintaining huge page mappings in every possible case is not the goal. So, can I interpret your preference as follows? For now, - Do not support huge pages on non-linux TDs. - Do not support page splitting in fault path. > > > > Besides that we need to introduce 3 locks inside TDX: > > rwlock_t sept_lock, spinlock_t no_vcpu_enter_lock, spinlock_t track_lock. > > Huh? In the base series, no_vcpu_enter_lock and track_lock are saved by holding the write kvm->mmu_lock. > > > > > To ensure the success of unzap (to restore the state), kicking of vCPUs in the > > fault path is required, which is not ideal. But with the introduced lock and the > > proposed TDX modules's change to tdg_mem_page_accept() (as in the next comment), > > the chance to invoke unzap is very low. > > Yes, it's probably not safe to expect the exact same demote call chain again. > The fault path could maybe learn to recover from the blocked state? Do you mean you want to introduce a blocked state in the mirror page table? I don't like it for its complexity. Do you think we can try to ask for tdh_mem_page_demote() not to use tdh_mem_range_block() and tdh_mem_range_unblock(). Looks it's anyway required for TDX connect. If that's true, the tdh_mem_range_{un}block()/tdh_mem_track() can be avoided in the fault path. > > > > > Let's keep in mind that we could ask for TDX module changes to enable this path. > > We may need TDX module's change to let tdg_mem_page_accept() not to take lock on > > an non-ACCEPTable entry to avoid contention with guest and the potential error > > TDX_HOST_PRIORITY_BUSY_TIMEOUT. > > Part of that is already in the works (accepting not-present entries). It seems > reasonable. But also, what about looking at having the TDX module do the full > demote operation internally. The track part obviously happens outside of the TDX > module, but maybe the whole thing could be simplified. > > > > > > I think we could probably get away with ignoring TDG.MEM.PAGE.RELEASE if we had > > > a plan to fix it up with TDX module changes. And if the ultimate root cause of > > > the complication is avoiding zero-step (sept lock), we should fix that instead > > > of design around it further. > > Ok. > > > > > > > > I'll respond to the error code half of this mail separately.