On Thu, 2025-08-28 at 17:06 -0700, Sean Christopherson wrote:
> When populating the initial memory image for a TDX guest, ADD pages to the
> TD as part of establishing the mappings in the mirror EPT, as opposed to
> creating the mappings and then doing ADD after the fact. Doing ADD in the
> S-EPT callbacks eliminates the need to track "premapped" pages, as the
> mirror EPT (M-EPT) and S-EPT are always synchronized, e.g. if ADD fails,
> KVM reverts to the previous M-EPT entry (guaranteed to be !PRESENT).
>
> Eliminating the hole where the M-EPT can have a mapping that doesn't exist
> in the S-EPT in turn obviates the need to handle errors that are unique to
> encountering a missing S-EPT entry (see tdx_is_sept_zap_err_due_to_premap()).
>
> Keeping the M-EPT and S-EPT synchronized also eliminates the need to check
> for unconsumed "premap" entries during tdx_td_finalize(), as there simply
> can't be any such entries. Dropping that check in particular reduces the
> overall cognitive load, as the management of nr_premapped with respect
> to removal of S-EPT entries is _very_ subtle. E.g. successful removal of
> an S-EPT entry after it completed ADD doesn't adjust nr_premapped, but
> it's not clear why that's "ok" while having half-baked entries is not
> (it's not truly "ok" in that removing pages from the image will likely
> prevent the guest from booting, but from KVM's perspective it's "ok").
>
> Doing ADD in the S-EPT path requires passing an argument via a scratch
> field, but the current approach of tracking the number of "premapped"
> pages effectively does the same. And the "premapped" counter is much more
> dangerous, as it doesn't have a singular lock to protect its usage, since
> nr_premapped can be modified as soon as mmu_lock is dropped, at least in
> theory. I.e. nr_premapped is guarded by slots_lock, but only for "happy"
> paths.
>
> Note, this approach was used/tried at various points in TDX development,
> but was ultimately discarded due to a desire to avoid stashing temporary
> state in kvm_tdx. But as above, KVM ended up with such state anyways,
> and fully committing to using temporary state provides better access
> rules (100% guarded by slots_lock), and makes several edge cases flat out
> impossible.
>
> Note #2, continue to extend the measurement outside of mmu_lock, as it's
> a slow operation (typically 16 SEAMCALLs per page whose data is included
> in the measurement), and doesn't *need* to be done under mmu_lock, e.g.
> for consistency purposes. However, MR.EXTEND isn't _that_ slow, e.g.
> ~1ms latency to measure a full page, so if it needs to be done under
> mmu_lock in the future, e.g. because KVM gains a flow that can remove
> S-EPT entries uring KVM_TDX_INIT_MEM_REGION, then extending the
                ^using
> measurement can also be moved into the S-EPT mapping path (again, only if
> absolutely necessary). P.S. _If_ MR.EXTEND is moved into the S-EPT path,
> take care not to return an error up the stack if TDH_MR_EXTEND fails, as
> removing the M-EPT entry but not the S-EPT entry would result in
> inconsistent state!
>
> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> ---

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@xxxxxxxxx>

But some possible unintended changes below.
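First, to restate my understanding of the handoff this relies on, in
isolation (a condensed sketch of the flow described above, not the
actual patch; post_populate_sketch is a made-up name):

static int post_populate_sketch(struct kvm *kvm, struct kvm_vcpu *vcpu,
				gfn_t gfn, kvm_pfn_t pfn, struct page *src)
{
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
	int ret;

	/* slots_lock guards page_add_src across the entire sequence. */
	lockdep_assert_held(&kvm->slots_lock);

	/* Arm the scratch field before taking the fault path. */
	kvm_tdx->page_add_src = src;

	/*
	 * Mapping the private pfn invokes the S-EPT callback, which
	 * consumes page_add_src via TDH.MEM.PAGE.ADD.
	 */
	ret = kvm_tdp_mmu_map_private_pfn(vcpu, gfn, pfn);

	/* Disarm unconditionally; on failure the M-EPT entry was reverted. */
	kvm_tdx->page_add_src = NULL;

	return ret;
}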
>  arch/x86/kvm/vmx/tdx.c | 116 ++++++++++++++---------------------------
>  arch/x86/kvm/vmx/tdx.h |   8 ++-
>  2 files changed, 46 insertions(+), 78 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index bc92e87a1dbb..00c3dc376690 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1586,6 +1586,32 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
>  	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
>  }
>
> +static int tdx_mem_page_add(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> +			    kvm_pfn_t pfn)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	u64 err, entry, level_state;
> +	gpa_t gpa = gfn_to_gpa(gfn);
> +
> +	lockdep_assert_held(&kvm->slots_lock);
> +
> +	if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm) ||
> +	    KVM_BUG_ON(!kvm_tdx->page_add_src, kvm))
> +		return -EIO;
> +
> +	err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
> +			       kvm_tdx->page_add_src, &entry, &level_state);
> +	if (unlikely(tdx_operand_busy(err)))
> +		return -EBUSY;
> +
> +	if (KVM_BUG_ON(err, kvm)) {
> +		pr_tdx_error_2(TDH_MEM_PAGE_ADD, err, entry, level_state);
> +		return -EIO;
> +	}
> +
> +	return 0;
> +}
> +
>  static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
>  			    enum pg_level level, kvm_pfn_t pfn)
>  {
> @@ -1627,19 +1653,10 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
>
>  	/*
>  	 * If the TD isn't finalized/runnable, then userspace is initializing
> -	 * the VM image via KVM_TDX_INIT_MEM_REGION.  Increment the number of
> -	 * pages that need to be mapped and initialized via TDH.MEM.PAGE.ADD.
> -	 * KVM_TDX_FINALIZE_VM checks the counter to ensure all mapped pages
> -	 * have been added to the image, to prevent running the TD with a
> -	 * valid mapping in the mirror EPT, but not in the S-EPT.
> +	 * the VM image via KVM_TDX_INIT_MEM_REGION; ADD the page to the TD.
>  	 */
> -	if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) {
> -		if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
> -			return -EIO;
> -
> -		atomic64_inc(&kvm_tdx->nr_premapped);
> -		return 0;
> -	}
> +	if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE))
> +		return tdx_mem_page_add(kvm, gfn, level, pfn);
>
>  	return tdx_mem_page_aug(kvm, gfn, level, pfn);
>  }
> @@ -1665,39 +1682,6 @@ static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
>  	return 0;
>  }
>
> -/*
> - * Check if the error returned from a SEPT zap SEAMCALL is due to that a page is
> - * mapped by KVM_TDX_INIT_MEM_REGION without tdh_mem_page_add() being called
> - * successfully.
> - *
> - * Since tdh_mem_sept_add() must have been invoked successfully before a
> - * non-leaf entry present in the mirrored page table, the SEPT ZAP related
> - * SEAMCALLs should not encounter err TDX_EPT_WALK_FAILED.  They should instead
> - * find TDX_EPT_ENTRY_STATE_INCORRECT due to an empty leaf entry found in the
> - * SEPT.
> - *
> - * Further check if the returned entry from SEPT walking is with RWX permissions
> - * to filter out anything unexpected.
> - *
> - * Note: @level is pg_level, not the tdx_level.  The tdx_level extracted from
> - * level_state returned from a SEAMCALL error is the same as that passed into
> - * the SEAMCALL.
> - */
> -static int tdx_is_sept_zap_err_due_to_premap(struct kvm_tdx *kvm_tdx, u64 err,
> -					     u64 entry, int level)
> -{
> -	if (!err || kvm_tdx->state == TD_STATE_RUNNABLE)
> -		return false;
> -
> -	if (err != (TDX_EPT_ENTRY_STATE_INCORRECT | TDX_OPERAND_ID_RCX))
> -		return false;
> -
> -	if ((is_last_spte(entry, level) && (entry & VMX_EPT_RWX_MASK)))
> -		return false;
> -
> -	return true;
> -}
> -
>  static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
>  				     enum pg_level level, struct page *page)
>  {
> @@ -1717,12 +1701,6 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
>  		err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
>  		tdx_no_vcpus_enter_stop(kvm);
>  	}
> -	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level)) {
> -		if (KVM_BUG_ON(atomic64_dec_return(&kvm_tdx->nr_premapped) < 0, kvm))
> -			return -EIO;
> -
> -		return 0;
> -	}
>
>  	if (KVM_BUG_ON(err, kvm)) {
>  		pr_tdx_error_2(TDH_MEM_RANGE_BLOCK, err, entry, level_state);
> @@ -2827,12 +2805,6 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
>
>  	if (!is_hkid_assigned(kvm_tdx) || kvm_tdx->state == TD_STATE_RUNNABLE)
>  		return -EINVAL;
> -	/*
> -	 * Pages are pending for KVM_TDX_INIT_MEM_REGION to issue
> -	 * TDH.MEM.PAGE.ADD().
> -	 */
> -	if (atomic64_read(&kvm_tdx->nr_premapped))
> -		return -EINVAL;
>
>  	cmd->hw_error = tdh_mr_finalize(&kvm_tdx->td);
>  	if (tdx_operand_busy(cmd->hw_error))
> @@ -3116,11 +3088,14 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>  {
>  	struct tdx_gmem_post_populate_arg *arg = _arg;
>  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> -	u64 err, entry, level_state;
>  	gpa_t gpa = gfn_to_gpa(gfn);
> +	u64 err, entry, level_state;

Fine, but why?

>  	struct page *src_page;
>  	int ret, i;
>
> +	if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm))
> +		return -EIO;
> +
>  	/*
>  	 * Get the source page if it has been faulted in. Return failure if the
>  	 * source page has been swapped out or unmapped in primary memory.
> @@ -3131,22 +3106,14 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>  	if (ret != 1)
>  		return -ENOMEM;
>
> +	kvm_tdx->page_add_src = src_page;
>  	ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn);
> -	if (ret < 0)
> -		goto out;
> +	kvm_tdx->page_add_src = NULL;
>
> -	ret = 0;
> -	err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
> -			       src_page, &entry, &level_state);
> -	if (err) {
> -		ret = unlikely(tdx_operand_busy(err)) ?
> -			-EBUSY : -EIO;
> -		goto out;
> -	}
> +	put_page(src_page);
>
> -	KVM_BUG_ON(atomic64_dec_return(&kvm_tdx->nr_premapped) < 0, kvm);
> -
> -	if (!(arg->flags & KVM_TDX_MEASURE_MEMORY_REGION))
> -		goto out;
> +	if (ret || !(arg->flags & KVM_TDX_MEASURE_MEMORY_REGION))
> +		return ret;
>
>  	/*
>  	 * Note, MR.EXTEND can fail if the S-EPT mapping is somehow removed
> @@ -3159,14 +3126,11 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>  		err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry, &level_state);
>  		if (KVM_BUG_ON(err, kvm)) {
>  			pr_tdx_error_2(TDH_MR_EXTEND, err, entry, level_state);
> -			ret = -EIO;
> -			goto out;
> +			return -EIO;
>  		}
>  	}
>
> -out:
> -	put_page(src_page);
> -	return ret;
> +	return 0;
>  }
>
>  static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index ca39a9391db1..1b00adbbaf77 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -36,8 +36,12 @@ struct kvm_tdx {
>
>  	struct tdx_td td;
>
> -	/* For KVM_TDX_INIT_MEM_REGION. */
> -	atomic64_t nr_premapped;
> +	/*
> +	 * Scratch pointer used to pass the source page to tdx_mem_page_add.
> +	 * Protected by slots_lock, and non-NULL only when mapping a private
> +	 * pfn via tdx_gmem_post_populate().
> +	 */
> +	struct page *page_add_src;
>
>  	/*
>  	 * Prevent vCPUs from TD entry to ensure SEPT zap related SEAMCALLs do
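Also, on the "Note #2" measurement cost: TDH.MR.EXTEND digests the page
in 256-byte chunks, which is where the ~16 SEAMCALLs per 4KiB page come
from. In sketch form (mirroring the loop in tdx_gmem_post_populate();
extend_page_sketch and TDX_EXTEND_CHUNK are illustrative names, not
from the patch):

#define TDX_EXTEND_CHUNK	256	/* bytes measured per TDH.MR.EXTEND call */

static int extend_page_sketch(struct kvm *kvm, struct kvm_tdx *kvm_tdx,
			      gpa_t gpa)
{
	u64 err, entry, level_state;
	int i;

	/* 4096 / 256 = 16 SEAMCALLs to measure one 4KiB page. */
	for (i = 0; i < PAGE_SIZE; i += TDX_EXTEND_CHUNK) {
		err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry, &level_state);
		if (KVM_BUG_ON(err, kvm)) {
			pr_tdx_error_2(TDH_MR_EXTEND, err, entry, level_state);
			return -EIO;
		}
	}

	return 0;
}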