On Wed, Sep 03, 2025 at 03:36:49PM +0800, Binbin Wu wrote: > > > On 8/7/2025 5:44 PM, Yan Zhao wrote: > > TDX requires guests to accept S-EPT mappings created by the host KVM. Due > > to the current implementation of the TDX module, if a guest accepts a GFN > > at a lower level after KVM maps it at a higher level, the TDX module will > > emulate an EPT violation VMExit to KVM instead of returning a size mismatch > > error to the guest. If KVM fails to perform page splitting in the VMExit > > handler, the guest's accept operation will be triggered again upon > > re-entering the guest, causing a repeated EPT violation VMExit. > > > > The TDX module thus enables the EPT violation VMExit to carry the guest's > > accept level when the VMExit is caused by the guest's accept operation. > > > > Therefore, in TDX's EPT violation handler > > (1) Set the guest inhibit bit in the lpage info to prevent KVM MMU core > > from mapping at a higher a level than the guest's accept level. > > > > (2) Split any existing huge mapping at the fault GFN to avoid unsupported > > splitting under the shared mmu_lock by TDX. > > > > Use write mmu_lock to pretect (1) and (2) for now. If future KVM TDX can > > perform the actual splitting under shared mmu_lock with enhanced TDX > > modules, (1) is possible to be called under shared mmu_lock, and (2) would > > become unnecessary. > > The description for (1) and (2) reversed? No. After supporting splitting under shared mmu_lock, - setting guest inhibit bit can be performed under shared mmu_lock. (*) - splitting existing huge mapping under write mmu_lock here would be unnecessary. (*) is still required to convey the info of which max level the guest requires. (as explained in "Open 1: How to pass guest's ACCEPT level info" in the cover letter). > > As an optimization, this patch calls hugepage_test_guest_inhibit() without > > holding the mmu_lock to reduce the frequency of acquiring the write > > mmu_lock. The write mmu_lock is thus only acquired if the guest inhibit bit > > is not already set. This is safe because the guest inhibit bit is set in a > > one-way manner while the splitting under the write mmu_lock is performed > > before setting the guest inhibit bit. > > > > Link: https://lore.kernel.org/all/a6ffe23fb97e64109f512fa43e9f6405236ed40a.camel@xxxxxxxxx > > Suggested-by: Rick Edgecombe <rick.p.edgecombe@xxxxxxxxx> > > Suggested-by: Sean Christopherson <seanjc@xxxxxxxxxx> > > Signed-off-by: Yan Zhao <yan.y.zhao@xxxxxxxxx> > > --- > > RFC v2 > > - Change tdx_get_accept_level() to tdx_check_accept_level(). > > - Invoke kvm_split_cross_boundary_leafs() and hugepage_set_guest_inhibit() > > to change KVM mapping level in a global way according to guest accept > > level. (Rick, Sean). > > > > RFC v1: > > - Introduce tdx_get_accept_level() to get guest accept level. > > - Use tdx->violation_request_level and tdx->violation_gfn* to pass guest > > accept level to tdx_gmem_private_max_mapping_level() to detemine KVM > > mapping level. > > --- > > arch/x86/kvm/vmx/tdx.c | 50 +++++++++++++++++++++++++++++++++++++ > > arch/x86/kvm/vmx/tdx_arch.h | 3 +++ > > 2 files changed, 53 insertions(+) > > > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c > > index 035d81275be4..71115058e5e6 100644 > > --- a/arch/x86/kvm/vmx/tdx.c > > +++ b/arch/x86/kvm/vmx/tdx.c > > @@ -2019,6 +2019,53 @@ static inline bool tdx_is_sept_violation_unexpected_pending(struct kvm_vcpu *vcp > > return !(eq & EPT_VIOLATION_PROT_MASK) && !(eq & EPT_VIOLATION_EXEC_FOR_RING3_LIN); > > } > > +static inline int tdx_check_accept_level(struct kvm_vcpu *vcpu, gfn_t gfn) > > +{ > > + struct kvm_memory_slot *slot = gfn_to_memslot(vcpu->kvm, gfn); > > + struct vcpu_tdx *tdx = to_tdx(vcpu); > > + struct kvm *kvm = vcpu->kvm; > > + u64 eeq_type, eeq_info; > > + int level = -1; > > + > > + if (!slot) > > + return 0; > > + > > + eeq_type = tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK; > > + if (eeq_type != TDX_EXT_EXIT_QUAL_TYPE_ACCEPT) > > + return 0; > > + > > + eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >> > > + TDX_EXT_EXIT_QUAL_INFO_SHIFT; > > + > > + level = (eeq_info & GENMASK(2, 0)) + 1; > > + > > + if (level == PG_LEVEL_4K || level == PG_LEVEL_2M) { > > + if (!hugepage_test_guest_inhibit(slot, gfn, level + 1)) { > > + gfn_t base_gfn = gfn_round_for_level(gfn, level); > > + struct kvm_gfn_range gfn_range = { > > + .start = base_gfn, > > + .end = base_gfn + KVM_PAGES_PER_HPAGE(level), > > + .slot = slot, > > + .may_block = true, > > + .attr_filter = KVM_FILTER_PRIVATE, > > + }; > > + > > + scoped_guard(write_lock, &kvm->mmu_lock) { > > + int ret; > > + > > + ret = kvm_split_cross_boundary_leafs(kvm, &gfn_range, false); > > + if (ret) > > + return ret; > > kvm_split_cross_boundary_leafs() calls kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs(), which could return flush as 1 if any of the huge page crossing boundary is split, return directly when ret is non-zero seems not right. Also, the TLB flush should also be taken care because in kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs(), TLB flush is only done for negative return value. Oh, good catch! I forgot about the 2 facts. Will fix them. > > + > > + hugepage_set_guest_inhibit(slot, gfn, level + 1); > > + if (level == PG_LEVEL_4K) > > + hugepage_set_guest_inhibit(slot, gfn, level + 2); > > + } > > + } > > + } > > + return 0; > > +} > > + > > static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu) > > { > > unsigned long exit_qual; > > @@ -2044,6 +2091,9 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu) > > */ > > exit_qual = EPT_VIOLATION_ACC_WRITE; > > + if (tdx_check_accept_level(vcpu, gpa_to_gfn(gpa))) > > + return RET_PF_RETRY; > > + > > /* Only private GPA triggers zero-step mitigation */ > > local_retry = true; > > } else { > > diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h > > index a30e880849e3..af006a73ee05 100644 > > --- a/arch/x86/kvm/vmx/tdx_arch.h > > +++ b/arch/x86/kvm/vmx/tdx_arch.h > > @@ -82,7 +82,10 @@ struct tdx_cpuid_value { > > #define TDX_TD_ATTR_PERFMON BIT_ULL(63) > > #define TDX_EXT_EXIT_QUAL_TYPE_MASK GENMASK(3, 0) > > +#define TDX_EXT_EXIT_QUAL_TYPE_ACCEPT 1 > > #define TDX_EXT_EXIT_QUAL_TYPE_PENDING_EPT_VIOLATION 6 > > +#define TDX_EXT_EXIT_QUAL_INFO_MASK GENMASK(63, 32) > > +#define TDX_EXT_EXIT_QUAL_INFO_SHIFT 32 > > /* > > * TD_PARAMS is provided as an input to TDH_MNG_INIT, the size of which is 1024B. > > */ >