On Tue, Jul 22, 2025, Fuad Tabba wrote:
> On Tue, 22 Jul 2025 at 06:36, Xiaoyao Li <xiaoyao.li@xxxxxxxxx> wrote:
> > - In 0010-KVM-x86-mmu-Rename-.private_max_mapping_level-to-.gm.patch,
> > there is a double "gmem" in the names of vmx/vt's callback implementations:
> >
> >     vt_gmem_gmem_max_mapping_level
> >     tdx_gmem_gmem_max_mapping_level
> >     vt_op_tdx_only(gmem_gmem_max_mapping_level)
>
> Sean's patches do that, then he fixes it in a later patch. I'll fix
> this at the source.

Dagnabbit.  I goofed a search+replace, caught it when re-reading things, and
fixed up the wrong commit.  Sorry :-(

> > - In 0013-KVM-x86-mmu-Extend-guest_memfd-s-max-mapping-level-t.patch,
> > kvm_x86_call(gmem_max_mapping_level)(...) returns 0 for the !private case.
> > That's not correct, though it works without issue currently.
> >
> > Because current gmem doesn't support hugepages, the max_level gotten from
> > gmem is always PG_LEVEL_4K, and kvm_gmem_max_mapping_level() returns early
> > on
> >
> >     if (max_level == PG_LEVEL_4K)
> >         return max_level;
> >
> > But just look at the following case:
> >
> >     return min(max_level,
> >                kvm_x86_call(gmem_max_mapping_level)(kvm, pfn, is_private));
> >
> > For the non-TDX and non-SNP cases, it will return 0, i.e. PG_LEVEL_NONE,
> > eventually.
> >
> > So either 1) return PG_LEVEL_NUM/PG_LEVEL_1G for the cases where the
> > .gmem_max_mapping_level callback doesn't have a specific restriction,
> >
> > or 2)
> >
> >     tmp = kvm_x86_call(gmem_max_mapping_level)(kvm, pfn, is_private);
> >     if (tmp)
> >         return min(max_level, tmp);
> >
> >     return max_level;
>
> Sean? What do you think?

#2, because KVM uses a "ret0" static call when TDX is disabled (and KVM should
do the same when SEV is disabled, but the SEV #ifdefs are still a bit messy).
Switching to any other value would require adding VMX stubs for the !TDX case.
(A rough sketch of the "ret0" plumbing is appended after the diff.)

I think it makes sense to explicitly call that out as the "CoCo level", to help
unfamiliar readers understand why vendor code has any say in the max mapping
level.

And I would say we adjust max_level instead of having an early return, e.g. to
reduce the probability of future bugs due to adding code between the call to
.gmem_max_mapping_level() and the final return.

This as fixup?

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index eead5dca6f72..a51013e0992a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3279,9 +3279,9 @@ static u8 kvm_gmem_max_mapping_level(struct kvm *kvm, struct kvm_page_fault *fau
 					 const struct kvm_memory_slot *slot, gfn_t gfn,
 					 bool is_private)
 {
+	u8 max_level, coco_level;
 	struct page *page;
 	kvm_pfn_t pfn;
-	u8 max_level;
 
 	/* For faults, use the gmem information that was resolved earlier. */
 	if (fault) {
@@ -3305,8 +3305,16 @@ static u8 kvm_gmem_max_mapping_level(struct kvm *kvm, struct kvm_page_fault *fau
 	if (max_level == PG_LEVEL_4K)
 		return max_level;
 
-	return min(max_level,
-		   kvm_x86_call(gmem_max_mapping_level)(kvm, pfn, is_private));
+	/*
+	 * CoCo may influence the max mapping level, e.g. due to RMP or S-EPT
+	 * restrictions.  A return of '0' means "no additional restrictions",
+	 * to allow for using an optional "ret0" static call.
+	 */
+	coco_level = kvm_x86_call(gmem_max_mapping_level)(kvm, pfn, is_private);
+	if (coco_level)
+		max_level = min(max_level, coco_level);
+
+	return max_level;
 }
 
 int kvm_mmu_max_mapping_level(struct kvm *kvm, struct kvm_page_fault *fault,
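
For anyone unfamiliar with the "ret0" plumbing referenced above, here's a rough
sketch.  This is not part of the fixup; the op name assumes the rename from
this series has landed, and the snippets paraphrase the existing declarations
in kvm-x86-ops.h and kvm_host.h:

	/*
	 * Sketch only: declaring the hook as an optional op with a
	 * zero-returning default means the static call behind kvm_x86_call()
	 * is patched to simply return 0 when no vendor implementation is
	 * registered (e.g. VMX with TDX disabled), which is why '0' can
	 * safely mean "no additional restrictions" in the hunk above.
	 */
	KVM_X86_OP_OPTIONAL_RET0(gmem_max_mapping_level)

	/* kvm_x86_call() is a thin wrapper around the static call machinery. */
	#define kvm_x86_call(func)	static_call(kvm_x86_##func)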