On 7/22/2025 7:17 AM, Sean Christopherson wrote:
On Fri, Jul 18, 2025, Xiaoyao Li wrote:
On 7/18/2025 12:27 AM, Fuad Tabba wrote:
From: Ackerley Tng <ackerleytng@xxxxxxxxxx>
Refactor kvm_max_private_mapping_level() to accept a NULL kvm_page_fault
pointer and rename it to kvm_gmem_max_mapping_level().
The max_mapping_level x86 operation (previously private_max_mapping_level)
is designed to potentially be called without an active page fault, for
instance, when kvm_mmu_max_mapping_level() is determining the maximum
mapping level for a gfn proactively.
Allow NULL fault pointer: Modify kvm_max_private_mapping_level() to
safely handle a NULL fault argument. This aligns its interface with the
kvm_x86_ops.max_mapping_level operation it wraps, which can also be
called with NULL.
Are you sure about that?
Patch 09 just added the check of fault->is_private for TDX and SEV.
+1, this isn't quite right. That's largely my fault (no pun intended) though, as
I suggested the basic gist of the NULL @fault handling, and it's a mess. More at
the bottom.
Rename function to kvm_gmem_max_mapping_level(): This reinforces that
the function's scope is for guest_memfd-backed memory, which can be
either private or non-private, removing any remaining "private"
connotation from its name.
Optimize max_level checks: Introduce a check in the caller to skip
querying for max_mapping_level if the current max_level is already
PG_LEVEL_4K, as no further reduction is possible.
Acked-by: David Hildenbrand <david@xxxxxxxxxx>
Suggested-by: Sean Christopherson <seanjc@xxxxxxxxxx>
Signed-off-by: Ackerley Tng <ackerleytng@xxxxxxxxxx>
Signed-off-by: Fuad Tabba <tabba@xxxxxxxxxx>
---
arch/x86/kvm/mmu/mmu.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index bb925994cbc5..6bd28fda0fd3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4467,17 +4467,13 @@ static inline u8 kvm_max_level_for_order(int order)
return PG_LEVEL_4K;
}
-static u8 kvm_max_private_mapping_level(struct kvm *kvm,
- struct kvm_page_fault *fault,
- int gmem_order)
+static u8 kvm_gmem_max_mapping_level(struct kvm *kvm, int order,
+ struct kvm_page_fault *fault)
{
- u8 max_level = fault->max_level;
u8 req_max_level;
+ u8 max_level;
- if (max_level == PG_LEVEL_4K)
- return PG_LEVEL_4K;
-
- max_level = min(kvm_max_level_for_order(gmem_order), max_level);
+ max_level = kvm_max_level_for_order(order);
if (max_level == PG_LEVEL_4K)
return PG_LEVEL_4K;
@@ -4513,7 +4509,9 @@ static int kvm_mmu_faultin_pfn_private(struct kvm_vcpu *vcpu,
}
fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
- fault->max_level = kvm_max_private_mapping_level(vcpu->kvm, fault, max_order);
+ if (fault->max_level >= PG_LEVEL_4K)
+ fault->max_level = kvm_gmem_max_mapping_level(vcpu->kvm,
+ max_order, fault);
I cannot understand why this change is required. In what case will
fault->max_level < PG_LEVEL_4K?
Yeah, I don't get this code either. I also don't think KVM should call
kvm_gmem_max_mapping_level() *here*. That's mostly a problem with my suggested
NULL @fault handling. Dealing with kvm_gmem_max_mapping_level() here leads to
weirdness, because kvm_gmem_max_mapping_level() also needs to be invoked for the
!fault path, and then we end up with multiple call sites and the potential for a
redundant call (gmem only, is private).
Looking through surrounding patches, the ordering of things is also "off".
"Generalize private_max_mapping_level x86 op to max_mapping_level" should just
rename the helper; reacting to !is_private memory in TDX belongs in "Consult
guest_memfd when computing max_mapping_level", because that's where KVM plays
nice with non-private memory.
But that patch is also doing too much, e.g. shuffling code around and short-circuiting
the non-fault case, which makes it confusing and hard to review. Extending gmem
hugepage support to shared memory should be "just" this:
@@ -3335,8 +3336,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm, struct kvm_page_fault *fault,
if (max_level == PG_LEVEL_4K)
return PG_LEVEL_4K;
- if (is_private)
- host_level = kvm_max_private_mapping_level(kvm, fault, slot, gfn);
+ if (is_private || kvm_memslot_is_gmem_only(slot))
+ host_level = kvm_gmem_max_mapping_level(kvm, fault, slot, gfn,
+ is_private);
else
host_level = host_pfn_mapping_level(kvm, gfn, slot);
return min(host_level, max_level);
plus the plumbing and the small TDX change. All the renames and code shuffling
should be done in prep patches.
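Roughly (untested sketch, not the attached patch, and naming per the rename
prep patch), the TDX change amounts to the callback reacting to is_private
instead of assuming the mapping is private:

	int tdx_gmem_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn, bool is_private)
	{
		/* No TDX-specific restriction on shared (non-private) mappings. */
		if (!is_private)
			return 0;

		/* TDX currently maps private memory with 4K pages only. */
		return PG_LEVEL_4K;
	}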
The attached patches are compile-tested only, but I think they get us where we
want to be, and without my confusing suggestion to try and punt on private mappings
in the hugepage recovery paths. They should slot in at the right patch numbers
(relative to v15).
Holler if the patches don't work, I'm happy to help sort things out so that v16
is ready to go.
The attached patches work well, but I have some feedback:
- In 0010-KVM-x86-mmu-Rename-.private_max_mapping_level-to-.gm.patch,
there is a doubled "gmem" in the names of the vmx/vt callback implementations:
vt_gmem_gmem_max_mapping_level
tdx_gmem_gmem_max_mapping_level
vt_op_tdx_only(gmem_gmem_max_mapping_level)
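presumably these were meant to be:
vt_gmem_max_mapping_level
tdx_gmem_max_mapping_level
vt_op_tdx_only(gmem_max_mapping_level)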
- In 0013-KVM-x86-mmu-Extend-guest_memfd-s-max-mapping-level-t.patch,
kvm_x86_call(gmem_max_mapping_level)(...) returns 0 for the !private case.
That's not correct, though it happens to work today: gmem doesn't support
hugepages yet, so the max_level obtained from gmem is always PG_LEVEL_4K
and kvm_gmem_max_mapping_level() returns early on

	if (max_level == PG_LEVEL_4K)
		return max_level;

But look at the case where that early return isn't taken:

	return min(max_level,
		   kvm_x86_call(gmem_max_mapping_level)(kvm, pfn, is_private));

For the non-TDX, non-SNP case the op returns 0, so the min() eventually
collapses to PG_LEVEL_NONE.
So either 1) return PG_LEVEL_NUM/PG_LEVEL_1G in the cases where the
.gmem_max_mapping_level callback doesn't impose a specific restriction
(see the sketch after option 2),

or 2) treat 0 as "no restriction" at the call site:

	tmp = kvm_x86_call(gmem_max_mapping_level)(kvm, pfn, is_private);
	if (tmp)
		return min(max_level, tmp);

	return max_level;
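For option 1, a minimal sketch (illustrative only, not the attached patches;
it assumes the intended non-doubled callback name and TDX's current 4K-only
private mappings) of what the TDX callback could look like:

	/*
	 * Report "no vendor-specific restriction" as PG_LEVEL_NUM instead of 0,
	 * so min(max_level, ...) at the call site can't collapse to PG_LEVEL_NONE.
	 */
	int tdx_gmem_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn, bool is_private)
	{
		/* No TDX-specific restriction on shared (non-private) mappings. */
		if (!is_private)
			return PG_LEVEL_NUM;

		/* TDX currently maps private memory with 4K pages only. */
		return PG_LEVEL_4K;
	}

The optional op's default (RET0) would need similar treatment for VMs that
don't implement the callback at all, otherwise the non-TDX/non-SNP case
still sees 0.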