Sean Christopherson <seanjc@xxxxxxxxxx> writes:

> On Thu, Jul 24, 2025, Ackerley Tng wrote:
>> Fuad Tabba <tabba@xxxxxxxxxx> writes:
>> >  int kvm_mmu_max_mapping_level(struct kvm *kvm, struct kvm_page_fault *fault,
>> > @@ -3362,8 +3371,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm, struct kvm_page_fault *fault,
>> >  	if (max_level == PG_LEVEL_4K)
>> >  		return PG_LEVEL_4K;
>> >
>> > -	if (is_private)
>> > -		host_level = kvm_max_private_mapping_level(kvm, fault, slot, gfn);
>> > +	if (is_private || kvm_memslot_is_gmem_only(slot))
>> > +		host_level = kvm_gmem_max_mapping_level(kvm, fault, slot, gfn,
>> > +							is_private);
>> >  	else
>> >  		host_level = host_pfn_mapping_level(kvm, gfn, slot);
>>
>> No change required now, but I'd like to point out that this change
>> assumes that, for a slot where kvm_memslot_is_gmem_only() is true,
>> guest_memfd will be the only source of truth, even for shared pages.
>
> It's not an assumption, it's a hard requirement.
>
>> This holds now because shared pages are always split to 4K, but if
>> shared pages become larger, might the mapping in the host actually
>> turn out to be smaller?
>
> Yes, the host userspace mappings could be smaller, and supporting that
> scenario is very explicitly one of the design goals of guest_memfd.
> From commit a7800aa80ea4 ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for
> guest-specific backing memory"):
>
> : A guest-first memory subsystem allows for optimizations and enhancements
> : that are kludgy or outright infeasible to implement/support in a generic
> : memory subsystem. With guest_memfd, guest protections and mapping sizes
> : are fully decoupled from host userspace mappings. E.g. KVM currently
> : doesn't support mapping memory as writable in the guest without it also
> : being writable in host userspace, as KVM's ABI uses VMA protections to
> : define the allow guest protection. Userspace can fudge this by
> : establishing two mappings, a writable mapping for the guest and readable
> : one for itself, but that’s suboptimal on multiple fronts.
> :
> : Similarly, KVM currently requires the guest mapping size to be a strict
> : subset of the host userspace mapping size, e.g. KVM doesn’t support
> : creating a 1GiB guest mapping unless userspace also has a 1GiB guest
> : mapping. Decoupling the mappings sizes would allow userspace to precisely
> : map only what is needed without impacting guest performance, e.g. to
> : harden against unintentional accesses to guest memory.

Let me try to understand this better.

If/when guest_memfd supports larger folios for shared pages, and
guest_memfd returns a 2M folio from kvm_gmem_fault_shared(), can the
mapping in host userspace turn out to be 4K?

If that happens, should kvm_gmem_max_mapping_level() return 4K for a
memslot with kvm_memslot_is_gmem_only() == true? The above code would
skip host_pfn_mapping_level() and return just what guest_memfd reports,
which is 2M.

Or do you mean that guest_memfd will be the source of truth in that it
must also know/control, in the above scenario, that the host mapping is
also 2M?