Sean Christopherson <seanjc@xxxxxxxxxx> writes:

> On Fri, Jul 25, 2025, Ackerley Tng wrote:
>> Sean Christopherson <seanjc@xxxxxxxxxx> writes:
>>
>> > On Thu, Jul 24, 2025, Ackerley Tng wrote:
>> >> Fuad Tabba <tabba@xxxxxxxxxx> writes:
>> >> >  int kvm_mmu_max_mapping_level(struct kvm *kvm, struct kvm_page_fault *fault,
>> >> > @@ -3362,8 +3371,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm, struct kvm_page_fault *fault,
>> >> >  	if (max_level == PG_LEVEL_4K)
>> >> >  		return PG_LEVEL_4K;
>> >> >
>> >> > -	if (is_private)
>> >> > -		host_level = kvm_max_private_mapping_level(kvm, fault, slot, gfn);
>> >> > +	if (is_private || kvm_memslot_is_gmem_only(slot))
>> >> > +		host_level = kvm_gmem_max_mapping_level(kvm, fault, slot, gfn,
>> >> > +							is_private);
>> >> >  	else
>> >> >  		host_level = host_pfn_mapping_level(kvm, gfn, slot);
>> >>
>> >> No change required now, but I would like to point out that in this change
>> >> there's a bit of an assumption that if kvm_memslot_is_gmem_only(), even for
>> >> shared pages, guest_memfd will be the only source of truth.
>> >
>> > It's not an assumption, it's a hard requirement.
>> >
>> >> This holds now because shared pages are always split to 4K, but if
>> >> shared pages become larger, might the mapping in the host actually turn
>> >> out to be smaller?
>> >
>> > Yes, the host userspace mappings could be smaller, and supporting that scenario is
>> > very explicitly one of the design goals of guest_memfd. From commit a7800aa80ea4
>> > ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory"):
>> >
>> > : A guest-first memory subsystem allows for optimizations and enhancements
>> > : that are kludgy or outright infeasible to implement/support in a generic
>> > : memory subsystem. With guest_memfd, guest protections and mapping sizes
>> > : are fully decoupled from host userspace mappings. E.g. KVM currently
>> > : doesn't support mapping memory as writable in the guest without it also
>> > : being writable in host userspace, as KVM's ABI uses VMA protections to
>> > : define the allow guest protection. Userspace can fudge this by
>> > : establishing two mappings, a writable mapping for the guest and readable
>> > : one for itself, but that's suboptimal on multiple fronts.
>> > :
>> > : Similarly, KVM currently requires the guest mapping size to be a strict
>> > : subset of the host userspace mapping size, e.g. KVM doesn't support
>> > : creating a 1GiB guest mapping unless userspace also has a 1GiB guest
>> > : mapping. Decoupling the mappings sizes would allow userspace to precisely
>> > : map only what is needed without impacting guest performance, e.g. to
>> > : harden against unintentional accesses to guest memory.
>>
>> Let me try to understand this better. If/when guest_memfd supports
>> larger folios for shared pages, and guest_memfd returns a 2M folio from
>> kvm_gmem_fault_shared(), can the mapping in host userspace turn out
>> to be 4K?
>
> It can be 2M, 4K, or none.
>
>> If that happens, should kvm_gmem_max_mapping_level() return 4K for a
>> memslot with kvm_memslot_is_gmem_only() == true?
>
> No.
>
>> The above code would skip host_pfn_mapping_level() and return just what
>> guest_memfd reports, which is 2M.
>
> Yes.
>
>> Or do you mean that guest_memfd will be the source of truth in that it
>> must also know/control, in the above scenario, that the host mapping is
>> also 2M?
>
> No. The userspace mapping, _if_ there is one, is completely irrelevant. The
> entire point of guest_memfd is to eliminate the requirement that memory be
> mapped into host userspace in order for that memory to be mapped into the
> guest.
>
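Just to make sure I'm picturing the decoupling correctly, from the userspace
side the scenario we've been discussing could look something like the sketch
below. This is purely illustrative and leans on the mmap() support this
series adds; vm_fd is assumed to be an existing VM fd, and I've hand-waved
away the exact guest_memfd creation flags:

#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

/* Illustrative sketch only, not a real selftest. */
static void *map_shared_gmem(int vm_fd, size_t len)
{
	struct kvm_create_guest_memfd gmem = {
		.size  = len,
		.flags = 0,	/* plus whatever flag this series uses to allow mmap() */
	};
	int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

	/*
	 * Host userspace view of the shared range, faulted in via
	 * kvm_gmem_fault_shared(). guest_memfd may back this with a 2M
	 * folio even if the host side only ever gets 4K PTEs...
	 */
	void *uaddr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
			   gmem_fd, 0);

	/* ...or userspace may tear this mapping down and have none at all. */
	return uaddr;
}

IIUC the guest-side mapping level never has to match (or even have) a
corresponding host userspace mapping.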
If it's not mapped into the host at all, host_pfn_mapping_level() would
default to 4K, and I think that's a safe default.

> Invoking host_pfn_mapping_level() isn't just undesirable, it's flat out wrong, as
> KVM will not verify slot->userspace_addr actually points at the (same) guest_memfd
> instance.
>

That's true too: invoking host_pfn_mapping_level() could return totally
wrong information if slot->userspace_addr points somewhere else entirely.

What if slot->userspace_addr is set up to match the fd+offset of the same
guest_memfd configured in the memslot, and kvm_gmem_max_mapping_level()
returns 2M, but the range is actually mapped into the host at 4K? A little
out of my depth here, but would recovering the mappings to the 2M level be
a problem?

For enforcement of the shared/private-ness of memory, recovering the
mappings to the 2M level is okay, since if some part of the range had been
private, guest_memfd wouldn't have returned 2M. As for alignment, if
guest_memfd could return 2M to kvm_gmem_max_mapping_level(), then
userspace_addr would have been 2M-aligned, which would correctly permit
mapping recovery to 2M, so that sounds like it works too.

Maybe the right solution here is that, since slot->userspace_addr need not
point at the same guest_memfd+offset configured in the memslot, when
guest_memfd responds to kvm_gmem_max_mapping_level(), it should check
whether the requested GFN is mapped in host userspace, and if so, return
the smaller of the two mapping levels. (Rough sketch at the end of this
mail.)

> To demonstrate, this must pass (and does once "KVM: x86/mmu: Handle guest page
> faults for guest_memfd with shared memory" is added back).
>

Makes sense :)

[snip]
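For concreteness, the "return the smaller of the two mapping levels" idea
above is roughly the following. This is a hand-wavy sketch, not real kernel
code: the wrapper name is made up, kvm_gmem_max_mapping_level()'s signature
is taken from the diff quoted above, and per your point it would still need
some way for guest_memfd to confirm that the host-side mapping it consults
actually maps this guest_memfd, rather than trusting slot->userspace_addr:

/*
 * Hypothetical sketch (as if it lived in arch/x86/kvm/mmu/mmu.c): clamp
 * the level guest_memfd reports to what the host userspace mapping, if
 * any, would support.
 */
static int gmem_mapping_level_clamped(struct kvm *kvm,
				      struct kvm_page_fault *fault,
				      const struct kvm_memory_slot *slot,
				      gfn_t gfn, bool is_private)
{
	int level = kvm_gmem_max_mapping_level(kvm, fault, slot, gfn, is_private);

	/*
	 * Private memory is never mapped into host userspace, and the
	 * slot may have no userspace mapping at all, so only consult the
	 * host side when there is something to consult.
	 */
	if (is_private || !slot->userspace_addr || level == PG_LEVEL_4K)
		return level;

	/*
	 * For illustration only: this trusts slot->userspace_addr, which
	 * is exactly what the discussion above says cannot be trusted.
	 */
	return min(level, host_pfn_mapping_level(kvm, gfn, slot));
}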