Sean Christopherson <seanjc@xxxxxxxxxx> writes:

> On Fri, Jul 25, 2025, Ackerley Tng wrote:
>> Sean Christopherson <seanjc@xxxxxxxxxx> writes:
>>
>> > On Fri, Jul 25, 2025, Ackerley Tng wrote:
>> >> Sean Christopherson <seanjc@xxxxxxxxxx> writes:
>> >> > Invoking host_pfn_mapping_level() isn't just undesirable, it's flat out wrong, as
>> >> > KVM will not verify slot->userspace_addr actually points at the (same) guest_memfd
>> >> > instance.
>> >> >
>> >>
>> >> This is true too, that invoking host_pfn_mapping_level() could return
>> >> totally wrong information if slot->userspace_addr points somewhere else
>> >> completely.
>> >>
>> >> What if slot->userspace_addr is set up to match the fd+offset in the
>> >> same guest_memfd, and kvm_gmem_max_mapping_level() returns 2M but it's
>> >> actually mapped into the host at 4K?
>> >>
>> >> A little out of my depth here, but would mappings being recovered to the
>> >> 2M level be a problem?
>> >
>> > No, because again, by design, the host userspace mapping has _zero_ influence on
>> > the guest mapping.
>>
>> Not trying to solve any problem but mostly trying to understand mapping
>> levels better.
>>
>> Before guest_memfd, why does kvm_mmu_max_mapping_level() need to do
>> host_pfn_mapping_level()?
>>
>> Was it about THP folios?
>
> And HugeTLB, and Device DAX, and probably at least one other type of backing at
> this point.
>
> Without guest_memfd, guest mappings are a strict subset of the host userspace
> mappings for the associated address space (i.e. process) (ignoring that the guest
> and host mappings are separate page tables).
>
> When mapping memory into the guest, KVM manages a Secondary MMU (in mmu_notifier
> parlance), where the Primary MMU is managed by mm/, and is for all intents and
> purposes synonymous with the address space of the userspace VMM.
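To make sure I have the "strict subset of the Primary MMU" rule straight, here is how I'd model it as a toy, purely for my own understanding (all names invented, not actual KVM code):

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_PAGES 8

/*
 * Toy model, not KVM code: "primary" stands in for the userspace page
 * tables managed by mm/, "secondary" for KVM's shadow/stage-2 tables.
 * A pfn of 0 means "not mapped".
 */
unsigned long primary[NR_PAGES];
unsigned long secondary[NR_PAGES];

/*
 * Faulting a page into the secondary MMU can only succeed if the page
 * is already mapped in the primary MMU, i.e. the secondary mappings
 * stay a strict subset of the primary's.
 */
bool secondary_fault(size_t idx)
{
	if (!primary[idx])
		return false;	/* no SPTE may point at unmapped memory */
	secondary[idx] = primary[idx];
	return true;
}
```

Obviously the real flow goes through gup/hva_to_pfn() and friends; this is just the invariant.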
>
> To get a pfn to insert into the Secondary MMU's PTEs (SPTE, which was originally
> "shadow PTEs", but has been retrofitted to "secondary PTEs" so that it's not an
> outright lie when using stage-2 page tables), the pfn *must* be faulted into and
> mapped in the Primary MMU. I.e. under no circumstance can a SPTE point at memory
> that isn't mapped into the Primary MMU.
>
> Side note, except for VM_EXEC, protections for Secondary MMU mappings must also
> be a strict subset of the Primary MMU's mappings. E.g. KVM can't create a
> WRITABLE SPTE if the userspace VMA is read-only. EXEC protections are exempt,
> so that guest memory doesn't have to be mapped executable in the VMM, which would
> basically make the VMM a CVE factory :-)
>
> All of that holds true for hugepages as well, because that rule is just a special
> case of the general rule that all memory must be first mapped into the Primary
> MMU. Rather than query the backing store's allowed page size, KVM x86 simply
> looks at the Primary MMU's userspace page tables. Originally, KVM _did_ query
> the VMA directly for HugeTLB, but when things like DAX came along, we realized
> that poking into backing stores directly was going to be a maintenance nightmare.
>
> So instead, KVM was reworked to peek at the userspace page tables for everything,
> and knock wood, that approach has Just Worked for all backing stores.
>
> Which actually highlights the brilliance of having KVM be a Secondary MMU that's
> fully subordinate to the Primary MMU. Modulo some terrible logic with respect to
> VM_PFNMAP and "struct page" that has now been fixed, literally anything that can
> be mapped into the VMM can be mapped into a KVM guest, without KVM needing to
> know *anything* about the underlying memory.
>
> Jumping back to guest_memfd, the main principle of guest_memfd is that it allows
> _KVM_ to be the Primary MMU (mm/ is now becoming another "primary" MMU, but I
> would call KVM 1a and mm/ 1b).
> Instead of the VMM's address space and page
> tables being the source of truth, guest_memfd is the source of truth. And that's
> why I'm so adamant that host_pfn_mapping_level() is completely out of scope for
> guest_memfd; that API _only_ makes sense when KVM is operating as a Secondary MMU.

Thanks! Appreciate the detailed response :) It fits together for me now.
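For my own notes, the two regimes for capping the mapping level, sketched as a toy (function and enum names invented, not the real kvm_mmu_max_mapping_level()):

```c
/*
 * Toy sketch, not KVM code.  Pre-guest_memfd, the max mapping level is
 * capped by whatever the Primary MMU (the userspace page tables) mapped,
 * which is what host_pfn_mapping_level() discovers by peeking at those
 * tables.  With guest_memfd as the source of truth, the host userspace
 * mapping has zero influence; only guest_memfd's own backing size matters.
 */
enum level { PG_4K = 1, PG_2M = 2, PG_1G = 3 };

static inline enum level min_level(enum level a, enum level b)
{
	return a < b ? a : b;
}

/* Legacy: KVM as a Secondary MMU, subordinate to the host mapping. */
enum level max_level_legacy(enum level host_level, enum level req)
{
	return min_level(host_level, req);
}

/* guest_memfd: KVM as the Primary MMU; only guest_memfd's level counts. */
enum level max_level_gmem(enum level gmem_level, enum level req)
{
	return min_level(gmem_level, req);
}
```

So in my earlier 2M-vs-4K question, the host's 4K mapping simply never enters the guest_memfd calculation.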