Re: [PATCH v16 15/22] KVM: x86/mmu: Extend guest_memfd's max mapping level to shared mappings

On Fri, Jul 25, 2025, Ackerley Tng wrote:
> Sean Christopherson <seanjc@xxxxxxxxxx> writes:
> 
> > On Fri, Jul 25, 2025, Ackerley Tng wrote:
> >> Sean Christopherson <seanjc@xxxxxxxxxx> writes:
> >> > Invoking host_pfn_mapping_level() isn't just undesirable, it's flat out wrong, as
> >> > KVM will not verify slot->userspace_addr actually points at the (same) guest_memfd
> >> > instance.
> >> >
> >> 
> >> This is true too, that invoking host_pfn_mapping_level() could return
> >> totally wrong information if slot->userspace_addr points somewhere else
> >> completely.
> >> 
> >> What if slot->userspace_addr is set up to match the fd+offset in the
> >> same guest_memfd, and kvm_gmem_max_mapping_level() returns 2M but it's
> >> actually mapped into the host at 4K?
> >> 
> >> A little out of my depth here, but would mappings being recovered to the
> >> 2M level be a problem?
> >
> > No, because again, by design, the host userspace mapping has _zero_ influence on
> > the guest mapping.
> 
> Not trying to solve any problem but mostly trying to understand mapping
> levels better.
> 
> Before guest_memfd, why does kvm_mmu_max_mapping_level() need to do
> host_pfn_mapping_level()?
> 
> Was it about THP folios?

And HugeTLB, and Device DAX, and probably at least one other type of backing at
this point.

Without guest_memfd, guest mappings are a strict subset of the host userspace
mappings for the associated address space, i.e. process (ignoring that the guest
and host mappings live in separate page tables).

When mapping memory into the guest, KVM manages a Secondary MMU (in mmu_notifier
parlance), where the Primary MMU is managed by mm/, and is for all intents and
purposes synonymous with the address space of the userspace VMM.

To get a pfn to insert into the Secondary MMU's PTEs (SPTE, which was originally
"shadow PTEs", but has been retrofitted to "secondary PTEs" so that it's not an
outright lie when using stage-2 page tables), the pfn *must* be faulted into and
mapped in the Primary MMU.  I.e. under no circumstance can a SPTE point at memory
that isn't mapped into the Primary MMU.

Side note, except for VM_EXEC, protections for Secondary MMU mappings must also
be a strict subset of the Primary MMU's mappings.  E.g. KVM can't create a
WRITABLE SPTE if the userspace VMA is read-only.  EXEC protections are exempt,
so that guest memory doesn't have to be mapped executable in the VMM, which would
basically make the VMM a CVE factory :-)
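
To make the protections rule concrete, here is a toy userspace model of it.  This
is only a sketch, not KVM code: the TOY_* bits and toy_guest_prot_allowed() are
hypothetical names invented for illustration, and the model assumes a plain
"guest permissions minus EXEC must be contained in host permissions" check.

```c
#include <stdbool.h>

/* Hypothetical permission bits for this toy model; not KVM's or mm/'s flags. */
enum {
	TOY_READ  = 1 << 0,
	TOY_WRITE = 1 << 1,
	TOY_EXEC  = 1 << 2,
};

/*
 * Return true if a Secondary MMU (guest) mapping with 'guest_prot' is allowed
 * given a Primary MMU (host userspace) mapping with 'host_prot'.
 */
static bool toy_guest_prot_allowed(unsigned int host_prot, unsigned int guest_prot)
{
	/* EXEC is exempt, so drop it before comparing against the host mapping. */
	unsigned int checked = guest_prot & ~TOY_EXEC;

	/* Nothing mapped in the Primary MMU means nothing can be mapped in the guest. */
	if (!host_prot)
		return false;

	/* Every remaining guest permission must also be granted by the host. */
	return (checked & ~host_prot) == 0;
}
```

With this model, a read-only host mapping permits a READ|EXEC guest mapping but
rejects a READ|WRITE one.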

All of that holds true for hugepages as well, because the hugepage rule is just a
special case of the general rule that all memory must be first mapped into the
Primary MMU.  Rather than query the backing store's allowed page size, KVM x86 simply
looks at the Primary MMU's userspace page tables.  Originally, KVM _did_ query
the VMA directly for HugeTLB, but when things like DAX came along, we realized
that poking into backing stores directly was going to be a maintenance nightmare.

So instead, KVM was reworked to peek at the userspace page tables for everything,
and knock wood, that approach has Just Worked for all backing stores.
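
A rough userspace sketch of that idea, for anyone following along.  This is a toy
model, not the kernel implementation: struct toy_host_walk stands in for the
result of walking the userspace page tables, and all toy_* names are
hypothetical.  The only point is that the guest mapping level is capped by
whatever leaf size the Primary MMU happens to use, regardless of which backing
store produced it.

```c
#include <stdbool.h>

/* Toy mapping levels, using the usual x86 4K/2M/1G naming. */
enum toy_level {
	TOY_LEVEL_4K = 1,
	TOY_LEVEL_2M = 2,
	TOY_LEVEL_1G = 3,
};

/* Hypothetical summary of one walk of the host userspace page tables. */
struct toy_host_walk {
	bool pud_is_leaf;	/* address is covered by a 1G leaf entry */
	bool pmd_is_leaf;	/* address is covered by a 2M leaf entry */
};

/* Report the level at which the Primary MMU maps the address. */
static enum toy_level toy_host_mapping_level(const struct toy_host_walk *walk)
{
	if (walk->pud_is_leaf)
		return TOY_LEVEL_1G;
	if (walk->pmd_is_leaf)
		return TOY_LEVEL_2M;
	return TOY_LEVEL_4K;
}

/* The guest mapping can never be larger than the host userspace mapping. */
static enum toy_level toy_guest_max_level(const struct toy_host_walk *walk,
					  enum toy_level max_level)
{
	enum toy_level host_level = toy_host_mapping_level(walk);

	return host_level < max_level ? host_level : max_level;
}
```

Note that nothing in the sketch asks whether the memory is THP, HugeTLB, or DAX;
the page tables are the only input.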

Which actually highlights the brilliance of having KVM be a Secondary MMU that's
fully subordinate to the Primary MMU.  Modulo some terrible logic with respect to
VM_PFNMAP and "struct page" that has now been fixed, literally anything that can
be mapped into the VMM can be mapped into a KVM guest, without KVM needing to
know *anything* about the underlying memory.

Jumping back to guest_memfd, the main principle of guest_memfd is that it allows
_KVM_ to be the Primary MMU (mm/ is now becoming another "primary" MMU, but I
would call KVM 1a and mm/ 1b).  Instead of the VMM's address space and page
tables being the source of truth, guest_memfd is the source of truth.  And that's
why I'm so adamant that host_pfn_mapping_level() is completely out of scope for
guest_memfd; that API _only_ makes sense when KVM is operating as a Secondary MMU.
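
As a toy illustration of the distinction (again a sketch with hypothetical toy_*
names, not the code in this series): when a fault is backed by guest_memfd, the
level reported by guest_memfd is the only input, and the host userspace page
tables are never consulted, even if userspace_addr happens to map the same
memory at a smaller size.

```c
#include <stdbool.h>

/* Toy mapping levels, using the usual x86 4K/2M/1G naming. */
enum toy_level {
	TOY_LEVEL_4K = 1,
	TOY_LEVEL_2M = 2,
	TOY_LEVEL_1G = 3,
};

/* Hypothetical description of a guest fault for this toy model. */
struct toy_fault {
	bool backed_by_gmem;		/* memory comes from guest_memfd */
	enum toy_level gmem_level;	/* level guest_memfd can provide */
	enum toy_level host_pt_level;	/* level in the userspace page tables */
};

static enum toy_level toy_max_mapping_level(const struct toy_fault *fault,
					    enum toy_level max_level)
{
	/*
	 * With guest_memfd as the source of truth, the host userspace mapping
	 * has zero influence.  Only the Secondary MMU case looks at it.
	 */
	enum toy_level src = fault->backed_by_gmem ? fault->gmem_level
						   : fault->host_pt_level;

	return src < max_level ? src : max_level;
}
```

In this model, a guest_memfd fault with gmem_level of 2M and a 4K host userspace
mapping still yields 2M, which is the scenario asked about earlier in the thread.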



