Sean Christopherson <seanjc@xxxxxxxxxx> writes:

> On Fri, Jul 25, 2025, Ackerley Tng wrote:
>> Sean Christopherson <seanjc@xxxxxxxxxx> writes:
>>
>> > On Fri, Jul 25, 2025, Ackerley Tng wrote:
>> >> Sean Christopherson <seanjc@xxxxxxxxxx> writes:
>> >> > Invoking host_pfn_mapping_level() isn't just undesirable, it's flat out wrong, as
>> >> > KVM will not verify slot->userspace_addr actually points at the (same) guest_memfd
>> >> > instance.
>> >> >
>> >>
>> >> This is true too, that invoking host_pfn_mapping_level() could return
>> >> totally wrong information if slot->userspace_addr points somewhere else
>> >> completely.
>> >>
>> >> What if slot->userspace_addr is set up to match the fd+offset in the
>> >> same guest_memfd, and kvm_gmem_max_mapping_level() returns 2M but it's
>> >> actually mapped into the host at 4K?
>> >>
>> >> A little out of my depth here, but would mappings being recovered to the
>> >> 2M level be a problem?
>> >
>> > No, because again, by design, the host userspace mapping has _zero_ influence on
>> > the guest mapping.
>>
>> Not trying to solve any problem but mostly trying to understand mapping
>> levels better.
>>
>> Before guest_memfd, why does kvm_mmu_max_mapping_level() need to do
>> host_pfn_mapping_level()?
>>
>> Was it about THP folios?
>
> And HugeTLB, and Device DAX, and probably at least one other type of backing at
> this point.
>
> Without guest_memfd, guest mappings are a strict subset of the host userspace
> mappings for the associated address space (i.e. process) (ignoring that the guest
> and host mappings are separate page tables).
>
> When mapping memory into the guest, KVM manages a Secondary MMU (in mmu_notifier
> parlance), where the Primary MMU is managed by mm/, and is for all intents and
> purposes synonymous with the address space of the userspace VMM.
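To make sure I have the "strict subset of the Primary MMU" rule straight, here is how I'd model it as a toy, purely for my own understanding (all names invented, not actual KVM code):

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_PAGES 8

/*
 * Toy model, not KVM code: "primary" stands in for the userspace page
 * tables managed by mm/, "secondary" for KVM's shadow/stage-2 tables.
 * A pfn of 0 means "not mapped".
 */
unsigned long primary[NR_PAGES];
unsigned long secondary[NR_PAGES];

/*
 * Faulting a page into the secondary MMU can only succeed if the page
 * is already mapped in the primary MMU, i.e. the secondary mappings
 * stay a strict subset of the primary's.
 */
bool secondary_fault(size_t idx)
{
	if (!primary[idx])
		return false;	/* no SPTE may point at unmapped memory */
	secondary[idx] = primary[idx];
	return true;
}
```

Obviously the real flow goes through gup/hva_to_pfn() and friends; this is just the invariant.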
>
> To get a pfn to insert into the Secondary MMU's PTEs (SPTE, which was originally
> "shadow PTEs", but has been retrofitted to "secondary PTEs" so that it's not an
> outright lie when using stage-2 page tables), the pfn *must* be faulted into and
> mapped in the Primary MMU. I.e. under no circumstance can a SPTE point at memory
> that isn't mapped into the Primary MMU.
>
> Side note, except for VM_EXEC, protections for Secondary MMU mappings must also
> be a strict subset of the Primary MMU's mappings. E.g. KVM can't create a
> WRITABLE SPTE if the userspace VMA is read-only. EXEC protections are exempt,
> so that guest memory doesn't have to be mapped executable in the VMM, which would
> basically make the VMM a CVE factory :-)
>
> All of that holds true for hugepages as well, because that rule is just a special
> case of the general rule that all memory must be first mapped into the Primary
> MMU. Rather than query the backing store's allowed page size, KVM x86 simply
> looks at the Primary MMU's userspace page tables. Originally, KVM _did_ query
> the VMA directly for HugeTLB, but when things like DAX came along, we realized
> that poking into backing stores directly was going to be a maintenance nightmare.
>
> So instead, KVM was reworked to peek at the userspace page tables for everything,
> and knock wood, that approach has Just Worked for all backing stores.
>
> Which actually highlights the brilliance of having KVM be a Secondary MMU that's
> fully subordinate to the Primary MMU. Modulo some terrible logic with respect to
> VM_PFNMAP and "struct page" that has now been fixed, literally anything that can
> be mapped into the VMM can be mapped into a KVM guest, without KVM needing to
> know *anything* about the underlying memory.
>
> Jumping back to guest_memfd, the main principle of guest_memfd is that it allows
> _KVM_ to be the Primary MMU (mm/ is now becoming another "primary" MMU, but I
> would call KVM 1a and mm/ 1b).
> Instead of the VMM's address space and page
> tables being the source of truth, guest_memfd is the source of truth. And that's
> why I'm so adamant that host_pfn_mapping_level() is completely out of scope for
> guest_memfd; that API _only_ makes sense when KVM is operating as a Secondary MMU.

Thanks! Appreciate the detailed response :) It fits together for me now.
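For my own notes, the two regimes for capping the mapping level, sketched as a toy (function and enum names invented, not the real kvm_mmu_max_mapping_level()):

```c
/*
 * Toy sketch, not KVM code.  Pre-guest_memfd, the max mapping level is
 * capped by whatever the Primary MMU (the userspace page tables) mapped,
 * which is what host_pfn_mapping_level() discovers by peeking at those
 * tables.  With guest_memfd as the source of truth, the host userspace
 * mapping has zero influence; only guest_memfd's own backing size matters.
 */
enum level { PG_4K = 1, PG_2M = 2, PG_1G = 3 };

static inline enum level min_level(enum level a, enum level b)
{
	return a < b ? a : b;
}

/* Legacy: KVM as a Secondary MMU, subordinate to the host mapping. */
enum level max_level_legacy(enum level host_level, enum level req)
{
	return min_level(host_level, req);
}

/* guest_memfd: KVM as the Primary MMU; only guest_memfd's level counts. */
enum level max_level_gmem(enum level gmem_level, enum level req)
{
	return min_level(gmem_level, req);
}
```

So in my earlier 2M-vs-4K question, the host's 4K mapping simply never enters the guest_memfd calculation.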