Sean Christopherson <seanjc@xxxxxxxxxx> writes:

> On Fri, Jul 25, 2025, Ackerley Tng wrote:
>> Sean Christopherson <seanjc@xxxxxxxxxx> writes:
>>
>> > On Thu, Jul 24, 2025, Ackerley Tng wrote:
>> >> Fuad Tabba <tabba@xxxxxxxxxx> writes:
>> >> >  int kvm_mmu_max_mapping_level(struct kvm *kvm, struct kvm_page_fault *fault,
>> >> > @@ -3362,8 +3371,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm, struct kvm_page_fault *fault,
>> >> >  	if (max_level == PG_LEVEL_4K)
>> >> >  		return PG_LEVEL_4K;
>> >> >
>> >> > -	if (is_private)
>> >> > -		host_level = kvm_max_private_mapping_level(kvm, fault, slot, gfn);
>> >> > +	if (is_private || kvm_memslot_is_gmem_only(slot))
>> >> > +		host_level = kvm_gmem_max_mapping_level(kvm, fault, slot, gfn,
>> >> > +							is_private);
>> >> >  	else
>> >> >  		host_level = host_pfn_mapping_level(kvm, gfn, slot);
>> >>
>> >> No change required now, but I would like to point out that in this change
>> >> there's a bit of an assumption that if kvm_memslot_is_gmem_only(), even for
>> >> shared pages, guest_memfd will be the only source of truth.
>> >
>> > It's not an assumption, it's a hard requirement.
>> >
>> >> This holds now because shared pages are always split to 4K, but if
>> >> shared pages become larger, might the mapping in the host actually turn
>> >> out to be smaller?
>> >
>> > Yes, the host userspace mappings could be smaller, and supporting that scenario is
>> > very explicitly one of the design goals of guest_memfd. From commit a7800aa80ea4
>> > ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory"):
>> >
>> > : A guest-first memory subsystem allows for optimizations and enhancements
>> > : that are kludgy or outright infeasible to implement/support in a generic
>> > : memory subsystem. With guest_memfd, guest protections and mapping sizes
>> > : are fully decoupled from host userspace mappings. E.g. KVM currently
>> > : doesn't support mapping memory as writable in the guest without it also
>> > : being writable in host userspace, as KVM's ABI uses VMA protections to
>> > : define the allow guest protection. Userspace can fudge this by
>> > : establishing two mappings, a writable mapping for the guest and readable
>> > : one for itself, but that's suboptimal on multiple fronts.
>> > :
>> > : Similarly, KVM currently requires the guest mapping size to be a strict
>> > : subset of the host userspace mapping size, e.g. KVM doesn't support
>> > : creating a 1GiB guest mapping unless userspace also has a 1GiB guest
>> > : mapping. Decoupling the mappings sizes would allow userspace to precisely
>> > : map only what is needed without impacting guest performance, e.g. to
>> > : harden against unintentional accesses to guest memory.
>>
>> Let me try to understand this better. If/when guest_memfd supports
>> larger folios for shared pages, and guest_memfd returns a 2M folio from
>> kvm_gmem_fault_shared(), can the mapping in host userspace turn out
>> to be 4K?
>
> It can be 2M, 4K, or none.
>
>> If that happens, should kvm_gmem_max_mapping_level() return 4K for a
>> memslot with kvm_memslot_is_gmem_only() == true?
>
> No.
>
>> The above code would skip host_pfn_mapping_level() and return just what
>> guest_memfd reports, which is 2M.
>
> Yes.
>
>> Or do you mean that guest_memfd will be the source of truth in that it
>> must also know/control, in the above scenario, that the host mapping is
>> also 2M?
>
> No. The userspace mapping, _if_ there is one, is completely irrelevant. The
> entire point of guest_memfd is to eliminate the requirement that memory be
> mapped into host userspace in order for that memory to be mapped into the
> guest.
>
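Just to make sure I'm picturing the decoupling correctly, from the userspace
side the scenario we've been discussing could look something like the sketch
below. This is purely illustrative and leans on the mmap() support this
series adds; vm_fd is assumed to be an existing VM fd, and I've hand-waved
away the exact guest_memfd creation flags:

#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

/* Illustrative sketch only, not a real selftest. */
static void *map_shared_gmem(int vm_fd, size_t len)
{
	struct kvm_create_guest_memfd gmem = {
		.size  = len,
		.flags = 0,	/* plus whatever flag this series uses to allow mmap() */
	};
	int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

	/*
	 * Host userspace view of the shared range, faulted in via
	 * kvm_gmem_fault_shared(). guest_memfd may back this with a 2M
	 * folio even if the host side only ever gets 4K PTEs...
	 */
	void *uaddr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
			   gmem_fd, 0);

	/* ...or userspace may tear this mapping down and have none at all. */
	return uaddr;
}

IIUC the guest-side mapping level never has to match (or even have) a
corresponding host userspace mapping.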
If it's not mapped into the host at all, host_pfn_mapping_level() would
default to 4K, and I think that's a safe default.

> Invoking host_pfn_mapping_level() isn't just undesirable, it's flat out wrong, as
> KVM will not verify slot->userspace_addr actually points at the (same) guest_memfd
> instance.
>

That's true too: invoking host_pfn_mapping_level() could return totally
wrong information if slot->userspace_addr points somewhere else entirely.

What if slot->userspace_addr is set up to match the fd+offset of the same
guest_memfd configured in the memslot, and kvm_gmem_max_mapping_level()
returns 2M, but the range is actually mapped into the host at 4K? A little
out of my depth here, but would recovering the mappings to the 2M level be
a problem?

For enforcement of the shared/private-ness of memory, recovering the
mappings to the 2M level is okay, since if some part of the range had been
private, guest_memfd wouldn't have returned 2M. As for alignment, if
guest_memfd could return 2M to kvm_gmem_max_mapping_level(), then
userspace_addr would have been 2M-aligned, which would correctly permit
mapping recovery to 2M, so that sounds like it works too.

Maybe the right solution here is that, since slot->userspace_addr need not
point at the same guest_memfd+offset configured in the memslot, when
guest_memfd responds to kvm_gmem_max_mapping_level(), it should check
whether the requested GFN is mapped in host userspace, and if so, return
the smaller of the two mapping levels. (Rough sketch at the end of this
mail.)

> To demonstrate, this must pass (and does once "KVM: x86/mmu: Handle guest page
> faults for guest_memfd with shared memory" is added back).
>

Makes sense :)

[snip]
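For concreteness, the "return the smaller of the two mapping levels" idea
above is roughly the following. This is a hand-wavy sketch, not real kernel
code: the wrapper name is made up, kvm_gmem_max_mapping_level()'s signature
is taken from the diff quoted above, and per your point it would still need
some way for guest_memfd to confirm that the host-side mapping it consults
actually maps this guest_memfd, rather than trusting slot->userspace_addr:

/*
 * Hypothetical sketch (as if it lived in arch/x86/kvm/mmu/mmu.c): clamp
 * the level guest_memfd reports to what the host userspace mapping, if
 * any, would support.
 */
static int gmem_mapping_level_clamped(struct kvm *kvm,
				      struct kvm_page_fault *fault,
				      const struct kvm_memory_slot *slot,
				      gfn_t gfn, bool is_private)
{
	int level = kvm_gmem_max_mapping_level(kvm, fault, slot, gfn, is_private);

	/*
	 * Private memory is never mapped into host userspace, and the
	 * slot may have no userspace mapping at all, so only consult the
	 * host side when there is something to consult.
	 */
	if (is_private || !slot->userspace_addr || level == PG_LEVEL_4K)
		return level;

	/*
	 * For illustration only: this trusts slot->userspace_addr, which
	 * is exactly what the discussion above says cannot be trusted.
	 */
	return min(level, host_pfn_mapping_level(kvm, gfn, slot));
}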