Re: [PATCH v12 10/18] KVM: x86/mmu: Handle guest page faults for guest_memfd with shared memory

Sean Christopherson <seanjc@xxxxxxxxxx> · Mon, 7 Jul 2025 17:05:31 -0700

On Tue, Jul 01, 2025, David Hildenbrand wrote:
> > > > I support this approach.
> > > 
> > > Agreed. Let's get this in with the changes requested by Sean applied.
> > > 
> > > How to use GUEST_MEMFD_FLAG_MMAP in combination with a CoCo VM with
> > > legacy mem attributes (-> all memory in guest_memfd private) could be
> > > added later on top, once really required.
> > > 
> > > As discussed, CoCo VMs that want to support GUEST_MEMFD_FLAG_MMAP will
> > > have to disable legacy mem attributes using a new capability in stage-2.
> > > 
> > 
> > I rewatched the guest_memfd meeting on 2025-06-12.  We do want to
> > support the use case where userspace wants to have mmap (e.g. to set
> > mempolicy) but does not want to allow faulting into the host.
> > 
> > On 2025-06-12, the conclusion was that the problem will be solved once
> > guest_memfd supports shareability, and that's because userspace can set
> > shareability to GUEST, so the memory can't be faulted into the host.
> > 
> > On 2025-06-26, Sean said we want to let userspace have an extra layer of
> > protection so that memory cannot be faulted in to the host, ever. IOW,
> > we want to let userspace say that even if there is a stray
> > private-to-shared conversion, *don't* allow faulting memory into the
> > host.

Eh, my comments were more along the lines of "it would be nice if we could have
such protections", not a "we must support this".  And I suspect that making the
behavior all-or-nothing for a given guest_memfd wouldn't be very useful, i.e.
that userspace would probably want to be able to prevent accessing a specific
chunk of the gmem instance.

Actually, we can probably get that via mseal(), maybe even for free today?  E.g.
mmap() w/ PROT_NONE, mbind(), and then mseal().

So yeah, I think we do nothing for now.

> > The difference is the "extra layer of protection", which should remain
> > in effect even if there are (stray/unexpected) private-to-shared
> > conversions to guest_memfd or to KVM. Here's a direct link to the point
> > in the video where Sean brought this up [1]. I'm really hoping I didn't
> > misinterpret this!
> > 
> > Let me look ahead a little, since this involves use cases already
> > brought up though I'm not sure how real they are. I just want to make
> > sure that in a few patch series' time, we don't end up needing userspace
> > to use a complex bunch of CAPs and FLAGs.
> > 
> > In this series (mmap support, V12, patch 10/18) [2], to allow
> > KVM_X86_DEFAULT_VMs to use guest_memfd, I added a `fault_from_gmem()`
> > helper, which is defined as follows (before the renaming Sean requested):
> > 
> > +static inline bool fault_from_gmem(struct kvm_page_fault *fault)
> > +{
> > +	return fault->is_private || kvm_gmem_memslot_supports_shared(fault->slot);
> > +}
> > 
> > The above is changeable, of course :). The intention is that if the
> > fault is private, fault from guest_memfd. If GUEST_MEMFD_FLAG_MMAP is
> > set (KVM_MEMSLOT_GMEM_ONLY will be set on the memslot), fault from
> > guest_memfd.
> > 
> > If we defer handling GUEST_MEMFD_FLAG_MMAP in combination with a CoCo VM
> > with legacy mem attributes to the future, this helper will probably
> > become
> > 
> > -static inline bool fault_from_gmem(struct kvm_page_fault *fault)
> > +static inline bool fault_from_gmem(struct kvm *kvm, struct kvm_page_fault *fault)
> > +{
> > -	return fault->is_private || kvm_gmem_memslot_supports_shared(fault->slot);
> > +	return fault->is_private || (kvm_gmem_memslot_supports_shared(fault->slot) &&
> > +	                             !kvm_arch_disable_legacy_private_tracking(kvm));
> > +}
> > 
> > And on memslot binding we check
> > 
> > if kvm_arch_disable_legacy_private_tracking(kvm)

I would invert the KVM-internal arch hook, and only have KVM x86's capability refer
to the private memory attribute as legacy (because it simply doesn't exist for
any thing else).

> > and not GUEST_MEMFD_FLAG_MMAP
> > 	return -EINVAL;
> > 
> > 1. Is that what yall meant?

I was thinking:

	if (kvm_arch_has_private_memory_attribute(kvm) ==
	    kvm_gmem_mmap(...))
		return -EINVAL;

I.e. in addition to requiring mmap() when KVM doesn't track private/sahred via
memory attributes, also disallow mmap() when private/shared is tracked via memory
attributes.

> My understanding:
> 
> CoCo VMs will initially (stage-1) only support !GUEST_MEMFD_FLAG_MMAP.
> 
> With stage-2, CoCo VMs will support GUEST_MEMFD_FLAG_MMAP only with
> kvm_arch_disable_legacy_private_tracking().

Yep, and everything except x86 will unconditionally return true for
kvm_arch_disable_legacy_private_tracking() (or false if it's inverted as above).

> Non-CoCo VMs will only support GUEST_MEMFD_FLAG_MMAP. (no concept of
> private)
> 
> > 
> > 2. Does this kind of not satisfy the "extra layer of protection"
> >     requirement (if it is a requirement)?

It's not a requirement.

> >     A legacy CoCo VM using guest_memfd only for private memory (shared
> >     memory from say, shmem) and needing to set mempolicy would
> >     * Set GUEST_MEMFD_FLAG_MMAP

I think we should keep it simple as above, and not support mmap() (and therefore
mbind()) with legacy CoCo VMs.  Given the double allocation flaws with the legacy
approach, supporting mbind() seems like putting a bandaid on a doomed idea.

> >     * Leave KVM_CAP_DISABLE_LEGACY_PRIVATE_TRACKING defaulted to false
> >     but still be able to send conversion ioctls directly to guest_memfd,
> >     and then be able to fault guest_memfd memory into the host.
> 
> In that configuration, I would expect that all memory in guest_memfd is
> private and remains private.
> 
> guest_memfd without memory attributes cannot support in-place conversion.
> 
> How to achieve that might be interesting: the capability will affect
> guest_memfd behavior?
> 
> > 
> > 3. Now for a use case I've heard of (feel free to tell me this will
> >     never be supported or "we'll deal with it if it comes"): On a
> >     non-CoCo VM, we want to use guest_memfd but not use mmap (and the
> >     initial VM image will be written using write() syscall or something
> >     else).
> > 
> >     * Set GUEST_MEMFD_FLAG_MMAP to false
> >     * Leave KVM_CAP_DISABLE_LEGACY_PRIVATE_TRACKING defaulted to false
> >       (it's a non-CoCo VM, weird to do anything to do with private
> >       tracking)
> > 
> >     And now we're stuck because fault_from_gmem() will return false all
> >     the time and we can't use memory from guest_memfd.

Nah, don't support this scenario.  Or rather, use mseal() as above.  If someone
comes along with a concrete, strong use case for backing non-CoCo VMs and using
mseal() to wall off guest memory doesn't suffice, then they can have the honor
of justifying why KVM needs to take on more complexity.  :-)

> I think I discussed that with Sean: we would have GUEST_MEMFD_FLAG_WRITE
> that will imply everything that GUEST_MEMFD_FLAG_MMAP would imply, except
> the actual mmap() support.

Ya, for the write() access or whatever.  But there are bigger problems beyond
populating the memory, e.g. a non-CoCo VM won't support private memory, so without
many more changes to redirect KVM to gmem when faulting in guest memory, KVM won't
be able to map any memory into the guest.