Re: [PATCH v12 10/18] KVM: x86/mmu: Handle guest page faults for guest_memfd with shared memory

Ackerley Tng <ackerleytng@xxxxxxxxxx> · Tue, 08 Jul 2025 06:44:14 -0700

Sean Christopherson <seanjc@xxxxxxxxxx> writes:

> On Tue, Jul 01, 2025, David Hildenbrand wrote:
>> > > > I support this approach.
>> > > 
>> > > Agreed. Let's get this in with the changes requested by Sean applied.
>> > > 
>> > > How to use GUEST_MEMFD_FLAG_MMAP in combination with a CoCo VM with
>> > > legacy mem attributes (-> all memory in guest_memfd private) could be
>> > > added later on top, once really required.
>> > > 
>> > > As discussed, CoCo VMs that want to support GUEST_MEMFD_FLAG_MMAP will
>> > > have to disable legacy mem attributes using a new capability in stage-2.
>> > > 
>> > 
>> > I rewatched the guest_memfd meeting on 2025-06-12.  We do want to
>> > support the use case where userspace wants to have mmap (e.g. to set
>> > mempolicy) but does not want to allow faulting into the host.
>> > 
>> > On 2025-06-12, the conclusion was that the problem will be solved once
>> > guest_memfd supports shareability, and that's because userspace can set
>> > shareability to GUEST, so the memory can't be faulted into the host.
>> > 
>> > On 2025-06-26, Sean said we want to let userspace have an extra layer of
>> > protection so that memory cannot be faulted in to the host, ever. IOW,
>> > we want to let userspace say that even if there is a stray
>> > private-to-shared conversion, *don't* allow faulting memory into the
>> > host.
>
> Eh, my comments were more along the lines of "it would be nice if we could have
> such protections", not a "we must support this".  And I suspect that making the
> behavior all-or-nothing for a given guest_memfd wouldn't be very useful, i.e.
> that userspace would probably want to be able to prevent accessing a specific
> chunk of the gmem instance.
>
> Actually, we can probably get that via mseal(), maybe even for free today?  E.g.
> mmap() w/ PROT_NONE, mbind(), and then mseal().
>
> So yeah, I think we do nothing for now.
>
>> > The difference is the "extra layer of protection", which should remain
>> > in effect even if there are (stray/unexpected) private-to-shared
>> > conversions to guest_memfd or to KVM. Here's a direct link to the point
>> > in the video where Sean brought this up [1]. I'm really hoping I didn't
>> > misinterpret this!
>> > 
>> > Let me look ahead a little, since this involves use cases already
>> > brought up though I'm not sure how real they are. I just want to make
>> > sure that in a few patch series' time, we don't end up needing userspace
>> > to use a complex bunch of CAPs and FLAGs.
>> > 
>> > In this series (mmap support, V12, patch 10/18) [2], to allow
>> > KVM_X86_DEFAULT_VMs to use guest_memfd, I added a `fault_from_gmem()`
>> > helper, which is defined as follows (before the renaming Sean requested):
>> > 
>> > +static inline bool fault_from_gmem(struct kvm_page_fault *fault)
>> > +{
>> > +	return fault->is_private || kvm_gmem_memslot_supports_shared(fault->slot);
>> > +}
>> > 
>> > The above is changeable, of course :). The intention is that if the
>> > fault is private, fault from guest_memfd. If GUEST_MEMFD_FLAG_MMAP is
>> > set (KVM_MEMSLOT_GMEM_ONLY will be set on the memslot), fault from
>> > guest_memfd.
>> > 
>> > If we defer handling GUEST_MEMFD_FLAG_MMAP in combination with a CoCo VM
>> > with legacy mem attributes to the future, this helper will probably
>> > become
>> > 
>> > -static inline bool fault_from_gmem(struct kvm_page_fault *fault)
>> > +static inline bool fault_from_gmem(struct kvm *kvm, struct kvm_page_fault *fault)
>> > +{
>> > -	return fault->is_private || kvm_gmem_memslot_supports_shared(fault->slot);
>> > +	return fault->is_private || (kvm_gmem_memslot_supports_shared(fault->slot) &&
>> > +	                             !kvm_arch_disable_legacy_private_tracking(kvm));
>> > +}
>> > 
>> > And on memslot binding we check
>> > 
>> > if kvm_arch_disable_legacy_private_tracking(kvm)
>
> I would invert the KVM-internal arch hook, and only have KVM x86's capability refer
> to the private memory attribute as legacy (because it simply doesn't exist for
> any thing else).
>
>> > and not GUEST_MEMFD_FLAG_MMAP
>> > 	return -EINVAL;
>> > 
>> > 1. Is that what yall meant?
>
> I was thinking:
>
> 	if (kvm_arch_has_private_memory_attribute(kvm) ==
> 	    kvm_gmem_mmap(...))
> 		return -EINVAL;
>
> I.e. in addition to requiring mmap() when KVM doesn't track private/sahred via
> memory attributes, also disallow mmap() when private/shared is tracked via memory
> attributes.
>
>> My understanding:
>> 
>> CoCo VMs will initially (stage-1) only support !GUEST_MEMFD_FLAG_MMAP.
>> 
>> With stage-2, CoCo VMs will support GUEST_MEMFD_FLAG_MMAP only with
>> kvm_arch_disable_legacy_private_tracking().
>
> Yep, and everything except x86 will unconditionally return true for
> kvm_arch_disable_legacy_private_tracking() (or false if it's inverted as above).
>
>> Non-CoCo VMs will only support GUEST_MEMFD_FLAG_MMAP. (no concept of
>> private)
>> 
>> > 
>> > 2. Does this kind of not satisfy the "extra layer of protection"
>> >     requirement (if it is a requirement)?
>
> It's not a requirement.
>
>> >     A legacy CoCo VM using guest_memfd only for private memory (shared
>> >     memory from say, shmem) and needing to set mempolicy would
>> >     * Set GUEST_MEMFD_FLAG_MMAP
>
> I think we should keep it simple as above, and not support mmap() (and therefore
> mbind()) with legacy CoCo VMs.  Given the double allocation flaws with the legacy
> approach, supporting mbind() seems like putting a bandaid on a doomed idea.
>
>> >     * Leave KVM_CAP_DISABLE_LEGACY_PRIVATE_TRACKING defaulted to false
>> >     but still be able to send conversion ioctls directly to guest_memfd,
>> >     and then be able to fault guest_memfd memory into the host.
>> 
>> In that configuration, I would expect that all memory in guest_memfd is
>> private and remains private.
>> 
>> guest_memfd without memory attributes cannot support in-place conversion.
>> 
>> How to achieve that might be interesting: the capability will affect
>> guest_memfd behavior?
>> 
>> > 
>> > 3. Now for a use case I've heard of (feel free to tell me this will
>> >     never be supported or "we'll deal with it if it comes"): On a
>> >     non-CoCo VM, we want to use guest_memfd but not use mmap (and the
>> >     initial VM image will be written using write() syscall or something
>> >     else).
>> > 
>> >     * Set GUEST_MEMFD_FLAG_MMAP to false
>> >     * Leave KVM_CAP_DISABLE_LEGACY_PRIVATE_TRACKING defaulted to false
>> >       (it's a non-CoCo VM, weird to do anything to do with private
>> >       tracking)
>> > 
>> >     And now we're stuck because fault_from_gmem() will return false all
>> >     the time and we can't use memory from guest_memfd.
>
> Nah, don't support this scenario.  Or rather, use mseal() as above.  If someone
> comes along with a concrete, strong use case for backing non-CoCo VMs and using
> mseal() to wall off guest memory doesn't suffice, then they can have the honor
> of justifying why KVM needs to take on more complexity.  :-)
>
>> I think I discussed that with Sean: we would have GUEST_MEMFD_FLAG_WRITE
>> that will imply everything that GUEST_MEMFD_FLAG_MMAP would imply, except
>> the actual mmap() support.
>
> Ya, for the write() access or whatever.  But there are bigger problems beyond
> populating the memory, e.g. a non-CoCo VM won't support private memory, so without
> many more changes to redirect KVM to gmem when faulting in guest memory, KVM won't
> be able to map any memory into the guest.

Thanks for clarifying everything above :). Next respin (with Fuad's
help) coming soon!