Sean Christopherson <seanjc@xxxxxxxxxx> writes: > On Tue, Jul 01, 2025, David Hildenbrand wrote: >> > > > I support this approach. >> > > >> > > Agreed. Let's get this in with the changes requested by Sean applied. >> > > >> > > How to use GUEST_MEMFD_FLAG_MMAP in combination with a CoCo VM with >> > > legacy mem attributes (-> all memory in guest_memfd private) could be >> > > added later on top, once really required. >> > > >> > > As discussed, CoCo VMs that want to support GUEST_MEMFD_FLAG_MMAP will >> > > have to disable legacy mem attributes using a new capability in stage-2. >> > > >> > >> > I rewatched the guest_memfd meeting on 2025-06-12. We do want to >> > support the use case where userspace wants to have mmap (e.g. to set >> > mempolicy) but does not want to allow faulting into the host. >> > >> > On 2025-06-12, the conclusion was that the problem will be solved once >> > guest_memfd supports shareability, and that's because userspace can set >> > shareability to GUEST, so the memory can't be faulted into the host. >> > >> > On 2025-06-26, Sean said we want to let userspace have an extra layer of >> > protection so that memory cannot be faulted in to the host, ever. IOW, >> > we want to let userspace say that even if there is a stray >> > private-to-shared conversion, *don't* allow faulting memory into the >> > host. > > Eh, my comments were more along the lines of "it would be nice if we could have > such protections", not a "we must support this". And I suspect that making the > behavior all-or-nothing for a given guest_memfd wouldn't be very useful, i.e. > that userspace would probably want to be able to prevent accessing a specific > chunk of the gmem instance. > > Actually, we can probably get that via mseal(), maybe even for free today? E.g. > mmap() w/ PROT_NONE, mbind(), and then mseal(). > > So yeah, I think we do nothing for now. > >> > The difference is the "extra layer of protection", which should remain >> > in effect even if there are (stray/unexpected) private-to-shared >> > conversions to guest_memfd or to KVM. Here's a direct link to the point >> > in the video where Sean brought this up [1]. I'm really hoping I didn't >> > misinterpret this! >> > >> > Let me look ahead a little, since this involves use cases already >> > brought up though I'm not sure how real they are. I just want to make >> > sure that in a few patch series' time, we don't end up needing userspace >> > to use a complex bunch of CAPs and FLAGs. >> > >> > In this series (mmap support, V12, patch 10/18) [2], to allow >> > KVM_X86_DEFAULT_VMs to use guest_memfd, I added a `fault_from_gmem()` >> > helper, which is defined as follows (before the renaming Sean requested): >> > >> > +static inline bool fault_from_gmem(struct kvm_page_fault *fault) >> > +{ >> > + return fault->is_private || kvm_gmem_memslot_supports_shared(fault->slot); >> > +} >> > >> > The above is changeable, of course :). The intention is that if the >> > fault is private, fault from guest_memfd. If GUEST_MEMFD_FLAG_MMAP is >> > set (KVM_MEMSLOT_GMEM_ONLY will be set on the memslot), fault from >> > guest_memfd. >> > >> > If we defer handling GUEST_MEMFD_FLAG_MMAP in combination with a CoCo VM >> > with legacy mem attributes to the future, this helper will probably >> > become >> > >> > -static inline bool fault_from_gmem(struct kvm_page_fault *fault) >> > +static inline bool fault_from_gmem(struct kvm *kvm, struct kvm_page_fault *fault) >> > +{ >> > - return fault->is_private || kvm_gmem_memslot_supports_shared(fault->slot); >> > + return fault->is_private || (kvm_gmem_memslot_supports_shared(fault->slot) && >> > + !kvm_arch_disable_legacy_private_tracking(kvm)); >> > +} >> > >> > And on memslot binding we check >> > >> > if kvm_arch_disable_legacy_private_tracking(kvm) > > I would invert the KVM-internal arch hook, and only have KVM x86's capability refer > to the private memory attribute as legacy (because it simply doesn't exist for > any thing else). > >> > and not GUEST_MEMFD_FLAG_MMAP >> > return -EINVAL; >> > >> > 1. Is that what yall meant? > > I was thinking: > > if (kvm_arch_has_private_memory_attribute(kvm) == > kvm_gmem_mmap(...)) > return -EINVAL; > > I.e. in addition to requiring mmap() when KVM doesn't track private/sahred via > memory attributes, also disallow mmap() when private/shared is tracked via memory > attributes. > >> My understanding: >> >> CoCo VMs will initially (stage-1) only support !GUEST_MEMFD_FLAG_MMAP. >> >> With stage-2, CoCo VMs will support GUEST_MEMFD_FLAG_MMAP only with >> kvm_arch_disable_legacy_private_tracking(). > > Yep, and everything except x86 will unconditionally return true for > kvm_arch_disable_legacy_private_tracking() (or false if it's inverted as above). > >> Non-CoCo VMs will only support GUEST_MEMFD_FLAG_MMAP. (no concept of >> private) >> >> > >> > 2. Does this kind of not satisfy the "extra layer of protection" >> > requirement (if it is a requirement)? > > It's not a requirement. > >> > A legacy CoCo VM using guest_memfd only for private memory (shared >> > memory from say, shmem) and needing to set mempolicy would >> > * Set GUEST_MEMFD_FLAG_MMAP > > I think we should keep it simple as above, and not support mmap() (and therefore > mbind()) with legacy CoCo VMs. Given the double allocation flaws with the legacy > approach, supporting mbind() seems like putting a bandaid on a doomed idea. > >> > * Leave KVM_CAP_DISABLE_LEGACY_PRIVATE_TRACKING defaulted to false >> > but still be able to send conversion ioctls directly to guest_memfd, >> > and then be able to fault guest_memfd memory into the host. >> >> In that configuration, I would expect that all memory in guest_memfd is >> private and remains private. >> >> guest_memfd without memory attributes cannot support in-place conversion. >> >> How to achieve that might be interesting: the capability will affect >> guest_memfd behavior? >> >> > >> > 3. Now for a use case I've heard of (feel free to tell me this will >> > never be supported or "we'll deal with it if it comes"): On a >> > non-CoCo VM, we want to use guest_memfd but not use mmap (and the >> > initial VM image will be written using write() syscall or something >> > else). >> > >> > * Set GUEST_MEMFD_FLAG_MMAP to false >> > * Leave KVM_CAP_DISABLE_LEGACY_PRIVATE_TRACKING defaulted to false >> > (it's a non-CoCo VM, weird to do anything to do with private >> > tracking) >> > >> > And now we're stuck because fault_from_gmem() will return false all >> > the time and we can't use memory from guest_memfd. > > Nah, don't support this scenario. Or rather, use mseal() as above. If someone > comes along with a concrete, strong use case for backing non-CoCo VMs and using > mseal() to wall off guest memory doesn't suffice, then they can have the honor > of justifying why KVM needs to take on more complexity. :-) > >> I think I discussed that with Sean: we would have GUEST_MEMFD_FLAG_WRITE >> that will imply everything that GUEST_MEMFD_FLAG_MMAP would imply, except >> the actual mmap() support. > > Ya, for the write() access or whatever. But there are bigger problems beyond > populating the memory, e.g. a non-CoCo VM won't support private memory, so without > many more changes to redirect KVM to gmem when faulting in guest memory, KVM won't > be able to map any memory into the guest. Thanks for clarifying everything above :). Next respin (with Fuad's help) coming soon!