On Thu, 2025-08-28 at 15:54 +0100, Mike Rapoport wrote: > On Thu, Aug 28, 2025 at 09:39:21AM +0000, Roy, Patrick wrote: >> Add GUEST_MEMFD_FLAG_NO_DIRECT_MAP flag for KVM_CREATE_GUEST_MEMFD() >> ioctl. When set, guest_memfd folios will be removed from the direct map >> after preparation, with direct map entries only restored when the folios >> are freed. >> >> To ensure these folios do not end up in places where the kernel cannot >> deal with them, set AS_NO_DIRECT_MAP on the guest_memfd's struct >> address_space if GUEST_MEMFD_FLAG_NO_DIRECT_MAP is requested. >> >> Add KVM_CAP_GUEST_MEMFD_NO_DIRECT_MAP to let userspace discover whether >> guest_memfd supports GUEST_MEMFD_FLAG_NO_DIRECT_MAP. Support depends on >> guest_memfd itself being supported, but also on whether KVM can >> manipulate the direct map at page granularity at all (possible most of >> the time, just arm64 is a notable outlier where its impossible if the >> direct map has been setup using hugepages, as arm64 cannot break these >> apart due to break-before-make semantics). > > There's also powerpc that does not select ARCH_HAS_SET_DIRECT_MAP Ah, thanks! Although powerpc also doesnt support guest_memfd in the first place, but will mention. >> Note that this flag causes removal of direct map entries for all >> guest_memfd folios independent of whether they are "shared" or "private" >> (although current guest_memfd only supports either all folios in the >> "shared" state, or all folios in the "private" state if >> GUEST_MEMFD_FLAG_MMAP is not set). The usecase for removing direct map >> entries of also the shared parts of guest_memfd are a special type of >> non-CoCo VM where, host userspace is trusted to have access to all of >> guest memory, but where Spectre-style transient execution attacks >> through the host kernel's direct map should still be mitigated. In this >> setup, KVM retains access to guest memory via userspace mappings of >> guest_memfd, which are reflected back into KVM's memslots via >> userspace_addr. This is needed for things like MMIO emulation on x86_64 >> to work. >> >> Do not perform TLB flushes after direct map manipulations. This is >> because TLB flushes resulted in a up to 40x elongation of page faults in >> guest_memfd (scaling with the number of CPU cores), or a 5x elongation >> of memory population. TLB flushes are not needed for functional >> correctness (the virt->phys mapping technically stays "correct", the >> kernel should simply to not it for a while). On the other hand, it means > > ^ not use it? Yup, thanks! >> that the desired protection from Spectre-style attacks is not perfect, >> as an attacker could try to prevent a stale TLB entry from getting >> evicted, keeping it alive until the page it refers to is used by the >> guest for some sensitive data, and then targeting it using a >> spectre-gadget. >> >> Signed-off-by: Patrick Roy <roypat@xxxxxxxxxxxx> >> --- >> arch/arm64/include/asm/kvm_host.h | 12 ++++++++++++ >> include/linux/kvm_host.h | 7 +++++++ >> include/uapi/linux/kvm.h | 2 ++ >> virt/kvm/guest_memfd.c | 29 +++++++++++++++++++++++++---- >> virt/kvm/kvm_main.c | 5 +++++ >> 5 files changed, 51 insertions(+), 4 deletions(-) > > ... > >> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c >> index 9ec4c45e3cf2..e3696880405c 100644 >> --- a/virt/kvm/guest_memfd.c >> +++ b/virt/kvm/guest_memfd.c >> @@ -4,6 +4,7 @@ >> #include <linux/kvm_host.h> >> #include <linux/pagemap.h> >> #include <linux/anon_inodes.h> >> +#include <linux/set_memory.h> >> >> #include "kvm_mm.h" >> >> @@ -42,8 +43,18 @@ static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slo >> return 0; >> } >> >> +static bool kvm_gmem_test_no_direct_map(struct inode *inode) >> +{ >> + return ((unsigned long) inode->i_private) & GUEST_MEMFD_FLAG_NO_DIRECT_MAP; >> +} >> + >> static inline void kvm_gmem_mark_prepared(struct folio *folio) >> { >> + struct inode *inode = folio_inode(folio); >> + >> + if (kvm_gmem_test_no_direct_map(inode)) >> + set_direct_map_valid_noflush(folio_page(folio, 0), folio_nr_pages(folio), false); > > This may fail to split large mapping in the direct map. Why not move this > to kvm_gmem_prepare_folio() where you can handle returned error? Argh, yeah, got that the wrong way around. Will update the error handling. > I think that using set_direct_map_invalid_noflush() here and > set_direct_map_default_noflush() in kvm_gmem_free_folio() better is > clearer and makes it more obvious that here the folio is removed from the > direct map and when freed it's direct mapping is restored. > > This requires to export two symbols in patch 2, but I think it's worth it. Mh, but set_direct_map_[default|invalid]_noflush() only take a single struct page * argument, so they'd either need to gain a npages argument, or we add yet more functions to set_memory.h. Do you still think that's worth it? >> folio_mark_uptodate(folio); >> } >> >> @@ -429,25 +440,29 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol >> return MF_DELAYED; >> } >> >> -#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE >> static void kvm_gmem_free_folio(struct address_space *mapping, >> struct folio *folio) >> { >> struct page *page = folio_page(folio, 0); >> + >> +#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE >> kvm_pfn_t pfn = page_to_pfn(page); >> int order = folio_order(folio); >> +#endif >> >> + if (kvm_gmem_test_no_direct_map(mapping->host)) >> + WARN_ON_ONCE(set_direct_map_valid_noflush(page, folio_nr_pages(folio), true)); > > I don't think it can fail here. The direct map was split when you removed > the folio so here it will merely update the prot bits. Yup, will drop this WARN_ON_ONCE. >> +#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE >> kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order)); >> -} >> #endif >> +} > > Instead of moving #ifdefs into kvm_gmem_free_folio() it's better to add, say, > kvm_gmem_invalidate() and move ifdefery there or even better have a static > inline stub for !CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE case. Ack, will do the latter >> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c >> index 18f29ef93543..0dbfd17e1191 100644 >> --- a/virt/kvm/kvm_main.c >> +++ b/virt/kvm/kvm_main.c >> @@ -65,6 +65,7 @@ >> #include <trace/events/kvm.h> >> >> #include <linux/kvm_dirty_ring.h> >> +#include <linux/set_memory.h> >> >> >> /* Worst case buffer size needed for holding an integer. */ >> @@ -4916,6 +4917,10 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg) >> return kvm_supported_mem_attributes(kvm); >> #endif >> #ifdef CONFIG_KVM_GUEST_MEMFD >> + case KVM_CAP_GUEST_MEMFD_NO_DIRECT_MAP: >> + if (!can_set_direct_map()) > > Shouldn't this check with kvm_arch_gmem_supports_no_direct_map()? Absolutely, thanks for catching! >> + return false; >> + fallthrough; >> case KVM_CAP_GUEST_MEMFD: >> return 1; >> case KVM_CAP_GUEST_MEMFD_MMAP: >> -- >> 2.50.1 >> > > -- > Sincerely yours, > Mike. Best, Patrick