[ based on kvm/next ]

Unmapping virtual machine guest memory from the host kernel's direct map
is an effective mitigation against Spectre-style transient execution
issues: if the kernel page tables do not contain entries pointing to
guest memory, then any attempted speculative read through the direct map
will necessarily be blocked by the MMU before any observable
microarchitectural side-effects happen. This means that Spectre-gadgets
and similar cannot be used to target virtual machine memory. Roughly 60%
of speculative execution issues fall into this category [1, Table 1].

This patch series extends guest_memfd with the ability to remove its
memory from the host kernel's direct map, to be able to attain the above
protection for KVM guests running inside guest_memfd.

=== Design ===

We build on top of guest_memfd's recent support for "non-confidential
VMs", in which all of guest_memfd is mappable to userspace (i.e.
considered "shared"). For such VMs, all guest page faults are routed
through guest_memfd's special page fault handler, which, because it
consumes fd+offset directly, can map direct map removed memory into the
guest.

KVM's internal accesses to guest memory are handled by providing each
memslot with a userspace mapping of that memslot's guest_memfd via
userspace_addr. Since KVM's internal accesses are almost exclusively
handled via copy_from_user() and friends, this allows KVM to access
direct map removed guest memory for features such as MMIO instruction
emulation on x86 or pvtime support on ARM64.

=== Implementation ===

The KVM_CREATE_GUEST_MEMFD ioctl gains a new flag
GUEST_MEMFD_FLAG_NO_DIRECT_MAP. If this flag is passed, then guest_memfd
removes direct map entries for its folios after preparation. Upon
freeing of the memory, direct map entries are restored prior to gmem's
arch specific invalidation callback.

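For illustration, the userspace side of this could look roughly like the
sketch below. It is not taken from the series: create_gmem_no_direct_map()
is a hypothetical helper name, and the code assumes uapi headers that
already define GUEST_MEMFD_FLAG_NO_DIRECT_MAP (i.e. headers from a kernel
with this series applied). With older headers the helper simply fails
with ENOSYS. A real VMM would likely need to combine the flag with
whatever other guest_memfd flags its setup requires.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Hypothetical helper (not part of the series): create a guest_memfd of
 * `size` bytes on the VM referred to by `vm_fd`, asking KVM to remove the
 * backing folios from the host direct map.
 *
 * Returns the new guest_memfd fd on success, or -1 with errno set.
 */
int create_gmem_no_direct_map(int vm_fd, uint64_t size)
{
#if defined(KVM_CREATE_GUEST_MEMFD) && defined(GUEST_MEMFD_FLAG_NO_DIRECT_MAP)
	struct kvm_create_guest_memfd args;

	/* Reserved fields must be zero. */
	memset(&args, 0, sizeof(args));
	args.size = size;
	args.flags = GUEST_MEMFD_FLAG_NO_DIRECT_MAP;

	/* On success, the ioctl returns a new file descriptor. */
	return ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);
#else
	/* Assumption: headers predate this series, so the flag is unknown. */
	(void)vm_fd;
	(void)size;
	errno = ENOSYS;
	return -1;
#endif
}
```

The returned fd can then be attached to a memslot like any other
guest_memfd. On failure, a caller could fall back to creating the
guest_memfd without the flag.
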
Support for the flag can be discovered via the KVM_CAP_GMEM_NO_DIRECT_MAP
capability, which is only available if direct map modifications at 4k
granularity are architecturally possible and KVM can successfully map
direct map removed memory into the guest.

=== Testing ===

KVM selftests are extended to cover the above-described non-CoCo
workflows, where guest_memfd with direct map entries removed is used to
back all of guest memory, and to exercise some simple MMIO paths.
Additionally, a Firecracker branch with support for these VMs can be
found on GitHub [2].

=== Changes since v4 ===

- Rebase on top of kvm/next
- Stop using PG_private to track direct map removal state
- Fix build of KVM-as-a-module by using the new EXPORT_SYMBOL_FOR_MODULES

=== FAQ ===

--- why not reuse memfd_secret() / a bespoke guest memory solution? ---

Having guest memory be direct map removed means guest page faults cannot
be resolved by GUP-ing userspace mappings of guest memory, as GUP is
disabled for direct map removed memory (as currently GUP has no way to
understand that a specific GUP request will not subsequently dereference
page_address()). guest_memfd already has a special path inside KVM that
instead consumes fd+offset, so it makes sense to reuse this.
Additionally, it means that direct-map-removed VMs can benefit from
active development on guest_memfd, such as huge pages support.

--- why do KVM internal accesses through userspace page tables? ---

For traditional VMs, all KVM internal accesses are done through the
userspace_addr stored in a memslot, meaning no changes to most KVM code
are needed just to allow access to guest_memfd backed / direct map
removed guest memory of non-confidential VMs.

Previous iterations of this series tried to avoid userspace mappings,
instead attempting to dynamically restore direct map entries for
internal accesses [RFCv2], but this turned out to have a significant
performance impact, as well as additional complexity due to needing to
refcount direct map reinsertion operations and making them play nicely
with gmem truncations.

--- what doesn't work with direct map removed VMs? ---

The only thing I'm aware of is kvm-clock, since it tries to GUP guest
memory via gfn_to_pfn_cache. Realistically, this is only a problem on
AMD, as on Intel guests can use TSC as a clocksource (Intel allows
discovery of TSC frequency via CPUID, while AMD doesn't). AMD guests
fall back onto some calibration routine, which fails most of the time
though.

[1]: https://download.vusec.net/papers/quarantine_raid23.pdf
[2]: https://github.com/firecracker-microvm/firecracker/tree/feature/secret-hiding
[RFCv1]: https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@xxxxxxxxxxxx/
[RFCv2]: https://lore.kernel.org/kvm/20240910163038.1298452-1-roypat@xxxxxxxxxxxx/
[RFCv3]: https://lore.kernel.org/kvm/20241030134912.515725-1-roypat@xxxxxxxxxxxx/
[v4]: https://lore.kernel.org/kvm/20250221160728.1584559-1-roypat@xxxxxxxxxxxx/

Elliot Berman (1):
  filemap: Pass address_space mapping to ->free_folio()

Patrick Roy (11):
  arch: export set_direct_map_valid_noflush to KVM module
  mm: introduce AS_NO_DIRECT_MAP
  KVM: guest_memfd: Add flag to remove from direct map
  KVM: Documentation: describe GUEST_MEMFD_FLAG_NO_DIRECT_MAP
  KVM: selftests: load elf via bounce buffer
  KVM: selftests: set KVM_MEM_GUEST_MEMFD in vm_mem_add() if guest_memfd
    != -1
  KVM: selftests: Add guest_memfd based vm_mem_backing_src_types
  KVM: selftests: stuff vm_mem_backing_src_type into vm_shape
  KVM: selftests: cover GUEST_MEMFD_FLAG_NO_DIRECT_MAP in mem conversion
    tests
  KVM: selftests: cover GUEST_MEMFD_FLAG_NO_DIRECT_MAP in
    guest_memfd_test.c
  KVM: selftests: Test guest execution from direct map removed gmem

 Documentation/filesystems/locking.rst         |  2 +-
 Documentation/virt/kvm/api.rst                |  5 ++
 arch/arm64/include/asm/kvm_host.h             | 12 ++++
 arch/arm64/mm/pageattr.c                      |  1 +
 arch/loongarch/mm/pageattr.c                  |  1 +
 arch/riscv/mm/pageattr.c                      |  1 +
 arch/s390/mm/pageattr.c                       |  1 +
 arch/x86/mm/pat/set_memory.c                  |  1 +
 fs/nfs/dir.c                                  | 11 ++--
 fs/orangefs/inode.c                           |  3 +-
 include/linux/fs.h                            |  2 +-
 include/linux/kvm_host.h                      |  7 +++
 include/linux/pagemap.h                       | 16 +++++
 include/linux/secretmem.h                     | 18 ------
 include/uapi/linux/kvm.h                      |  2 +
 lib/buildid.c                                 |  4 +-
 mm/filemap.c                                  |  9 +--
 mm/gup.c                                      | 14 +----
 mm/mlock.c                                    |  2 +-
 mm/secretmem.c                                |  9 +--
 mm/vmscan.c                                   |  4 +-
 .../testing/selftests/kvm/guest_memfd_test.c  |  2 +
 .../testing/selftests/kvm/include/kvm_util.h  | 37 ++++++++---
 .../testing/selftests/kvm/include/test_util.h |  8 +++
 tools/testing/selftests/kvm/lib/elf.c         |  8 +--
 tools/testing/selftests/kvm/lib/io.c          | 23 +++++++
 tools/testing/selftests/kvm/lib/kvm_util.c    | 61 +++++++++++--------
 tools/testing/selftests/kvm/lib/test_util.c   |  8 +++
 tools/testing/selftests/kvm/lib/x86/sev.c     |  1 +
 .../selftests/kvm/pre_fault_memory_test.c     |  1 +
 .../selftests/kvm/set_memory_region_test.c    | 50 +++++++++++++--
 .../kvm/x86/private_mem_conversions_test.c    |  7 ++-
 virt/kvm/guest_memfd.c                        | 32 ++++++++--
 virt/kvm/kvm_main.c                           |  5 ++
 34 files changed, 264 insertions(+), 104 deletions(-)


base-commit: a6ad54137af92535cfe32e19e5f3bc1bb7dbd383
-- 
2.50.1