Hi Fuad,
On 5/16/25 5:56 PM, Fuad Tabba wrote:
On Fri, 16 May 2025 at 08:09, Gavin Shan <gshan@xxxxxxxxxx> wrote:
On 5/14/25 2:34 AM, Fuad Tabba wrote:
This patch enables support for shared memory in guest_memfd, including
mapping that memory at the host userspace. This support is gated by the
configuration option KVM_GMEM_SHARED_MEM, and toggled by the guest_memfd
flag GUEST_MEMFD_FLAG_SUPPORT_SHARED, which can be set when creating a
guest_memfd instance.
Co-developed-by: Ackerley Tng <ackerleytng@xxxxxxxxxx>
Signed-off-by: Ackerley Tng <ackerleytng@xxxxxxxxxx>
Signed-off-by: Fuad Tabba <tabba@xxxxxxxxxx>
---
arch/x86/include/asm/kvm_host.h | 10 ++++
include/linux/kvm_host.h | 13 +++++
include/uapi/linux/kvm.h | 1 +
virt/kvm/Kconfig | 5 ++
virt/kvm/guest_memfd.c | 88 +++++++++++++++++++++++++++++++++
5 files changed, 117 insertions(+)
[...]
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 6db515833f61..8e6d1866b55e 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -312,7 +312,88 @@ static pgoff_t kvm_gmem_get_index(struct kvm_memory_slot *slot, gfn_t gfn)
return gfn - slot->base_gfn + slot->gmem.pgoff;
}
+#ifdef CONFIG_KVM_GMEM_SHARED_MEM
+
+static bool kvm_gmem_supports_shared(struct inode *inode)
+{
+ uint64_t flags = (uint64_t)inode->i_private;
+
+ return flags & GUEST_MEMFD_FLAG_SUPPORT_SHARED;
+}
+
+static vm_fault_t kvm_gmem_fault_shared(struct vm_fault *vmf)
+{
+ struct inode *inode = file_inode(vmf->vma->vm_file);
+ struct folio *folio;
+ vm_fault_t ret = VM_FAULT_LOCKED;
+
+ filemap_invalidate_lock_shared(inode->i_mapping);
+
+ folio = kvm_gmem_get_folio(inode, vmf->pgoff);
+ if (IS_ERR(folio)) {
+ int err = PTR_ERR(folio);
+
+ if (err == -EAGAIN)
+ ret = VM_FAULT_RETRY;
+ else
+ ret = vmf_error(err);
+
+ goto out_filemap;
+ }
+
+ if (folio_test_hwpoison(folio)) {
+ ret = VM_FAULT_HWPOISON;
+ goto out_folio;
+ }
+
+ if (WARN_ON_ONCE(folio_test_large(folio))) {
+ ret = VM_FAULT_SIGBUS;
+ goto out_folio;
+ }
+
I don't think there is a large folio involved since the max/min folio order
(stored in struct address_space::flags) should have been set to 0, meaning
only order-0 is possible when the folio (page) is allocated and added to the
page-cache. More details can be referred to AS_FOLIO_ORDER_MASK. It's unnecessary
check but not harmful. Maybe a comment is needed to mention large folio isn't
around yet, but double confirm.
The idea is to document the lack of hugepage support in code, but if
you think it's necessary, I could add a comment.
Ok, I was actually nit-picky since we're at v9, which is close to integration,
I guess. If another respin is needed, a comment wouldn't be harmful, but it's
also perfectly fine without it :)
+ if (!folio_test_uptodate(folio)) {
+ clear_highpage(folio_page(folio, 0));
+ kvm_gmem_mark_prepared(folio);
+ }
+
I must be missing some thing here. This chunk of code is out of sync to kvm_gmem_get_pfn(),
where kvm_gmem_prepare_folio() and kvm_arch_gmem_prepare() are executed, and then
PG_uptodate is set after that. In the latest ARM CCA series, kvm_arch_gmem_prepare()
isn't used, but it would delegate the folio (page) with the prerequisite that
the folio belongs to the private address space.
I guess that kvm_arch_gmem_prepare() is skipped here because we have the assumption that
the folio belongs to the shared address space? However, this assumption isn't always
true. We probably need to ensure the folio range is really belonging to the shared
address space by poking kvm->mem_attr_array, which can be modified by VMM through
ioctl KVM_SET_MEMORY_ATTRIBUTES.
This series only supports shared memory, and the idea is not to use
the attributes to check. We ensure that only certain VM types can set
the flag (e.g., VM_TYPE_DEFAULT and KVM_X86_SW_PROTECTED_VM).
In the patch series that builds on it, with in-place conversion
between private and shared, we do add a check that the memory faulted
in is in-fact shared.
Ok, thanks for your clarification. I plan to review that series, but not
getting a chance yet. Right, it's sensible to limit the capability of modifying
page's attribute (private vs shared) to the particular machine types since
the whole feature (restricted mmap and in-place conversion) is applicable
to particular machine types. I can understand KVM_X86_SW_PROTECTED_VM
(similar to pKVM) needs the feature, but I don't understand why VM_TYPE_DEFAULT
needs the feature. I guess we may want to use guest-memfd as to tmpfs or
shmem, meaning all the address space associated with a guest-memfd is shared,
but without the corresponding private space pointed by struct kvm_userspace_memory_region2
::userspace_addr. Instead, the 'userspace_addr' will be mmap(guest-memfd) from
VMM's perspective if I'm correct.
Thanks,
Gavin
Thanks,
/fuad
+ vmf->page = folio_file_page(folio, vmf->pgoff);
+
+out_folio:
+ if (ret != VM_FAULT_LOCKED) {
+ folio_unlock(folio);
+ folio_put(folio);
+ }
+
+out_filemap:
+ filemap_invalidate_unlock_shared(inode->i_mapping);
+
+ return ret;
+}
+
Thanks,
Gavin