On 10.09.25 14:14, Pedro Falcato wrote:
On Tue, Aug 19, 2025 at 06:03:54PM -0700, Anthony Yznaga wrote:
From: Khalid Aziz <khalid@xxxxxxxxxx>
Add a pseudo filesystem that contains files and page table sharing
information that enables processes to share page table entries.
This patch adds the basic filesystem that can be mounted, a
CONFIG_MSHARE option to enable the feature, and documentation.
Signed-off-by: Khalid Aziz <khalid@xxxxxxxxxx>
Signed-off-by: Anthony Yznaga <anthony.yznaga@xxxxxxxxxx>
---
Documentation/filesystems/index.rst | 1 +
Documentation/filesystems/msharefs.rst | 96 +++++++++++++++++++++++++
include/uapi/linux/magic.h | 1 +
mm/Kconfig | 11 +++
mm/Makefile | 4 ++
mm/mshare.c | 97 ++++++++++++++++++++++++++
6 files changed, 210 insertions(+)
create mode 100644 Documentation/filesystems/msharefs.rst
create mode 100644 mm/mshare.c
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index 11a599387266..dcd6605eb228 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -102,6 +102,7 @@ Documentation for filesystem implementations.
fuse-passthrough
inotify
isofs
+ msharefs
nilfs2
nfs/index
ntfs3
diff --git a/Documentation/filesystems/msharefs.rst b/Documentation/filesystems/msharefs.rst
new file mode 100644
index 000000000000..3e5b7d531821
--- /dev/null
+++ b/Documentation/filesystems/msharefs.rst
@@ -0,0 +1,96 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================================
+Msharefs - A filesystem to support shared page tables
+=====================================================
+
+What is msharefs?
+-----------------
+
+msharefs is a pseudo filesystem that allows multiple processes to
+share page table entries for shared pages. To enable support for
+msharefs the kernel must be compiled with CONFIG_MSHARE set.
+
+msharefs is typically mounted like this::
+
+ mount -t msharefs none /sys/fs/mshare
+
+A file created on msharefs creates a new shared region where all
+processes mapping that region will map it using shared page table
+entries. Once the size of the region has been established via
+ftruncate() or fallocate(), the region can be mapped into processes
+and ioctls used to map and unmap objects within it. Note that an
+msharefs file is a control file and accessing mapped objects within
+a shared region through read or write of the file is not permitted.
+
Welp. I really really don't like this API.
I assume this has been discussed previously, but why do we need a new
magical pseudofs mounted under some random /sys directory?
But, ok, assuming we're thinking about something hugetlbfs-like, that's not too
bad, and programs already know how to use it.
+How to use mshare
+-----------------
+
+Here are the basic steps for using mshare:
+
+ 1. Mount msharefs on /sys/fs/mshare::
+
+ mount -t msharefs msharefs /sys/fs/mshare
+
+ 2. mshare regions have alignment and size requirements. Start
+ address for the region must be aligned to an address boundary and
+ be a multiple of fixed size. This alignment and size requirement
+ can be obtained by reading the file ``/sys/fs/mshare/mshare_info``
+ which returns a number in text format. mshare regions must be
+ aligned to this boundary and be a multiple of this size.
+
I don't see why size and alignment need to be taken into consideration by
userspace. You can simply establish a mapping and pad it out.
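For reference, the userspace side of the size/alignment rule as the patch documents it comes down to parsing one number and rounding up. A minimal sketch, assuming mshare_info holds a single decimal number as the documentation says; the helper names parse_mshare_align() and pad_to_align() are made up for illustration and are not part of the patch:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse the text read from mshare_info.
 * Assumes a single decimal number, optionally followed by a newline.
 * Returns the value, or 0 if the text is malformed.
 */
static uint64_t parse_mshare_align(const char *text)
{
	char *end;
	unsigned long long v;

	/* Reject empty input, signs and leading whitespace up front. */
	if (text[0] < '0' || text[0] > '9')
		return 0;

	errno = 0;
	v = strtoull(text, &end, 10);
	if (errno != 0)
		return 0;
	if (*end != '\0' && *end != '\n')
		return 0;

	return (uint64_t)v;
}

/*
 * Hypothetical helper: round len up to the next multiple of align.
 * Does not assume align is a power of two.
 * Returns 0 if align is 0 or the result would overflow.
 */
static uint64_t pad_to_align(uint64_t len, uint64_t align)
{
	uint64_t rem;

	if (align == 0)
		return 0;

	rem = len % align;
	if (rem == 0)
		return len;
	if (len > UINT64_MAX - (align - rem))
		return 0;

	return len + (align - rem);
}
```

A caller would pass pad_to_align(BUF_SIZE, align) to ftruncate() or fallocate() instead of BUF_SIZE. Whether this rounding lives in userspace or in the kernel is exactly the open question above.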
+ 3. For the process creating an mshare region:
+
+ a. Create a file on /sys/fs/mshare, for example::
+
+ fd = open("/sys/fs/mshare/shareme",
+ O_RDWR|O_CREAT|O_EXCL, 0600);
Ok, makes sense.
+
+ b. Establish the size of the region::
+
+ fallocate(fd, 0, 0, BUF_SIZE);
+
+ or::
+
+ ftruncate(fd, BUF_SIZE);
+
Yep.
+ c. Map some memory in the region::
+
+ struct mshare_create mcreate;
+
+ mcreate.region_offset = 0;
+ mcreate.size = BUF_SIZE;
+ mcreate.offset = 0;
+ mcreate.prot = PROT_READ | PROT_WRITE;
+ mcreate.flags = MAP_ANONYMOUS | MAP_SHARED | MAP_FIXED;
+ mcreate.fd = -1;
+
+ ioctl(fd, MSHAREFS_CREATE_MAPPING, &mcreate);
Why?? Do you want to map mappings in msharefs files, that can themselves be
mapped? Why do we need an ioctl here?
Really, this feature seems very overengineered. If you want to go the fs route,
doing a new pseudofs that's just like hugetlb, but without the hugepages, sounds
like a decent idea. Or enhancing tmpfs to actually support this kind of stuff.
Or properly doing a syscall that can try to attach the page-table-sharing
property to random VMAs.
But I'm wholly opposed to the idea of "mapping a file that itself has more
mappings, mappings which you establish using a magic filesystem and ioctls".
I don't remember the history (it's been a while), but there was interest in:
(a) Sharing page tables for smaller files (not just PUD size etc.)
(b) Supporting also ordinary file systems, not just tmpfs
(c) Having a way to update protection of parts of a mapping and
immediately have it visible to everyone mapping that area.
In the past, I raised that some VM use cases around virtio-fs would be
interested in having a "VMA container" that can be updated by the parent
QEMU process, and what gets mapped in there would be immediately visible
to the other processes.
I recall that initially I pushed for just generalizing the support for
shared page tables so it could be used for other file systems. I recall
problems around that, likely around protection changes etc.
So current mshare really is the idea of having a (let's call it) VMA
container that can be mapped into processes where all processes will
observe changes performed by other processes.
I agree that it's complicated, and the semantics are very, very, very weird.
--
Cheers
David / dhildenb