Re: [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4

Christian Brauner <brauner@xxxxxxxxxx> · Thu, 31 Jul 2025 12:13:01 +0200

On Wed, Jul 23, 2025 at 11:04:43AM -0700, Darrick J. Wong wrote:
> On Wed, Jul 23, 2025 at 03:05:12PM +0200, Christian Brauner wrote:
> > On Fri, Jul 18, 2025 at 12:31:16PM -0700, Darrick J. Wong wrote:
> > > On Fri, Jul 18, 2025 at 01:55:48PM +0200, Amir Goldstein wrote:
> > > > On Fri, Jul 18, 2025 at 10:54 AM Christian Brauner <brauner@xxxxxxxxxx> wrote:
> > > > >
> > > > > On Thu, Jul 17, 2025 at 04:10:38PM -0700, Darrick J. Wong wrote:
> > > > > > Hi everyone,
> > > > > >
> > > > > > DO NOT MERGE THIS, STILL!
> > > > > >
> > > > > > This is the third request for comments of a prototype to connect the
> > > > > > Linux fuse driver to fs-iomap for regular file IO operations to and from
> > > > > > files whose contents persist to locally attached storage devices.
> > > > > >
> > > > > > Why would you want to do that?  Most filesystem drivers are seriously
> > > > > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > > > > > over almost a decade of its existence.  Faulty code can lead to total
> > > > > > kernel compromise, and I think there's a very strong incentive to move
> > > > > > all that parsing out to userspace where we can containerize the fuse
> > > > > > server process.
> > > > > >
> > > > > > willy's folios conversion project (and to a certain degree RH's new
> > > > > > mount API) have also demonstrated that treewide changes to the core
> > > > > > mm/pagecache/fs code are very very difficult to pull off and take years
> > > > > > because you have to understand every filesystem's bespoke use of that
> > > > > > core code.  Eeeugh.
> > > > > >
> > > > > > The fuse command plumbing is very simple -- the ->iomap_begin,
> > > > > > ->iomap_end, and iomap ->ioend calls within iomap are turned into
> > > > > > upcalls to the fuse server via a trio of new fuse commands.  Pagecache
> > > > > > writeback is now a directio write.  The fuse server is now able to
> > > > > > upsert mappings into the kernel for cached access (== zero upcalls for
> > > > > > rereads and pure overwrites!) and the iomap cache revalidation code
> > > > > > works.
> > > > > >
> > > > > > With this RFC, I am able to show that it's possible to build a fuse
> > > > > > server for a real filesystem (ext4) that runs entirely in userspace yet
> > > > > > > maintains most of its performance.  At this stage I still get about 95%
> > > > > > > of the kernel ext4 driver's streaming directio performance, and 110%
> > > > > > > of its streaming buffered IO performance.  Random buffered
> > > > > > IO is about 85% as fast as the kernel.  Random direct IO is about 80% as
> > > > > > fast as the kernel; see the cover letter for the fuse2fs iomap changes
> > > > > > for more details.  Unwritten extent conversions on random direct writes
> > > > > > are especially painful for fuse+iomap (~90% more overhead) due to upcall
> > > > > > overhead.  And that's with debugging turned on!
> > > > > >
> > > > > > These items have been addressed since the first RFC:
> > > > > >
> > > > > > 1. The iomap cookie validation is now present, which avoids subtle races
> > > > > > between pagecache zeroing and writeback on filesystems that support
> > > > > > unwritten and delalloc mappings.
> > > > > >
> > > > > > 2. Mappings can be cached in the kernel for more speed.
> > > > > >
> > > > > > 3. iomap supports inline data.
> > > > > >
> > > > > > 4. I can now turn on fuse+iomap on a per-inode basis, which turned out
> > > > > > to be as easy as creating a new ->getattr_iflags callback so that the
> > > > > > fuse server can set fuse_attr::flags.
> > > > > >
> > > > > > 5. statx and syncfs work on iomap filesystems.
> > > > > >
> > > > > > 6. Timestamps and ACLs work the same way they do in ext4/xfs when iomap
> > > > > > is enabled.
> > > > > >
> > > > > > 7. The ext4 shutdown ioctl is now supported.
> > > > > >
> > > > > > There are some major warts remaining:
> > > > > >
> > > > > > a. ext4 doesn't support out of place writes so I don't know if that
> > > > > > actually works correctly.
> > > > > >
> > > > > > b. iomap is an inode-based service, not a file-based service.  This
> > > > > > means that we /must/ push ext2's inode numbers into the kernel via
> > > > > > FUSE_GETATTR so that it can report those same numbers back out through
> > > > > > the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> > > > > > to index its incore inode, so we have to pass those too so that
> > > > > > > notifications work properly.  This is related to item c below:
> > > > > >
> > > > > > c. Hardlinks and iomap are not possible for upper-level libfuse clients
> > > > > > because the upper level libfuse likes to abstract kernel nodeids with
> > > > > > its own homebrew dirent/inode cache, which doesn't understand hardlinks.
> > > > > > As a result, a hardlinked file results in two distinct struct inodes in
> > > > > > the kernel, which completely breaks iomap's locking model.  I will have
> > > > > > to rewrite fuse2fs for the lowlevel libfuse library to make this work,
> > > > > > but on the plus side there will be far less path lookup overhead.
> > > > > >
> > > > > > d. There are too many changes to the IO manager in libext2fs because I
> > > > > > built things needed to stage the direct/buffered IO paths separately.
> > > > > > These are now unnecessary but I haven't pulled them out yet because
> > > > > > they're sort of useful to verify that iomap file IO never goes through
> > > > > > libext2fs except for inline data.
> > > > > >
> > > > > > e. If we're going to use fuse servers as "safe" replacements for kernel
> > > > > > filesystem drivers, we need to be able to set PF_MEMALLOC_NOFS so that
> > > > > > fuse2fs memory allocations (in the kernel) don't push pagecache reclaim.
> > > > > > We also need to disable the OOM killer(s) for fuse servers because you
> > > > > > don't want filesystems to unmount abruptly.
> > > > > >
> > > > > > f. How do we maximally contain the fuse server to have safe filesystem
> > > > > > mounts?  It's very convenient to use systemd services to configure
> > > > > > isolation declaratively, but fuse2fs still needs to be able to open
> > > > > > /dev/fuse, the ext4 block device, and call mount() in the shared
> > > > > > namespace.  This prevents us from using most of the stronger systemd
> > > > >
> > > > > I'm happy to help you here.
> > > > >
> > > > > First, I think using a character device for namespaced drivers is always
> > > > > a mistake. FUSE predates all that ofc. They're incredibly terrible for
> > > > > delegation because of devtmpfs not being namespaced as well as devices
> > > > > in general. And having device nodes on anything other than tmpfs is just
> > > > > wrong (TM).
> > > > >
> > > > > In systemd I ultimately want a bpf LSM program that prevents the
> > > > > creation of device nodes outside of tmpfs. They don't belong on
> > > > > persistent storage imho. But anyway, that's beside the point.
> > > > >
> > > > > Opening the block device should be done by systemd-mountfsd but I think
> > > > > /dev/fuse should really be openable by the service itself.
> > > 
> > > /me slaps his head and remembers that fsopen/fsconfig/fsmount exist.
> > > Can you pass an fsopen fd to an unprivileged process and have that
> > > second process call fsmount?
> > 
> > Yes, but remember that at some point you must call
> > fsconfig(FSCONFIG_CMD_CREATE) to create the superblock. On block based
> > fses that requires CAP_SYS_ADMIN so that has to be done by the
> > privileged process. All the rest can be done by the unprivileged process
> > though. That's exactly how bpf tokens work.
> 
> Hrm.  Assuming the fsopen mount sequence is still:
> 
> 	sfd = fsopen("ext4", FSOPEN_CLOEXEC);
> 	fsconfig(sfd, FSCONFIG_SET_FLAG, "ro", NULL, 0);
> 	...
> 	fsconfig(sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
> 	mfd = fsmount(sfd, FSMOUNT_CLOEXEC, MOUNT_ATTR_RELATIME);
> 	move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
> 
> Then I guess whoever calls fsconfig(FSCONFIG_CMD_CREATE) needs
> CAP_SYS_ADMIN; and they have to be running in the desired fs namespace
> for move_mount() to have the intended effect.

Yes-ish.

At fsopen() time the user namespace of the caller is recorded in
fs_context->user_ns. If the filesystem is mountable inside of a user
namespace then fs_context->user_ns will be used to perform the
CAP_SYS_ADMIN check.

For filesystems that aren't mountable inside of user namespaces (ext4,
xfs, ...) the fs_context->user_ns is ignored in mount_capable() and
global CAP_SYS_ADMIN is required. sget_fc() and friends flat out refuse
to mount a filesystem with a non-initial userns if it's not marked as
mountable. That used to be possible but it's an invitation for extremely
subtle bugs and you gain control over the superblock itself.

TL;DR the user namespace the superblock belongs to is usually determined
at fsopen() time.
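
To make the rule above concrete, here is a toy userspace model of the
decision. It is not the kernel code: every struct, field, and function
name below is a hypothetical stand-in, and the "userns_mountable" bool
only plays the role of the per-filesystem "mountable in a userns" marking
(FS_USERNS_MOUNT in the kernel).

```c
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; not the kernel's types. */
struct model_userns {
	bool has_cap_sys_admin;	/* caller holds CAP_SYS_ADMIN in this userns */
};

struct model_fs_type {
	bool userns_mountable;	/* fs is marked as mountable inside a userns */
};

struct model_fs_context {
	const struct model_fs_type *fs_type;
	const struct model_userns *user_ns;	/* recorded at fsopen() time */
};

/*
 * Model of the capability check: filesystems that are not marked as
 * userns-mountable ignore the recorded userns and demand global
 * CAP_SYS_ADMIN; the others are checked against the recorded userns.
 */
static bool model_mount_capable(const struct model_fs_context *fc,
				bool has_global_cap_sys_admin)
{
	if (!fc->fs_type->userns_mountable)
		return has_global_cap_sys_admin;
	return fc->user_ns->has_cap_sys_admin;
}
```

With this model, an ext4-like type (userns_mountable == false) only
passes when the caller has global CAP_SYS_ADMIN, no matter which userns
was recorded at fsopen() time.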

> 
> Can two processes share the same fsopen fd?  If so then systemd-mountfsd

Yes, they can share and it's synchronized.

> could pass the fsopen fd to the fuse server (whilst retaining its own
> copy).  The fuse server could do its own mount option parsing, call

Yes, systemd-mountfsd already does passing like that.

> FSCONFIG_SET_* on the fd, and then signal back to systemd-mountfsd to do
> the create/fsmount/move_mount part.

Yes.
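
For reference, the fd handoff itself is plain SCM_RIGHTS passing over an
AF_UNIX socket. A minimal sketch, assuming the two sides already share a
connected socket; send_fd() and recv_fd() are hypothetical helper names,
and this is not how systemd-mountfsd actually implements it:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send one open fd over a connected AF_UNIX socket.  Returns 0 or -1. */
static int send_fd(int sock, int fd)
{
	char byte = 0;	/* one byte of real data carries the ancillary data */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* forces correct alignment of buf */
	} u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd sent by send_fd().  Returns the new fd or -1. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;
	int fd;

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The receiver gets a new fd number that refers to the same open file
description, which is why both sides can keep operating on the same
fs_context after the handoff.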

> 
> The systemd-mountfsd would have to be running in desired fs namespace
> and with sufficient privileges to open block devices, but I'm guessing
> that's already a requirement?

Yes, systemd-mountfsd is a system level service running in the initial
set of namespaces and interacting with systemd-nsresourced (namespace
related stuff). It can obviously also create helpers that setns() into
various namespaces if required.

> 
> > > If so, then it would be more convenient if mount.safe/systemd-mountfsd
> > > could pass open fds for /dev/fuse and fsopen, then the fuse server wouldn't

Yes, I would think so.

> > 
> > Yes, that would work.
> 
> Oh goody :)
> 
> > > need any special /dev access at all.  I think then the fuse server's
> > > service could have:
> > > 
> > > DynamicUser=true
> > > ProtectSystem=true
> > > ProtectHome=true
> > > PrivateTmp=true
> > > PrivateDevices=true
> > > DevicePolicy=strict
> > > 
> > > (I think most of those are redundant with DynamicUser=true but a lot of
> > > my systemd-fu is paged out ATM.)
> > > 
> > > My goal here is extreme containment -- the code doing the fs metadata
> > > parsing has no privileges, no write access except to the fds it was
> > > given, no network access, and no ability to read anything outside the
> > > root filesystem.  Then I can get back to writing buffer
> > > overflows^W^Whigh quality filesystem code in peace.
> > 
> > Yeah, sounds about right.
> > 
> > > 
> > > > > So we can try and allowlist /dev/fuse in vfs_mknod() similar to
> > > > > whiteouts. That means you can do mknod() in the container to create
> > > > > /dev/fuse (Personally, I would even restrict this to tmpfs right off the
> > > > > bat so that containers can only do this on their private tmpfs mount at
> > > > > /dev.)
> > > > >
> > > > > The downside of this would be to give unprivileged containers access to
> > > > > FUSE by default. I don't think that's a problem per se but it is a uapi
> > > > > change.
> > > 
> > > Yeah, that is a new risk.  It's still better than metadata parsing
> > > within the kernel address space ... though who knows how thoroughly fuse
> > > has been fuzzed by syzbot :P
> > > 
> > > > > Let me think a bit about alternatives. I have one crazy idea but I'm not
> > > > > sure enough about it to spill it.
> > > 
> > > Please do share, #f is my crazy unbaked idea. :)
> > > 
> > > > I don't think there is a hard requirement for the fuse fd to be opened from
> > > > a device driver.
> > > > With fuse io_uring communication, the open fd doesn't even need to do io.
> > > > 
> > > > > > protections because they tend to run in a private mount namespace with
> > > > > > various parts of the filesystem either hidden or readonly.
> > > > > >
> > > > > > In theory one could design a socket protocol to pass mount options,
> > > > > > block device paths, fds, and responsibility for the mount() call between
> > > > > > a mount helper and a service:
> > > > >
> > > > > This isn't a problem really. This should just be an extension to
> > > > > systemd-mountfsd.
> > > 
> > > I suppose mount.safe could very well call systemd-mount to go do all the
> > > systemd-related service setup, and that would take care of udisks as
> > > well.
> > 
> > The ultimate goal is to teach mount(8)/libmount to use that daemon when
> > it's available. Because that would just make unprivileged mounting work
> > without userspace noticing anything.
> 
> That sounds really neat. :)
> 
> --D