Re: [RFC v3] fuse: use fs-iomap for better performance so we can containerize ext4

Amir Goldstein <amir73il@xxxxxxxxx> · Fri, 18 Jul 2025 13:55:48 +0200

On Fri, Jul 18, 2025 at 10:54 AM Christian Brauner <brauner@xxxxxxxxxx> wrote:
>
> On Thu, Jul 17, 2025 at 04:10:38PM -0700, Darrick J. Wong wrote:
> > Hi everyone,
> >
> > DO NOT MERGE THIS, STILL!
> >
> > This is the third request for comments of a prototype to connect the
> > Linux fuse driver to fs-iomap for regular file IO operations to and from
> > files whose contents persist to locally attached storage devices.
> >
> > Why would you want to do that?  Most filesystem drivers are seriously
> > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > over almost a decade of its existence.  Faulty code can lead to total
> > kernel compromise, and I think there's a very strong incentive to move
> > all that parsing out to userspace where we can containerize the fuse
> > server process.
> >
> > willy's folios conversion project (and to a certain degree RH's new
> > mount API) have also demonstrated that treewide changes to the core
> > mm/pagecache/fs code are very very difficult to pull off and take years
> > because you have to understand every filesystem's bespoke use of that
> > core code.  Eeeugh.
> >
> > The fuse command plumbing is very simple -- the ->iomap_begin,
> > ->iomap_end, and iomap ->ioend calls within iomap are turned into
> > upcalls to the fuse server via a trio of new fuse commands.  Pagecache
> > writeback is now a directio write.  The fuse server is now able to
> > upsert mappings into the kernel for cached access (== zero upcalls for
> > rereads and pure overwrites!) and the iomap cache revalidation code
> > works.
> >
> > With this RFC, I am able to show that it's possible to build a fuse
> > server for a real filesystem (ext4) that runs entirely in userspace yet
> > maintains most of its performance.  At this stage I still get about 95%
> > of the kernel ext4 driver's streaming directio performance, and 110% of
> > its streaming buffered IO performance.  Random buffered
> > IO is about 85% as fast as the kernel.  Random direct IO is about 80% as
> > fast as the kernel; see the cover letter for the fuse2fs iomap changes
> > for more details.  Unwritten extent conversions on random direct writes
> > are especially painful for fuse+iomap (~90% more overhead) due to upcall
> > overhead.  And that's with debugging turned on!
> >
> > These items have been addressed since the first RFC:
> >
> > 1. The iomap cookie validation is now present, which avoids subtle races
> > between pagecache zeroing and writeback on filesystems that support
> > unwritten and delalloc mappings.
> >
> > 2. Mappings can be cached in the kernel for more speed.
> >
> > 3. iomap supports inline data.
> >
> > 4. I can now turn on fuse+iomap on a per-inode basis, which turned out
> > to be as easy as creating a new ->getattr_iflags callback so that the
> > fuse server can set fuse_attr::flags.
> >
> > 5. statx and syncfs work on iomap filesystems.
> >
> > 6. Timestamps and ACLs work the same way they do in ext4/xfs when iomap
> > is enabled.
> >
> > 7. The ext4 shutdown ioctl is now supported.
> >
> > There are some major warts remaining:
> >
> > a. ext4 doesn't support out of place writes so I don't know if that
> > actually works correctly.
> >
> > b. iomap is an inode-based service, not a file-based service.  This
> > means that we /must/ push ext2's inode numbers into the kernel via
> > FUSE_GETATTR so that it can report those same numbers back out through
> > the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> > to index its incore inode, so we have to pass those too so that
> > notifications work properly.  This is related to (c) below:
> >
> > c. Hardlinks and iomap are not possible for upper-level libfuse clients
> > because the upper level libfuse likes to abstract kernel nodeids with
> > its own homebrew dirent/inode cache, which doesn't understand hardlinks.
> > As a result, a hardlinked file results in two distinct struct inodes in
> > the kernel, which completely breaks iomap's locking model.  I will have
> > to rewrite fuse2fs for the lowlevel libfuse library to make this work,
> > but on the plus side there will be far less path lookup overhead.
> >
> > d. There are too many changes to the IO manager in libext2fs because I
> > built things needed to stage the direct/buffered IO paths separately.
> > These are now unnecessary but I haven't pulled them out yet because
> > they're sort of useful to verify that iomap file IO never goes through
> > libext2fs except for inline data.
> >
> > e. If we're going to use fuse servers as "safe" replacements for kernel
> > filesystem drivers, we need to be able to set PF_MEMALLOC_NOFS so that
> > fuse2fs memory allocations (in the kernel) don't push pagecache reclaim.
> > We also need to disable the OOM killer(s) for fuse servers because you
> > don't want filesystems to unmount abruptly.
> >
> > f. How do we maximally contain the fuse server to have safe filesystem
> > mounts?  It's very convenient to use systemd services to configure
> > isolation declaratively, but fuse2fs still needs to be able to open
> > /dev/fuse, the ext4 block device, and call mount() in the shared
> > namespace.  This prevents us from using most of the stronger systemd
>
> I'm happy to help you here.
>
> First, I think using a character device for namespaced drivers is always
> a mistake. FUSE predates all that ofc. They're incredibly terrible for
> delegation because of devtmpfs not being namespaced as well as devices
> in general. And having device nodes on anything other than tmpfs is just
> wrong (TM).
>
> In systemd I ultimately want a bpf LSM program that prevents the
> creation of device nodes outside of tmpfs. They don't belong on
> persistent storage imho. But anyway, that's beside the point.
>
> Opening the block device should be done by systemd-mountfsd but I think
> /dev/fuse should really be openable by the service itself.
>
> So we can try and allowlist /dev/fuse in vfs_mknod() similar to
> whiteouts. That means you can do mknod() in the container to create
> /dev/fuse (Personally, I would even restrict this to tmpfs right off the
> bat so that containers can only do this on their private tmpfs mount at
> /dev.)
>
> The downside of this would be to give unprivileged containers access to
> FUSE by default. I don't think that's a problem per se but it is a uapi
> change.
>
> Let me think a bit about alternatives. I have one crazy idea but I'm not
> sure enough about it to spill it.
>

I don't think there is a hard requirement for the fuse fd to be opened from
a device driver.
With fuse io_uring communication, the open fd doesn't even need to do any IO itself.

> > protections because they tend to run in a private mount namespace with
> > various parts of the filesystem either hidden or readonly.
> >
> > In theory one could design a socket protocol to pass mount options,
> > block device paths, fds, and responsibility for the mount() call between
> > a mount helper and a service:
>
> This isn't a problem really. This should just be an extension to
> systemd-mountfsd.

This is relevant beyond systemd environments as well.

I have been experimenting with this mount helper service to mount a fuse fs
inside an unprivileged Kubernetes container, where opening /dev/fuse
is restricted by LSM policy:

https://github.com/pfnet-research/meta-fuse-csi-plugin?tab=readme-ov-file#fusermount3-proxy-modified-fusermount3-approach
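
For anyone unfamiliar with the handoff such a helper depends on, here is a
minimal sketch of passing an open fd plus mount options between two processes
over a Unix socket using SCM_RIGHTS. This is not the plugin's code: the
function names (send_fd, recv_fd, demo) are hypothetical, and a temporary file
stands in for /dev/fuse so that it runs unprivileged. It assumes Python 3.9+
for socket.send_fds()/socket.recv_fds().

```python
import os
import socket
import tempfile


def send_fd(sock, fd, options):
    """Privileged side: send mount options as payload, fd as SCM_RIGHTS."""
    socket.send_fds(sock, [options], [fd])


def recv_fd(sock, maxlen=4096):
    """Unprivileged side: receive the options and exactly one fd."""
    msg, fds, _flags, _addr = socket.recv_fds(sock, maxlen, 1)
    return msg, fds[0]


def demo():
    # A socketpair stands in for the helper <-> service connection.
    helper, service = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

    # Stand-in for open("/dev/fuse") done by the side that is allowed to.
    with tempfile.TemporaryFile() as f:
        f.write(b"hello")
        f.flush()
        send_fd(helper, f.fileno(), b"rw,nosuid")
    # The sender's fd is closed now; the in-flight copy keeps the file alive.

    opts, fd = recv_fd(service)
    os.lseek(fd, 0, os.SEEK_SET)  # offset is shared with the sender's fd
    data = os.read(fd, 5)

    os.close(fd)
    helper.close()
    service.close()
    return opts, data


if __name__ == "__main__":
    print(demo())
```

The receiver never needs permission to open the device node itself; it only
needs to be reachable over the socket, which is why this pattern fits
environments where the open is blocked by policy.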

Thanks,
Amir.