On Fri, Jul 18, 2025 at 10:54 AM Christian Brauner <brauner@xxxxxxxxxx> wrote:
>
> On Thu, Jul 17, 2025 at 04:10:38PM -0700, Darrick J. Wong wrote:
> > Hi everyone,
> >
> > DO NOT MERGE THIS, STILL!
> >
> > This is the third request for comments of a prototype to connect the
> > Linux fuse driver to fs-iomap for regular file IO operations to and from
> > files whose contents persist to locally attached storage devices.
> >
> > Why would you want to do that? Most filesystem drivers are seriously
> > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > over almost a decade of its existence. Faulty code can lead to total
> > kernel compromise, and I think there's a very strong incentive to move
> > all that parsing out to userspace where we can containerize the fuse
> > server process.
> >
> > willy's folios conversion project (and to a certain degree RH's new
> > mount API) have also demonstrated that treewide changes to the core
> > mm/pagecache/fs code are very very difficult to pull off and take years
> > because you have to understand every filesystem's bespoke use of that
> > core code. Eeeugh.
> >
> > The fuse command plumbing is very simple -- the ->iomap_begin,
> > ->iomap_end, and iomap ->ioend calls within iomap are turned into
> > upcalls to the fuse server via a trio of new fuse commands. Pagecache
> > writeback is now a directio write. The fuse server is now able to
> > upsert mappings into the kernel for cached access (== zero upcalls for
> > rereads and pure overwrites!) and the iomap cache revalidation code
> > works.
> >
> > With this RFC, I am able to show that it's possible to build a fuse
> > server for a real filesystem (ext4) that runs entirely in userspace yet
> > maintains most of its performance. At this stage I still get about 95%
> > of the kernel ext4 driver's streaming directio performance on streaming
> > IO, and 110% of its streaming buffered IO performance.
> > Random buffered IO is about 85% as fast as the kernel. Random direct IO
> > is about 80% as fast as the kernel; see the cover letter for the fuse2fs
> > iomap changes for more details. Unwritten extent conversions on random
> > direct writes are especially painful for fuse+iomap (~90% more overhead)
> > due to upcall overhead. And that's with debugging turned on!
> >
> > These items have been addressed since the first RFC:
> >
> > 1. The iomap cookie validation is now present, which avoids subtle races
> >    between pagecache zeroing and writeback on filesystems that support
> >    unwritten and delalloc mappings.
> >
> > 2. Mappings can be cached in the kernel for more speed.
> >
> > 3. iomap supports inline data.
> >
> > 4. I can now turn on fuse+iomap on a per-inode basis, which turned out
> >    to be as easy as creating a new ->getattr_iflags callback so that the
> >    fuse server can set fuse_attr::flags.
> >
> > 5. statx and syncfs work on iomap filesystems.
> >
> > 6. Timestamps and ACLs work the same way they do in ext4/xfs when iomap
> >    is enabled.
> >
> > 7. The ext4 shutdown ioctl is now supported.
> >
> > There are some major warts remaining:
> >
> > a. ext4 doesn't support out of place writes so I don't know if that
> >    actually works correctly.
> >
> > b. iomap is an inode-based service, not a file-based service. This
> >    means that we /must/ push ext2's inode numbers into the kernel via
> >    FUSE_GETATTR so that it can report those same numbers back out through
> >    the FUSE_IOMAP_* calls. However, the fuse kernel uses a separate nodeid
> >    to index its incore inode, so we have to pass those too so that
> >    notifications work properly. This is related to #3 below:
> >
> > c. Hardlinks and iomap are not possible for upper-level libfuse clients
> >    because the upper level libfuse likes to abstract kernel nodeids with
> >    its own homebrew dirent/inode cache, which doesn't understand hardlinks.
> >    As a result, a hardlinked file results in two distinct struct inodes in
> >    the kernel, which completely breaks iomap's locking model. I will have
> >    to rewrite fuse2fs for the lowlevel libfuse library to make this work,
> >    but on the plus side there will be far less path lookup overhead.
> >
> > d. There are too many changes to the IO manager in libext2fs because I
> >    built things needed to stage the direct/buffered IO paths separately.
> >    These are now unnecessary but I haven't pulled them out yet because
> >    they're sort of useful to verify that iomap file IO never goes through
> >    libext2fs except for inline data.
> >
> > e. If we're going to use fuse servers as "safe" replacements for kernel
> >    filesystem drivers, we need to be able to set PF_MEMALLOC_NOFS so that
> >    fuse2fs memory allocations (in the kernel) don't push pagecache reclaim.
> >    We also need to disable the OOM killer(s) for fuse servers because you
> >    don't want filesystems to unmount abruptly.
> >
> > f. How do we maximally contain the fuse server to have safe filesystem
> >    mounts? It's very convenient to use systemd services to configure
> >    isolation declaratively, but fuse2fs still needs to be able to open
> >    /dev/fuse, the ext4 block device, and call mount() in the shared
> >    namespace. This prevents us from using most of the stronger systemd
>
> I'm happy to help you here.
>
> First, I think using a character device for namespaced drivers is always
> a mistake. FUSE predates all that ofc. They're incredibly terrible for
> delegation because of devtmpfs not being namespaced as well as devices
> in general. And having device nodes on anything other than tmpfs is just
> wrong (TM).
>
> In systemd I ultimately want a bpf LSM program that prevents the
> creation of device nodes outside of tmpfs. They don't belong on
> persistent storage imho. But anyway, that's besides the point.
>
> Opening the block device should be done by systemd-mountfsd but I think
> /dev/fuse should really be openable by the service itself.
>
> So we can try and allowlist /dev/fuse in vfs_mknod() similar to
> whiteouts. That means you can do mknod() in the container to create
> /dev/fuse (Personally, I would even restrict this to tmpfs right off the
> bat so that containers can only do this on their private tmpfs mount at
> /dev.)
>
> The downside of this would be to give unprivileged containers access to
> FUSE by default. I don't think that's a problem per se but it is a uapi
> change.
>
> Let me think a bit about alternatives. I have one crazy idea but I'm not
> sure enough about it to spill it.
>

I don't think there is a hard requirement for the fuse fd to be opened
from a device driver. With fuse io_uring communication, the open fd
doesn't even need to do IO.

> > protections because they tend to run in a private mount namespace with
> > various parts of the filesystem either hidden or readonly.
> >
> > In theory one could design a socket protocol to pass mount options,
> > block device paths, fds, and responsibility for the mount() call between
> > a mount helper and a service:
>
> This isn't a problem really. This should just be an extension to
> systemd-mountfsd.

This is relevant not only to systemd environments.

I have been experimenting with this mount helper service to mount a fuse
fs inside an unprivileged kubernetes container, where opening of
/dev/fuse is restricted by LSM policy:

https://github.com/pfnet-research/meta-fuse-csi-plugin?tab=readme-ov-file#fusermount3-proxy-modified-fusermount3-approach

Thanks,
Amir.
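
For illustration, the fd handoff that a mount helper and a sandboxed
service would rely on can be sketched with SCM_RIGHTS over a Unix socket.
This is a minimal sketch under stated assumptions, not the protocol used
by systemd-mountfsd or by the project linked above: the function names
and the JSON option payload are hypothetical, and a temporary file stands
in for /dev/fuse or the block device so that the example runs without
privileges.

```python
import json
import os
import socket
import tempfile


def send_mount_request(sock, fd, options):
    """Helper side (hypothetical): pass an already-open fd plus mount options."""
    payload = json.dumps(options).encode()
    # The fd travels as SCM_RIGHTS ancillary data next to the payload bytes.
    socket.send_fds(sock, [payload], [fd])


def recv_mount_request(sock, bufsize=4096):
    """Service side (hypothetical): receive the options and a duplicate fd."""
    payload, fds, _flags, _addr = socket.recv_fds(sock, bufsize, 1)
    if len(fds) != 1:
        raise RuntimeError("expected exactly one fd")
    return fds[0], json.loads(payload.decode())


def demo():
    # Assumption: a temporary file stands in for /dev/fuse or the block device.
    with tempfile.TemporaryFile() as stand_in:
        stand_in.write(b"superblock")
        stand_in.flush()
        helper, service = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
        with helper, service:
            send_mount_request(helper, stand_in.fileno(),
                               {"ro": True, "source": "stand-in"})
            fd, options = recv_mount_request(service)
        try:
            # The received fd refers to the same open file as the helper's fd.
            data = os.pread(fd, 10, 0)
        finally:
            os.close(fd)
    return data, options
```

In this sketch the service never opens the device itself; it only uses
the descriptor it was handed, which is the property that lets it run with
/dev and the block device hidden from its namespace. Who ends up calling
mount() is left out here on purpose.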