Hi everyone,

Do not merge this, still!!

This is the fourth request for comments of a prototype to connect the
Linux fuse driver to fs-iomap for regular file IO operations to and from
files whose contents persist to locally attached storage devices.

Why would you want to do that? Most filesystem drivers are seriously
vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
over almost a decade of its existence. Faulty code can lead to total
kernel compromise, and I think there's a very strong incentive to move
all that parsing out to userspace where we can containerize the fuse
server process.

willy's folios conversion project (and to a certain degree RH's new
mount API) have also demonstrated that treewide changes to the core
mm/pagecache/fs code are very very difficult to pull off and take years
because you have to understand every filesystem's bespoke use of that
core code. Eeeugh.

The fuse command plumbing is very simple -- the ->iomap_begin,
->iomap_end, and iomap ->ioend calls within iomap are turned into
upcalls to the fuse server via a trio of new fuse commands. Pagecache
writeback is now a directio write. The fuse server is now able to
upsert mappings into the kernel for cached access (== zero upcalls for
rereads and pure overwrites!) and the iomap cache revalidation code
works.

With this RFC, I am able to show that it's possible to build a fuse
server for a real filesystem (ext4) that runs entirely in userspace yet
maintains most of its performance.

At this stage I still get about 95% of the kernel ext4 driver's
streaming directio performance, and 110% of its streaming buffered IO
performance. Random buffered IO is about 85% as fast as the kernel.
Random direct IO is about 80% as fast as the kernel; see the cover
letter for the fuse2fs iomap changes for more details. Unwritten extent
conversions on random direct writes are especially painful for
fuse+iomap (~90% more overhead) due to upcall overhead.
And that's with (now dynamic) debugging turned on!

These items have been addressed since the third RFC:

1. fuse2fs has been forked into fuse4fs, which now talks to the low
   level fuse interface. This avoids all the path walking that the high
   level fuse library provides, which dramatically improves the
   performance of fuse4fs. fstests runs in half the time now. Many
   thanks to Amir Goldstein for giving me a rough draft of the
   conversion!

2. I simplified the configuration protocols -- now there's a per-fs bit
   to enable any iomap, and a per-inode bit to enable iomap on a
   specific file. Registration of iomap devices now uses the backing fd
   registration interface.

3. You can now specify the root nodeid for any fuse mount.

4. Atomic writes are working, at least for single fsblocks.

5. I've ported the cache implementation from xfsprogs to e2fsprogs
   libsupport, so the inode and buffer caches can now dynamically grow
   to support larger working sets. No more fixed-size caches!

6. Cleaned up the kernel/libfuse ABI quite a bit.

7. fstests passes 97% of the tests that run, when iomap is enabled!
   Only 93% pass when iomap is disabled, and I think that's due to some
   bugs in the ACL and mode handling code.

There are some major warts remaining:

a. I've a /much/ clearer picture of how one might containerize a
   filesystem server, thanks to a lot of input from Christian Brauner
   in response to v3. I think I have enough pieces to try setting up a
   fd-passing interface into a systemd service ... but I haven't
   actually written any of it yet.

b. fsdax isn't implemented. I think I'm going to work on this for RFC
   v5 to see if we can simplify the file mapping handling in famfs. If
   not, then everyone else gets fsdax for free.

c. ext4 doesn't support out of place writes so I don't know if that
   actually works correctly.

d. I've not yet consolidated struct fuse_inode, so the iomap gunk still
   eats rather a lot of space per inode.

e. fuse2fs doesn't support the ext4 journal. Urk.

f.
   There's a VERY large quantity of fuse2fs improvements that need to be
   applied before we get to the fuse-iomap parts. I'm not sending these
   (or the fstests changes) to keep the size of the patchbomb at
   "unreasonably large". :P

I'll work on these in August/September, but for now here's an
unmergeable RFC to start some discussion.

--Darrick