Subject: famfs: port into fuse

This is the initial RFC for the fabric-attached memory file system (famfs)
integration into fuse. In order to function, this requires a related patch
to libfuse [1] and the famfs user space [2].

This RFC is mainly intended to socialize the approach and get feedback from
the fuse developers and maintainers. There is some dax work that needs to be
done before this should be merged (see the "poisoned page|folio problem"
below).

This patch set fully works with Linux 6.14 -- passing all existing famfs
smoke and unit tests -- and I encourage existing famfs users to test it.

This is really two patch sets mashed up:

* The patches with the dev_dax_iomap: prefix fill in missing functionality
  for devdax to host an fs-dax file system.
* The famfs_fuse: patches add famfs into fs/fuse/. These are effectively
  unchanged since last year.

Because this is not ready to merge yet, I have felt free to leave some debug
prints in place because we still find them useful; those will be cleaned up
in a subsequent revision.

Famfs Overview

Famfs exposes shared memory as a file system. Famfs consumes shared memory
from dax devices, and provides memory-mappable files that map directly to
the memory - no page cache involvement. Famfs differs from conventional
file systems in fs-dax mode, in that it handles in-memory metadata in a
sharable way (which begins with never caching dirty shared metadata).

Famfs started as a standalone file system [3,4], but the consensus at LSFMM
2024 [5] was that it should be ported into fuse - and this RFC is the first
public evidence that I've been working on that.

The key performance requirement is that famfs must resolve mapping faults
without upcalls. This is achieved by fully caching the file-to-devdax
metadata for all active files. This is done via two fuse client/server
message/response pairs: GET_FMAP and GET_DAXDEV.
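
To make the no-upcall requirement concrete, here is a minimal userspace
sketch of how a cached file-to-devdax map could resolve a file offset
without asking the fuse server. Everything in it is an assumption made for
illustration: the toy_* types and toy_fmap_resolve() are hypothetical, the
extent layout is a simple list in file order, and none of this is the
actual famfs_kfmap.h code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration; not the structures in famfs_kfmap.h. */
struct toy_extent {
	uint32_t dev_index;   /* which daxdev backs this extent */
	uint64_t dev_offset;  /* byte offset of the extent within that daxdev */
	uint64_t len;         /* extent length in bytes */
};

struct toy_fmap {
	size_t nextents;
	const struct toy_extent *extents;  /* in file order, starting at offset 0 */
};

struct toy_target {
	uint32_t dev_index;
	uint64_t dev_offset;
	uint64_t contig_len;  /* bytes left before the containing extent ends */
};

/*
 * Resolve a file offset using only the cached map (no upcall).
 * Returns 0 and fills *out on success, or -1 if the offset lies
 * beyond the mapped extents.
 */
int toy_fmap_resolve(const struct toy_fmap *fm, uint64_t file_off,
		     struct toy_target *out)
{
	uint64_t ext_start = 0;  /* file offset where the current extent begins */

	assert(fm != NULL && out != NULL);

	for (size_t i = 0; i < fm->nextents; i++) {
		const struct toy_extent *e = &fm->extents[i];
		uint64_t delta = file_off - ext_start;  /* ext_start <= file_off here */

		if (delta < e->len) {
			out->dev_index = e->dev_index;
			out->dev_offset = e->dev_offset + delta;
			out->contig_len = e->len - delta;
			return 0;
		}
		ext_start += e->len;
	}
	return -1;
}
```

With a two-extent map (2 MiB at offset 0x200000 on daxdev 0, then 4 MiB at
offset 0 on daxdev 1), file offset 0x1000 resolves to daxdev 0 at 0x201000,
and file offset 0x200000 resolves to the start of daxdev 1. The point of the
sketch is only that the lookup touches nothing but memory already held by
the kernel side.
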

Famfs remains the first fs-dax file system that is backed by devdax rather
than pmem in fs-dax mode (hence the need for the dev_dax_iomap fixups).

Notes

* Once the dev_dax_iomap patches land, I suspect it may make sense for
  virtiofs to update to use the improved interface.
* I'm currently maintaining compatibility between the famfs user space and
  both the standalone famfs kernel file system and this new fuse
  implementation. In the near future I'll be running performance comparisons
  and sharing them - but there is no reason to expect significant
  degradation with fuse, since famfs caches entire "fmaps" in the kernel to
  resolve faults with no upcalls. This patch has a bit too much debug turned
  on to do that testing quite yet. A branch
* Two new fuse messages / responses are added: GET_FMAP and GET_DAXDEV.
* When a file is looked up in a famfs mount, the LOOKUP is followed by a
  GET_FMAP message and response. The "fmap" is the full file-to-dax mapping,
  allowing the fuse/famfs kernel code to handle read/write/fault without any
  upcalls.
* After each GET_FMAP, the fmap is checked for extents that reference
  previously-unknown daxdevs. Each such occurrence is handled with a
  GET_DAXDEV message and response.
* Daxdevs are stored in a table (which might become an xarray at some
  point). When entries are added to the table, we acquire exclusive access
  to the daxdev via the fs_dax_get() call (modeled after how fs-dax handles
  this with pmem devices). famfs provides holder_operations to devdax,
  providing a notification path in the event of memory errors.
* If devdax notifies famfs of memory errors on a dax device, famfs currently
  blocks all subsequent accesses to data on that device. The recovery is to
  re-initialize the memory and file system. Famfs is memory, not storage...
* Because famfs uses backing (devdax) devices, only privileged mounts are
  supported.
* The famfs kernel code never accesses the memory directly - it only
  facilitates read, write and mmap on behalf of user processes. As such,
  the RAS of the shared memory affects applications, but not the kernel.
* Famfs has backing device(s), but they are devdax (char) rather than
  block. Right now there is no way to tell the vfs layer that famfs has a
  char backing device (unless we say it's block, but it's not). Currently
  we use the standard anonymous fuse fs_type - but I'm not sure that's
  ultimately optimal (thoughts?)

The "poisoned page|folio problem"

* Background: before doing a kernel mount, the famfs user space [2]
  validates the superblock and log. This is done via raw mmap of the primary
  devdax device. If valid, the file system is mounted, and the superblock
  and log get exposed through a pair of files (.meta/.superblock and
  .meta/.log) - because we can't be using raw device mmap when a file system
  is mounted on the device. But this exposes a devdax bug and warning...
* Pages that have been memory mapped via devdax are left in a permanently
  problematic state. Devdax sets page|folio->mapping when a page is accessed
  via raw devdax mmap (as famfs does before mount), but never cleans it up.
  When the pages of the famfs superblock and log are accessed via the "meta"
  files after mount, we see a WARN_ONCE() in dax_insert_entry(), which
  notices that page|folio->mapping is still set. I intend to address this
  prior to asking for the famfs patches to be merged.
* Alistair Popple's recent dax patch series [6], which has been merged for
  6.15, addresses some dax issues, but sadly does not fix the poisoned
  page|folio problem - its enhanced refcount checking turns the warning into
  an error.
* This 6.14 patch set disables the warning; a proper fix will be required
  for famfs to work at all in 6.15. Dan W. and I are actively discussing how
  to do this properly...
* In terms of the correct functionality of famfs, the warning can be
  ignored.
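
The sequence behind the poisoned page|folio warning can be modeled in a few
lines of userspace C. This is a toy model only: struct toy_page and the
toy_* functions are hypothetical stand-ins, not kernel code, and the only
behavior they encode is what is described above (raw devdax mmap sets
->mapping, nothing clears it, and the later fs-dax insert notices). That the
insert overwrites the stale value is an assumption of the model.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in; the real field lives in struct page / struct folio. */
struct toy_page {
	const void *mapping;  /* owner of the current mapping, NULL if none */
};

/* Models a raw devdax mmap fault: the page is tagged with the devdax mapping. */
void toy_devdax_fault(struct toy_page *pg, const void *devdax_mapping)
{
	pg->mapping = devdax_mapping;
}

/* Models the devdax unmap path as described above: nothing is cleared. */
void toy_devdax_unmap(struct toy_page *pg)
{
	(void)pg;  /* the stale pg->mapping is deliberately left in place */
}

/*
 * Models the check in the fs-dax insert path: a page whose mapping is
 * still set gets flagged. Returns true if the warning would fire.
 * The model then installs the new mapping (an assumption of this sketch).
 */
bool toy_fsdax_insert(struct toy_page *pg, const void *file_mapping)
{
	bool stale = pg->mapping != NULL;

	pg->mapping = file_mapping;
	return stale;
}
```

In this model a page that was never touched through raw devdax mmap inserts
cleanly, while a page that went through toy_devdax_fault() and
toy_devdax_unmap() first trips the check, which mirrors the
validate-then-mount order that famfs user space follows.
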

References

[1] - https://github.com/libfuse/libfuse/pull/1200
[2] - https://github.com/cxl-micron-reskit/famfs
[3] - https://lore.kernel.org/linux-cxl/cover.1708709155.git.john@xxxxxxxxxx/
[4] - https://lore.kernel.org/linux-cxl/cover.1714409084.git.john@xxxxxxxxxx/
[5] - https://lwn.net/Articles/983105/
[6] - https://lore.kernel.org/linux-cxl/cover.8068ad144a7eea4a813670301f4d2a86a8e68ec4.1740713401.git-series.apopple@xxxxxxxxxx/

John Groves (19):
  dev_dax_iomap: Move dax_pgoff_to_phys() from device.c to bus.c
  dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage
  dev_dax_iomap: Save the kva from memremap
  dev_dax_iomap: Add dax_operations for use by fs-dax on devdax
  dev_dax_iomap: export dax_dev_get()
  dev_dax_iomap: (ignore!) Drop poisoned page warning in fs/dax.c
  famfs_fuse: magic.h: Add famfs magic numbers
  famfs_fuse: Kconfig
  famfs_fuse: Update macro s/FUSE_IS_DAX/FUSE_IS_VIRTIO_DAX/
  famfs_fuse: Basic fuse kernel ABI enablement for famfs
  famfs_fuse: Basic famfs mount opts
  famfs_fuse: Plumb the GET_FMAP message/response
  famfs_fuse: Create files with famfs fmaps
  famfs_fuse: GET_DAXDEV message and daxdev_table
  famfs_fuse: Plumb dax iomap and fuse read/write/mmap
  famfs_fuse: Add holder_operations for dax notify_failure()
  famfs_fuse: Add famfs metadata documentation
  famfs_fuse: Add documentation
  famfs_fuse: (ignore) debug cruft

 Documentation/filesystems/famfs.rst |  142 ++++
 Documentation/filesystems/index.rst |    1 +
 MAINTAINERS                         |   10 +
 drivers/dax/Kconfig                 |    6 +
 drivers/dax/bus.c                   |  144 +++-
 drivers/dax/dax-private.h           |    1 +
 drivers/dax/device.c                |   38 +-
 drivers/dax/super.c                 |   33 +-
 fs/dax.c                            |    1 -
 fs/fuse/Kconfig                     |   13 +
 fs/fuse/Makefile                    |    4 +-
 fs/fuse/dev.c                       |   61 ++
 fs/fuse/dir.c                       |   74 +-
 fs/fuse/famfs.c                     | 1105 +++++++++++++++++++++++++++
 fs/fuse/famfs_kfmap.h               |  166 ++++
 fs/fuse/file.c                      |   27 +-
 fs/fuse/fuse_i.h                    |   67 +-
 fs/fuse/inode.c                     |   49 +-
 fs/fuse/iomode.c                    |    2 +-
 fs/namei.c                          |    1 +
 include/linux/dax.h                 |    6 +
 include/uapi/linux/fuse.h           |   63 ++
 include/uapi/linux/magic.h          |    2 +
 23 files changed, 1973 insertions(+), 43 deletions(-)
 create mode 100644 Documentation/filesystems/famfs.rst
 create mode 100644 fs/fuse/famfs.c
 create mode 100644 fs/fuse/famfs_kfmap.h

base-commit: 38fec10eb60d687e30c8c6b5420d86e8149f7557
-- 
2.49.0