On Thu, Mar 27, 2025 at 03:32:42AM -0700, Christoph Hellwig wrote: > On Tue, Mar 25, 2025 at 06:49:24PM -0700, Catherine Hoang wrote: > > Hi all, > > > > This series is an effort to begin removing buffer heads from ext2. > > Why is that desirable? struct buffer_head is a mismash of things -- originally it was a landing place for the old buffer cache, right? So it has the necessary things like a pointer to a memory page, the disk address, a length, buffer state flags (uptodate/dirty), and some locks. For filesystem metadata blocks I think that's all that most filesystems really need. Assuming that filesystems /never/ want overlapping metadata buffers, I think it's more efficient to look up buffer objects via an rhashtable instead of walking the address_space xarray to find a folio, and then walking a linked list from that folio to find the particular bh. Unfortunately, it also has a bunch of file mapping state information (e.g. BH_Delalloc) that aren't needed for caching metadata blocks. All the confusion that results from the incohesive mixing of these two usecases goes away by separating out the metadata buffers into a separate cache and (ha) leaving the filesystems to port the file IO paths to iomap. Separating filesystem metadata buffers into a private datastructure instead of using the blockdev pagecache also closes off an entire class of attack surface where evil userspace can wait for a filesystem to load a metadata block into memory and validate it; and then scribble on the pagecache block to cause the filesystem driver to make the wrong decisions -- look at all the ext4 metadata_csum bugs where syzkaller discovered that the decision to call the crc32c driver was gated on a bit in a bufferhead, and setting that bit having not initialized the crc32c driver would lead to a kernel crash. Nowadays we have CONFIG_BLK_DEV_WRITE_MOUNTED to shut that down, though it defaults to y and I think that might actually break leased layout things like pnfs. So the upsides are: faster lookups, a more cohesive data structure that only tries to do one thing, and closing attack surfaces. The downsides: this new buffer cache code still needs: an explicit hook into the dirty pagecache timeout to start its own writeback; to provide its own shrinker; and some sort of solution for file mapping metadata so that fsync can flush just those blocks and not the whole cache. --D