Re: [RFC PATCH v2 0/4] remove buffer heads from ext2

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Fri, 4 Apr 2025 09:43:22 -0700

On Thu, Mar 27, 2025 at 03:32:42AM -0700, Christoph Hellwig wrote:
> On Tue, Mar 25, 2025 at 06:49:24PM -0700, Catherine Hoang wrote:
> > Hi all,
> > 
> > This series is an effort to begin removing buffer heads from ext2. 
> 
> Why is that desirable?

struct buffer_head is a mismash of things -- originally it was a landing
place for the old buffer cache, right?  So it has the necessary things
like a pointer to a memory page, the disk address, a length, buffer
state flags (uptodate/dirty), and some locks.  For filesystem metadata
blocks I think that's all that most filesystems really need.  Assuming
that filesystems /never/ want overlapping metadata buffers, I think it's
more efficient to look up buffer objects via an rhashtable instead of
walking the address_space xarray to find a folio, and then walking a
linked list from that folio to find the particular bh.

Unfortunately, it also has a bunch of file mapping state information
(e.g. BH_Delalloc) that aren't needed for caching metadata blocks.  All
the confusion that results from the incohesive mixing of these two
usecases goes away by separating out the metadata buffers into a
separate cache and (ha) leaving the filesystems to port the file IO
paths to iomap.

Separating filesystem metadata buffers into a private datastructure
instead of using the blockdev pagecache also closes off an entire class
of attack surface where evil userspace can wait for a filesystem to load
a metadata block into memory and validate it; and then scribble on the
pagecache block to cause the filesystem driver to make the wrong
decisions -- look at all the ext4 metadata_csum bugs where syzkaller
discovered that the decision to call the crc32c driver was gated on a
bit in a bufferhead, and setting that bit having not initialized the
crc32c driver would lead to a kernel crash.  Nowadays we have
CONFIG_BLK_DEV_WRITE_MOUNTED to shut that down, though it defaults to y
and I think that might actually break leased layout things like pnfs.

So the upsides are: faster lookups, a more cohesive data structure that
only tries to do one thing, and closing attack surfaces.

The downsides: this new buffer cache code still needs: an explicit hook
into the dirty pagecache timeout to start its own writeback; to provide
its own shrinker; and some sort of solution for file mapping metadata so
that fsync can flush just those blocks and not the whole cache.

--D