On Mon, Jan 06, 2025 at 07:03:03PM +0100, Jan Kara wrote: > On Tue 29-10-24 13:45:01, Catherine Hoang wrote: > > This patch removes the use of buffer heads from the quota read and > > write paths. To do so, we implement a new buffer cache using an > > rhashtable. Each buffer stores data from an associated block, and > > can be read or written to as needed. > > > > Ultimately, we want to completely remove buffer heads from ext2. > > This patch serves as an example than can be applied to other parts > > of the filesystem. > > > > Signed-off-by: Catherine Hoang <catherine.hoang@xxxxxxxxxx> > > Thanks for the patch and sorry for the delay. Overall I'd say this is a bit > too rudimentary :). See below. Sorry myself for not noticing this until now. :/ I'll snip out pieces to keep this reply short. > > +static void buf_write_end_io(struct bio *bio) > > +{ > > + struct ext2_buffer *buf = bio->bi_private; > > + > > + clear_bit(EXT2_BUF_DIRTY_BIT, &buf->b_flags); > > Clearing the bit after IO has finished is too late. There can be > modifications done to the buffer while the IO is running and this way you > could clear the bit while not all changes in the buffer are written out. (This seems to have been changed to a completion in V2) > > +int sync_buffers(struct super_block *sb) > > +{ > > + struct ext2_sb_info *sbi = EXT2_SB(sb); > > + struct rhashtable *buffer_cache = &sbi->buffer_cache; > > + struct rhashtable_iter iter; > > + struct ext2_buffer *buf; > > + struct blk_plug plug; > > + > > + blk_start_plug(&plug); > > + rhashtable_walk_enter(buffer_cache, &iter); > > + rhashtable_walk_start(&iter); > > + while ((buf = rhashtable_walk_next(&iter)) != NULL) { > > + if (IS_ERR(buf)) > > + continue; > > + if (test_bit(EXT2_BUF_DIRTY_BIT, &buf->b_flags)) { > > + submit_buffer_write(sb, buf); > > + } > > + } > > + rhashtable_walk_stop(&iter); > > + rhashtable_walk_exit(&iter); > > + blk_finish_plug(&plug); > > + > > + return 0; > > +} > > I think we need some kind of periodic writeback as well. Agreed. One of the downsides (I think) of moving away from the block device pagecache is that we no longer get the automatic 30s dirty writeout that takes care of this for the block_device. Do you know how to do this? > Also fsync(2) will need to flush appropriate metadata buffers as well > and we need to track them together with the inode to which metadata > belongs. Are you talking about the file mapping blocks? I wonder if we could (re)use mapping->i_private_list for that purpose, though that could be messy given all the warnings in buffer.c about how that list tracks bufferheads and they must belong to the blockdev. But yes, it would be useful to be able to associate ext2_buffers with the owning inode for file metadata so that fsync can perform a targeted flush much the same way that we do for bufferheads now. > > +struct ext2_buffer *get_buffer(struct super_block *sb, sector_t block, bool need_uptodate) > > +{ > > + int err; > > + struct ext2_buffer *buf; > > + struct ext2_buffer *new_buf; > > + > > + buf = lookup_buffer_cache(sb, block); > > + > > + if (!buf) { > > + err = init_buffer(sb, block, &new_buf); > > + if (err) > > + return ERR_PTR(err); > > + > > + if (need_uptodate) { > > + err = submit_buffer_read(sb, new_buf); > > + if (err) > > + return ERR_PTR(err); > > + } > > + > > + buf = insert_buffer_cache(sb, new_buf); > > So this can return the old buffer as well in which case you need to free > the new one. (It looks like this was fixed in V2) > > + if (IS_ERR(buf)) > > + kfree(new_buf); > > + } > > + > > + mutex_lock(&buf->b_lock); > > + refcount_inc(&buf->b_refcount); > > So currently I don't see any use of the refcount. It's always incremented > when locking b_lock and decremented before unlocking it. I don't see very many uses of it either, though I think it's useful to prevent UAF problems on the ext2_buffers themselves. In the longer term, I think this buffer cache needs a shrinker, and the shrinker will want to ignore any buffers with refcount > 1. > Also locking b_lock whenever acquiring the buffer more or less works for > quota code but for more complex locking scenarios (xattrs come to mind) it > will not be really usable so we probably shouldn't mix get/put buffer and > locking of b_lock? What are the locking rules for xattrs? It looks like it locks the bufferhead whenever making updates to the xattr contents or the bh state; or when updating the mbcache. The get/list code doesn't seem to lock the bh, nor does the mbcache that provides deduplication. So I gather that inode->i_rwsem is a higher level lock to coordinate access to the xattr data, and the bh lock only coordinates updates to b_data or the bh state itself? > > +int ext2_get_block_bno(struct inode *inode, sector_t iblock, > > + int create, u32 *bno, bool *mapped) > > +{ > > + struct super_block *sb = inode->i_sb; > > + struct buffer_head tmp_bh; > > + int err; > > + > > + tmp_bh.b_state = 0; > > + tmp_bh.b_size = sb->s_blocksize; > > + > > + err = ext2_get_block(inode, iblock, &tmp_bh, 0); > > I suppose you need to pass the 'create' argument here? (I think this went away in V2) > > + if (err) > > + return err; > > + > > + *mapped = buffer_mapped(&tmp_bh); > > + *bno = tmp_bh.b_blocknr; > > + > > + return 0; > > +} > > + > > So overall I'd say there's more functionality needed to be able to replace > the buffer cache even for ext2. Agreed -- periodic writeback, a shrinker, and porting for inodes, inode bitmaps, mapping blocks, and xattr blocks are still to come before this can exit rfc status. --D