Re: [RFC[RAP] 1/2] block: fix race between set_blocksize and read paths

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Tue, 15 Apr 2025 22:01:44 -0700

On Tue, Apr 15, 2025 at 09:41:32PM -0700, Christoph Hellwig wrote:
> On Mon, Apr 14, 2025 at 05:14:05PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@xxxxxxxxxx>
> > 
> > With the new large sector size support, it's now the case that
> > set_blocksize can change i_blksize and the folio order in a manner that
> > conflicts with a concurrent reader and causes a kernel crash.
> > 
> > Specifically, let's say that udev-worker calls libblkid to detect the
> > labels on a block device.  The read call can create an order-0 folio to
> > read the first 4096 bytes from the disk.  But then udev is preempted.
> > 
> > Next, someone tries to mount an 8k-sectorsize filesystem from the same
> > block device.  The filesystem calls set_blksize, which sets i_blksize to
> > 8192 and the minimum folio order to 1.
> > 
> > Now udev resumes, still holding the order-0 folio it allocated.  It then
> > tries to schedule a read bio and do_mpage_readahead tries to create
> > bufferheads for the folio.  Unfortunately, blocks_per_folio == 0 because
> > the page size is 4096 but the blocksize is 8192 so no bufferheads are
> > attached and the bh walk never sets bdev.  We then submit the bio with a
> > NULL block device and crash.
> > 
> 
> Do we have a testcase for blktests or xfstests for this?  The issue is
> subtle and some of the code in the patch looks easy to accidentally
> break again (not the fault of this patch primarily).

It's the same patch as:
https://lore.kernel.org/linux-fsdevel/20250408175125.GL6266@frogsfrogsfrogs/

which is to say, xfs/032 with while true; do blkid; done running in the
background to increase the chances of a collision.

> >  	} else {
> > +		inode_lock_shared(bd_inode);
> >  		ret = blkdev_buffered_write(iocb, from);
> > +		inode_unlock_shared(bd_inode);
> 
> Does this need a comment why we take i_rwsem?
> 
> > +	inode_lock_shared(bd_inode);
> >  	ret = filemap_read(iocb, to, ret);
> > +	inode_unlock_shared(bd_inode);
> 
> Same here.  Especially as the protection is now heavier than for most
> file systems.

Yeah, somewhere we need a better comment.  How about this for
set_blocksize:

	/*
	 * Flush and truncate the pagecache before we reconfigure the
	 * mapping geometry because folio sizes are variable now.  If
	 * a reader has already allocated a folio whose size is smaller
	 * than the new min_order but invokes readahead after the new
	 * min_order becomes visible, readahead will think there are
	 * "zero" blocks per folio and crash.
	 */

And then the read/write paths can say something simpler:

	/*
	 * Take i_rwsem and invalidate_lock to avoid racing with a
	 * blocksize change punching out the pagecache.
	 */

> I also wonder if we need locking asserts in some of the write side
> functions that expect the shared inode lock and invalidate lock now?

Probably.  Do you have specific places in mind?

--D