Re: [RFC[RAP] 1/2] block: fix race between set_blocksize and read paths

Christoph Hellwig <hch@xxxxxxxxxxxxx> · Tue, 15 Apr 2025 21:41:32 -0700

On Mon, Apr 14, 2025 at 05:14:05PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@xxxxxxxxxx>
> 
> With the new large sector size support, it's now the case that
> set_blocksize can change i_blksize and the folio order in a manner that
> conflicts with a concurrent reader and causes a kernel crash.
> 
> Specifically, let's say that udev-worker calls libblkid to detect the
> labels on a block device.  The read call can create an order-0 folio to
> read the first 4096 bytes from the disk.  But then udev is preempted.
> 
> Next, someone tries to mount an 8k-sectorsize filesystem from the same
> block device.  The filesystem calls set_blksize, which sets i_blksize to
> 8192 and the minimum folio order to 1.
> 
> Now udev resumes, still holding the order-0 folio it allocated.  It then
> tries to schedule a read bio and do_mpage_readahead tries to create
> bufferheads for the folio.  Unfortunately, blocks_per_folio == 0 because
> the page size is 4096 but the blocksize is 8192 so no bufferheads are
> attached and the bh walk never sets bdev.  We then submit the bio with a
> NULL block device and crash.
> 

Do we have a testcase for blktests or xfstests for this?  The issue is
subtle and some of the code in the patch looks easy to accidentally
break again (not the fault of this patch primarily).

>  	} else {
> +		inode_lock_shared(bd_inode);
>  		ret = blkdev_buffered_write(iocb, from);
> +		inode_unlock_shared(bd_inode);

Does this need a comment why we take i_rwsem?

> +	inode_lock_shared(bd_inode);
>  	ret = filemap_read(iocb, to, ret);
> +	inode_unlock_shared(bd_inode);

Same here.  Especially as the protection is now heavier than for most
file systems.

I also wonder if we need locking asserts in some of the write side
functions that expect the shared inode lock and invalidate lock now?