On Tue, Apr 15, 2025 at 09:41:32PM -0700, Christoph Hellwig wrote: > On Mon, Apr 14, 2025 at 05:14:05PM -0700, Darrick J. Wong wrote: > > From: Darrick J. Wong <djwong@xxxxxxxxxx> > > > > With the new large sector size support, it's now the case that > > set_blocksize can change i_blksize and the folio order in a manner that > > conflicts with a concurrent reader and causes a kernel crash. > > > > Specifically, let's say that udev-worker calls libblkid to detect the > > labels on a block device. The read call can create an order-0 folio to > > read the first 4096 bytes from the disk. But then udev is preempted. > > > > Next, someone tries to mount an 8k-sectorsize filesystem from the same > > block device. The filesystem calls set_blksize, which sets i_blksize to > > 8192 and the minimum folio order to 1. > > > > Now udev resumes, still holding the order-0 folio it allocated. It then > > tries to schedule a read bio and do_mpage_readahead tries to create > > bufferheads for the folio. Unfortunately, blocks_per_folio == 0 because > > the page size is 4096 but the blocksize is 8192 so no bufferheads are > > attached and the bh walk never sets bdev. We then submit the bio with a > > NULL block device and crash. > > > > Do we have a testcase for blktests or xfstests for this? The issue is > subtle and some of the code in the patch looks easy to accidentally > break again (not the fault of this patch primarily). It's the same patch as: https://lore.kernel.org/linux-fsdevel/20250408175125.GL6266@frogsfrogsfrogs/ which is to say, xfs/032 with while true; do blkid; done running in the background to increase the chances of a collision. > > } else { > > + inode_lock_shared(bd_inode); > > ret = blkdev_buffered_write(iocb, from); > > + inode_unlock_shared(bd_inode); > > Does this need a comment why we take i_rwsem? > > > + inode_lock_shared(bd_inode); > > ret = filemap_read(iocb, to, ret); > > + inode_unlock_shared(bd_inode); > > Same here. Especially as the protection is now heavier than for most > file systems. Yeah, somewhere we need a better comment. How about this for set_blocksize: /* * Flush and truncate the pagecache before we reconfigure the * mapping geometry because folio sizes are variable now. If * a reader has already allocated a folio whose size is smaller * than the new min_order but invokes readahead after the new * min_order becomes visible, readahead will think there are * "zero" blocks per folio and crash. */ And then the read/write paths can say something simpler: /* * Take i_rwsem and invalidate_lock to avoid racing with a * blocksize change punching out the pagecache. */ > I also wonder if we need locking asserts in some of the write side > functions that expect the shared inode lock and invalidate lock now? Probably. Do you have specific places in mind? --D