> From: Jan Kara <jack@xxxxxxx> > Sent: Wednesday, June 4, 2025 7:53 PM > > Hi! > > On Tue 03-06-25 13:33:02, Parav Pandit wrote: > > > From: Parav Pandit <parav@xxxxxxxxxx> > > > Sent: Tuesday, May 27, 2025 7:55 PM > > > > > > > > > > From: Jan Kara <jack@xxxxxxx> > > > > Sent: Tuesday, May 27, 2025 7:51 PM > > > > > > > > On Tue 27-05-25 12:07:20, Parav Pandit wrote: > > > > > > From: Jan Kara <jack@xxxxxxx> > > > > > > Sent: Tuesday, May 27, 2025 5:27 PM > > > > > > > > > > > > On Tue 27-05-25 11:00:56, Parav Pandit wrote: > > > > > > > > From: Jan Kara <jack@xxxxxxx> > > > > > > > > Sent: Monday, May 26, 2025 10:09 PM > > > > > > > > > > > > > > > > Hello! > > > > > > > > > > > > > > > > On Sat 24-05-25 05:56:55, Parav Pandit wrote: > > > > > > > > > I am running a basic test of block device driver unbind, > > > > > > > > > bind while the fio is running random write IOs with > > > > > > > > > direct=0. The test hits the WARN_ON assert on: > > > > > > > > > > > > > > > > > > void pagecache_isize_extended(struct inode *inode, > > > > > > > > > loff_t from, loff_t > > > > > > > > > to) { > > > > > > > > > int bsize = i_blocksize(inode); > > > > > > > > > loff_t rounded_from; > > > > > > > > > struct folio *folio; > > > > > > > > > > > > > > > > > > WARN_ON(to > inode->i_size); > > > > > > > > > > > > > > > > > > This is because when the block device is removed during > > > > > > > > > driver unbind, the driver flow is, > > > > > > > > > > > > > > > > > > del_gendisk() > > > > > > > > > __blk_mark_disk_dead() > > > > > > > > > set_capacity((disk, 0); > > > > > > > > > bdev_set_nr_sectors() > > > > > > > > > i_size_write() -> This will set the > > > > > > > > > inode's isize to 0, while the > > > > > > > > page cache is yet to be flushed. > > > > > > > > > > > > > > > > > > Below is the kernel call trace. > > > > > > > > > > > > > > > > > > Can someone help to identify, where should be the fix? > > > > > > > > > Should block layer to not set the capacity to 0? > > > > > > > > > Or page catch to overcome this dynamic changing of the size? > > > > > > > > > Or? > > > > > > > > > > > > > > > > After thinking about this the proper fix would be for > > > > > > > > i_size_write() to happen under i_rwsem because the change > > > > > > > > in the middle of the write is what's confusing the iomap > > > > > > > > code. I smell some deadlock potential here but it's > > > > > > > > perhaps worth trying :) > > > > > > > > > > > > > > > Without it, I gave a quick try with inode_lock() unlock() in > > > > > > > i_size_write() and initramfs level it was stuck. I am yet > > > > > > > to try with LOCKDEP. > > > > > > > > > > > > You definitely cannot put inode_lock() into i_size_write(). > > > > > > i_size_write() is expected to be called under inode_lock. And > > > > > > bdev_set_nr_sectors() is breaking this rule by not holding it. > > > > > > So what you can try is to do > > > > > > inode_lock() in bdev_set_nr_sectors() instead of grabbing > bd_size_lock. > > > > > > > > > > I replaced the bd_size_lock with inode_lock(). > > Was unable to reproduce the issue yet with the fix. > > > > However, it right away breaks the Atari floppy driver who invokes > > set_capacity() in queue_rq() at [1]. !! > > > > [1] > > https://elixir.bootlin.com/linux/v6.15/source/drivers/block/ataflop.c# > > L1544 > > Yeah, that's somewhat unexpected. Anyway, Atari floppy driver isn't exactly a > reliable source of proper locking... > True. I find it weird for one to change the capacity in middle of serving the block request. Must be very old code. > > With my limited knowledge I find the fix risky as bottom block layer > > is invoking upper FS layer inode lock. I suspect it may lead to A->B, > > B->A locking in some path. > > Well, I'm suspecting that as well. That's why I've asked you to test it :) Are you > running with lockdep enabled? Because I'd expect it to complain rather quickly. > Possibly run blktests as well to exercise various not-so-usual paths so that > lockdep learns about various lock dependencies. > Yes, with lockdep enabled. Nothing reported in this area. But its test with only single block device = virtio block and with LSI Fusion MPT SAS scsi drives. Seeing lock dep asserts in other areas not on block side yet. 😊 > > Other than Atari floppy driver, I didn't find any other offending > > driver, but its hard to say, its safe from A->B, B->A deadlock. > > A = inode lock > > B = block driver level lock > > Honza > > > > > > Ok. will try this. > > > > > I am off for few days on travel, so earliest I can do is on Sunday. > > > > > > > > > > > > I was thinking, can the existing sequence lock be used for > > > > > > > 64-bit case as well? > > > > > > > > > > > > The sequence lock is about updating inode->i_size value itself. > > > > > > But we need much larger scale protection here - we need to > > > > > > make sure write to the block device is not happening while the > > > > > > device size changes. And that's what inode_lock is usually used for. > > > > > > > > > > > Other option to explore (with my limited knowledge) is, When the > > > > > block device is removed, not to update the size, > > > > > > > > > > Because queue dying flag and other barriers are placed to > > > > > prevent the IOs > > > > entering lower layer or to fail them. > > > > > Can that be the direction to fix? > > > > > > > > Well, that's definitely one line of defense and it's enough for > > > > reads but for writes you don't want them to accumulate in the page > > > > cache (and thus consume memory) when you know you have no way to > > > > write > > > them > > > > out. So there needs to be some way for buffered writes to > > > > recognize the backing store is gone and stop them before dirtying pages. > > > > Currently that's achieved by reducing i_size, we can think of > > > > other mechanisms but reducing i_size is kind of elegant if we can > > > > synchronize that > > > properly... > > > > > > > The block device notifies the bio layer by calling > > > blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue); Maybe we can > come > > > up with notification method that updates some flag to page cache > > > layer to drop buffered writes to floor. > > > > > > Or other direction to explore, if the WAR_ON() is still valid, as it > > > can change anytime? > > > > > > > Honza > > > > -- > > > > Jan Kara <jack@xxxxxxxx> > > > > SUSE Labs, CR > -- > Jan Kara <jack@xxxxxxxx> > SUSE Labs, CR