RE: warning on flushing page cache on block device removal

Parav Pandit <parav@xxxxxxxxxx> · Wed, 4 Jun 2025 14:32:34 +0000

> From: Jan Kara <jack@xxxxxxx>
> Sent: Wednesday, June 4, 2025 7:53 PM
> 
> Hi!
> 
> On Tue 03-06-25 13:33:02, Parav Pandit wrote:
> > > From: Parav Pandit <parav@xxxxxxxxxx>
> > > Sent: Tuesday, May 27, 2025 7:55 PM
> > >
> > >
> > > > From: Jan Kara <jack@xxxxxxx>
> > > > Sent: Tuesday, May 27, 2025 7:51 PM
> > > >
> > > > On Tue 27-05-25 12:07:20, Parav Pandit wrote:
> > > > > > From: Jan Kara <jack@xxxxxxx>
> > > > > > Sent: Tuesday, May 27, 2025 5:27 PM
> > > > > >
> > > > > > On Tue 27-05-25 11:00:56, Parav Pandit wrote:
> > > > > > > > From: Jan Kara <jack@xxxxxxx>
> > > > > > > > Sent: Monday, May 26, 2025 10:09 PM
> > > > > > > >
> > > > > > > > Hello!
> > > > > > > >
> > > > > > > > On Sat 24-05-25 05:56:55, Parav Pandit wrote:
> > > > > > > > > I am running a basic test of block device driver unbind,
> > > > > > > > > bind while the fio is running random write IOs with
> > > > > > > > > direct=0.  The test hits the WARN_ON assert on:
> > > > > > > > >
> > > > > > > > > void pagecache_isize_extended(struct inode *inode,
> > > > > > > > > loff_t from, loff_t
> > > > > > > > > to) {
> > > > > > > > >         int bsize = i_blocksize(inode);
> > > > > > > > >         loff_t rounded_from;
> > > > > > > > >         struct folio *folio;
> > > > > > > > >
> > > > > > > > >         WARN_ON(to > inode->i_size);
> > > > > > > > >
> > > > > > > > > This is because when the block device is removed during
> > > > > > > > > driver unbind, the driver flow is,
> > > > > > > > >
> > > > > > > > > del_gendisk()
> > > > > > > > >     __blk_mark_disk_dead()
> > > > > > > > >             set_capacity((disk, 0);
> > > > > > > > >                 bdev_set_nr_sectors()
> > > > > > > > >                     i_size_write() -> This will set the
> > > > > > > > > inode's isize to 0, while the
> > > > > > > > page cache is yet to be flushed.
> > > > > > > > >
> > > > > > > > > Below is the kernel call trace.
> > > > > > > > >
> > > > > > > > > Can someone help to identify, where should be the fix?
> > > > > > > > > Should block layer to not set the capacity to 0?
> > > > > > > > > Or page catch to overcome this dynamic changing of the size?
> > > > > > > > > Or?
> > > > > > > >
> > > > > > > > After thinking about this the proper fix would be for
> > > > > > > > i_size_write() to happen under i_rwsem because the change
> > > > > > > > in the middle of the write is what's confusing the iomap
> > > > > > > > code. I smell some deadlock potential here but it's
> > > > > > > > perhaps worth trying :)
> > > > > > > >
> > > > > > > Without it, I gave a quick try with inode_lock() unlock() in
> > > > > > > i_size_write() and initramfs level it was stuck.  I am yet
> > > > > > > to try with LOCKDEP.
> > > > > >
> > > > > > You definitely cannot put inode_lock() into i_size_write().
> > > > > > i_size_write() is expected to be called under inode_lock. And
> > > > > > bdev_set_nr_sectors() is breaking this rule by not holding it.
> > > > > > So what you can try is to do
> > > > > > inode_lock() in bdev_set_nr_sectors() instead of grabbing
> bd_size_lock.
> > > > > >
> >
> > I replaced the bd_size_lock with inode_lock().
> > Was unable to reproduce the issue yet with the fix.
> >
> > However, it right away breaks the Atari floppy driver who invokes
> > set_capacity() in queue_rq() at [1]. !!
> >
> > [1]
> > https://elixir.bootlin.com/linux/v6.15/source/drivers/block/ataflop.c#
> > L1544
> 
> Yeah, that's somewhat unexpected. Anyway, Atari floppy driver isn't exactly a
> reliable source of proper locking...
> 
True. I find it weird for one to change the capacity in middle of serving the block request.
Must be very old code.

> > With my limited knowledge I find the fix risky as bottom block layer
> > is invoking upper FS layer inode lock.  I suspect it may lead to A->B,
> > B->A locking in some path.
> 
> Well, I'm suspecting that as well. That's why I've asked you to test it :) Are you
> running with lockdep enabled? Because I'd expect it to complain rather quickly.
> Possibly run blktests as well to exercise various not-so-usual paths so that
> lockdep learns about various lock dependencies.
> 
Yes, with lockdep enabled. Nothing reported in this area.
But its test with only single block device = virtio block and with LSI Fusion MPT SAS scsi drives.

Seeing lock dep asserts in other areas not on block side yet. 😊

> > Other than Atari floppy driver, I didn't find any other offending
> > driver, but its hard to say, its safe from A->B, B->A deadlock.
> > A = inode lock
> > B = block driver level lock
> 
> 								Honza
> 
> > > > > Ok. will try this.
> > > > > I am off for few days on travel, so earliest I can do is on Sunday.
> > > > >
> > > > > > > I was thinking, can the existing sequence lock be used for
> > > > > > > 64-bit case as well?
> > > > > >
> > > > > > The sequence lock is about updating inode->i_size value itself.
> > > > > > But we need much larger scale protection here - we need to
> > > > > > make sure write to the block device is not happening while the
> > > > > > device size changes. And that's what inode_lock is usually used for.
> > > > > >
> > > > > Other option to explore (with my limited knowledge) is, When the
> > > > > block device is removed, not to update the size,
> > > > >
> > > > > Because queue dying flag and other barriers are placed to
> > > > > prevent the IOs
> > > > entering lower layer or to fail them.
> > > > > Can that be the direction to fix?
> > > >
> > > > Well, that's definitely one line of defense and it's enough for
> > > > reads but for writes you don't want them to accumulate in the page
> > > > cache (and thus consume memory) when you know you have no way to
> > > > write
> > > them
> > > > out. So there needs to be some way for buffered writes to
> > > > recognize the backing store is gone and stop them before dirtying pages.
> > > > Currently that's achieved by reducing i_size, we can think of
> > > > other mechanisms but reducing i_size is kind of elegant if we can
> > > > synchronize that
> > > properly...
> > > >
> > > The block device notifies the bio layer by calling
> > > blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue); Maybe we can
> come
> > > up with notification method that updates some flag to page cache
> > > layer to drop buffered writes to floor.
> > >
> > > Or other direction to explore, if the WAR_ON() is still valid, as it
> > > can change anytime?
> > >
> > > > 								Honza
> > > > --
> > > > Jan Kara <jack@xxxxxxxx>
> > > > SUSE Labs, CR
> --
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR