Re: warning on flushing page cache on block device removal

Jan Kara <jack@xxxxxxx> · Wed, 18 Jun 2025 17:09:19 +0200

On Wed 18-06-25 09:27:47, Parav Pandit wrote:
> Hi Jan and others,
> 
> > From: Parav Pandit
> > Sent: 03 June 2025 07:03 PM
> > To: Jan Kara <jack@xxxxxxx>
> > Cc: linux-block@xxxxxxxxxxxxxxx; linux-fsdevel@xxxxxxxxxxxxxxx
> > Subject: RE: warning on flushing page cache on block device removal
> > 
> > Hi Jan,
> > 
> > > From: Parav Pandit <parav@xxxxxxxxxx>
> > > Sent: Tuesday, May 27, 2025 7:55 PM
> > >
> > >
> > > > From: Jan Kara <jack@xxxxxxx>
> > > > Sent: Tuesday, May 27, 2025 7:51 PM
> > > >
> > > > On Tue 27-05-25 12:07:20, Parav Pandit wrote:
> > > > > > From: Jan Kara <jack@xxxxxxx>
> > > > > > Sent: Tuesday, May 27, 2025 5:27 PM
> > > > > >
> > > > > > On Tue 27-05-25 11:00:56, Parav Pandit wrote:
> > > > > > > > From: Jan Kara <jack@xxxxxxx>
> > > > > > > > Sent: Monday, May 26, 2025 10:09 PM
> > > > > > > >
> > > > > > > > Hello!
> > > > > > > >
> > > > > > > > On Sat 24-05-25 05:56:55, Parav Pandit wrote:
> > > > > > > > > I am running a basic test of block device driver unbind,
> > > > > > > > > bind while the fio is running random write IOs with
> > > > > > > > > direct=0.  The test hits the WARN_ON assert on:
> > > > > > > > >
> > > > > > > > > void pagecache_isize_extended(struct inode *inode, loff_t
> > > > > > > > > from, loff_t
> > > > > > > > > to) {
> > > > > > > > >         int bsize = i_blocksize(inode);
> > > > > > > > >         loff_t rounded_from;
> > > > > > > > >         struct folio *folio;
> > > > > > > > >
> > > > > > > > >         WARN_ON(to > inode->i_size);
> > > > > > > > >
> > > > > > > > > This is because when the block device is removed during
> > > > > > > > > driver unbind, the driver flow is,
> > > > > > > > >
> > > > > > > > > del_gendisk()
> > > > > > > > >     __blk_mark_disk_dead()
> > > > > > > > >             set_capacity((disk, 0);
> > > > > > > > >                 bdev_set_nr_sectors()
> > > > > > > > >                     i_size_write() -> This will set the
> > > > > > > > > inode's isize to 0, while the
> > > > > > > > page cache is yet to be flushed.
> > > > > > > > >
> > > > > > > > > Below is the kernel call trace.
> > > > > > > > >
> > > > > > > > > Can someone help to identify, where should be the fix?
> > > > > > > > > Should block layer to not set the capacity to 0?
> > > > > > > > > Or page catch to overcome this dynamic changing of the size?
> > > > > > > > > Or?
> > > > > > > >
> > > > > > > > After thinking about this the proper fix would be for
> > > > > > > > i_size_write() to happen under i_rwsem because the change in
> > > > > > > > the middle of the write is what's confusing the iomap code.
> > > > > > > > I smell some deadlock potential here but it's perhaps worth
> > > > > > > > trying :)
> > > > > > > >
> > > > > > > Without it, I gave a quick try with inode_lock() unlock() in
> > > > > > > i_size_write() and initramfs level it was stuck.  I am yet to
> > > > > > > try with LOCKDEP.
> > > > > >
> > > > > > You definitely cannot put inode_lock() into i_size_write().
> > > > > > i_size_write() is expected to be called under inode_lock. And
> > > > > > bdev_set_nr_sectors() is breaking this rule by not holding it.
> > > > > > So what you can try is to do
> > > > > > inode_lock() in bdev_set_nr_sectors() instead of grabbing bd_size_lock.
> > > > > >
> > 
> > I replaced the bd_size_lock with inode_lock().
> > Was unable to reproduce the issue yet with the fix.
> > 
> > However, it right away breaks the Atari floppy driver who invokes
> > set_capacity() in queue_rq() at [1]. !!
> > 
> > [1]
> > https://elixir.bootlin.com/linux/v6.15/source/drivers/block/ataflop.c#L1544
> > 
> > With my limited knowledge I find the fix risky as bottom block layer is invoking
> > upper FS layer inode lock.
> > I suspect it may lead to A->B, B->A locking in some path.
> > 
> > Other than Atari floppy driver, I didn't find any other offending driver, but its
> > hard to say, its safe from A->B, B->A deadlock.
> > A = inode lock
> > B = block driver level lock
> > 
> > > > > Ok. will try this.
> > > > > I am off for few days on travel, so earliest I can do is on Sunday.
> > > > >
> > > > > > > I was thinking, can the existing sequence lock be used for
> > > > > > > 64-bit case as well?
> > > > > >
> > > > > > The sequence lock is about updating inode->i_size value itself.
> > > > > > But we need much larger scale protection here - we need to make
> > > > > > sure write to the block device is not happening while the device
> > > > > > size changes. And that's what inode_lock is usually used for.
> > > > > >
> > > > > Other option to explore (with my limited knowledge) is, When the
> > > > > block device is removed, not to update the size,
> > > > >
> > > > > Because queue dying flag and other barriers are placed to prevent
> > > > > the IOs
> > > > entering lower layer or to fail them.
> > > > > Can that be the direction to fix?
> > > >
> > > > Well, that's definitely one line of defense and it's enough for
> > > > reads but for writes you don't want them to accumulate in the page
> > > > cache (and thus consume memory) when you know you have no way to
> > > > write
> > > them
> > > > out. So there needs to be some way for buffered writes to recognize
> > > > the backing store is gone and stop them before dirtying pages.
> > > > Currently that's achieved by reducing i_size, we can think of other
> > > > mechanisms but reducing i_size is kind of elegant if we can
> > > > synchronize that
> > > properly...
> > > >
> > > The block device notifies the bio layer by calling
> > > blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue); Maybe we can
> > come
> > > up with notification method that updates some flag to page cache layer
> > > to drop buffered writes to floor.
> > >
> > > Or other direction to explore, if the WAR_ON() is still valid, as it
> > > can change anytime?
> > >
> 
> Is below WARN_ON() still valid, given the disk size can change any time?
> 
> void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to) {
>         int bsize = i_blocksize(inode);
>         loff_t rounded_from;
>         struct folio *folio;
> 
>         WARN_ON(to > inode->i_size);

Yes, the WARN_ON is valid. It complains that the zeroing of tail pages on
extension of the file didn't do it's work properly. This is important for
regular files. For block devices the situation is different as they cannot
really be extended by writes so this code shouldn't ever trigger (but it
does because the device invalidation trims i_size to 0 in the middle of the
write).

One relatively easy workaround would be to teach iomap_write_iter() to not
touch i_size (and call pagecache_isize_extended()) for S_ISBLK inodes. That
will at least fix the immediate problems you can hit without too much pain
but the underlying problem that block device code should be holding i_rwsem
when modifying i_size remains and can bite us in other places...

								Honza

-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR