Re: warning on flushing page cache on block device removal

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi!

On Tue 03-06-25 13:33:02, Parav Pandit wrote:
> > From: Parav Pandit <parav@xxxxxxxxxx>
> > Sent: Tuesday, May 27, 2025 7:55 PM
> > 
> > 
> > > From: Jan Kara <jack@xxxxxxx>
> > > Sent: Tuesday, May 27, 2025 7:51 PM
> > >
> > > On Tue 27-05-25 12:07:20, Parav Pandit wrote:
> > > > > From: Jan Kara <jack@xxxxxxx>
> > > > > Sent: Tuesday, May 27, 2025 5:27 PM
> > > > >
> > > > > On Tue 27-05-25 11:00:56, Parav Pandit wrote:
> > > > > > > From: Jan Kara <jack@xxxxxxx>
> > > > > > > Sent: Monday, May 26, 2025 10:09 PM
> > > > > > >
> > > > > > > Hello!
> > > > > > >
> > > > > > > On Sat 24-05-25 05:56:55, Parav Pandit wrote:
> > > > > > > > I am running a basic test of block device driver unbind,
> > > > > > > > bind while the fio is running random write IOs with
> > > > > > > > direct=0.  The test hits the WARN_ON assert on:
> > > > > > > >
> > > > > > > > void pagecache_isize_extended(struct inode *inode, loff_t
> > > > > > > > from, loff_t
> > > > > > > > to) {
> > > > > > > >         int bsize = i_blocksize(inode);
> > > > > > > >         loff_t rounded_from;
> > > > > > > >         struct folio *folio;
> > > > > > > >
> > > > > > > >         WARN_ON(to > inode->i_size);
> > > > > > > >
> > > > > > > > This is because when the block device is removed during
> > > > > > > > driver unbind, the driver flow is,
> > > > > > > >
> > > > > > > > del_gendisk()
> > > > > > > >     __blk_mark_disk_dead()
> > > > > > > >             set_capacity((disk, 0);
> > > > > > > >                 bdev_set_nr_sectors()
> > > > > > > >                     i_size_write() -> This will set the
> > > > > > > > inode's isize to 0, while the
> > > > > > > page cache is yet to be flushed.
> > > > > > > >
> > > > > > > > Below is the kernel call trace.
> > > > > > > >
> > > > > > > > Can someone help to identify, where should be the fix?
> > > > > > > > Should block layer to not set the capacity to 0?
> > > > > > > > Or page catch to overcome this dynamic changing of the size?
> > > > > > > > Or?
> > > > > > >
> > > > > > > After thinking about this the proper fix would be for
> > > > > > > i_size_write() to happen under i_rwsem because the change in
> > > > > > > the middle of the write is what's confusing the iomap code. I
> > > > > > > smell some deadlock potential here but it's perhaps worth
> > > > > > > trying :)
> > > > > > >
> > > > > > Without it, I gave a quick try with inode_lock() unlock() in
> > > > > > i_size_write() and initramfs level it was stuck.  I am yet to
> > > > > > try with LOCKDEP.
> > > > >
> > > > > You definitely cannot put inode_lock() into i_size_write().
> > > > > i_size_write() is expected to be called under inode_lock. And
> > > > > bdev_set_nr_sectors() is breaking this rule by not holding it. So
> > > > > what you can try is to do
> > > > > inode_lock() in bdev_set_nr_sectors() instead of grabbing bd_size_lock.
> > > > >
> 
> I replaced the bd_size_lock with inode_lock().
> Was unable to reproduce the issue yet with the fix.
> 
> However, it right away breaks the Atari floppy driver who invokes
> set_capacity() in queue_rq() at [1]. !!
> 
> [1] https://elixir.bootlin.com/linux/v6.15/source/drivers/block/ataflop.c#L1544

Yeah, that's somewhat unexpected. Anyway, Atari floppy driver isn't exactly
a reliable source of proper locking...

> With my limited knowledge I find the fix risky as bottom block layer is
> invoking upper FS layer inode lock.  I suspect it may lead to A->B, B->A
> locking in some path.

Well, I'm suspecting that as well. That's why I've asked you to test it :)
Are you running with lockdep enabled? Because I'd expect it to complain
rather quickly. Possibly run blktests as well to exercise various
not-so-usual paths so that lockdep learns about various lock dependencies.

> Other than Atari floppy driver, I didn't find any other offending driver,
> but its hard to say, its safe from A->B, B->A deadlock.
> A = inode lock
> B = block driver level lock

								Honza

> > > > Ok. will try this.
> > > > I am off for few days on travel, so earliest I can do is on Sunday.
> > > >
> > > > > > I was thinking, can the existing sequence lock be used for
> > > > > > 64-bit case as well?
> > > > >
> > > > > The sequence lock is about updating inode->i_size value itself.
> > > > > But we need much larger scale protection here - we need to make
> > > > > sure write to the block device is not happening while the device
> > > > > size changes. And that's what inode_lock is usually used for.
> > > > >
> > > > Other option to explore (with my limited knowledge) is, When the
> > > > block device is removed, not to update the size,
> > > >
> > > > Because queue dying flag and other barriers are placed to prevent
> > > > the IOs
> > > entering lower layer or to fail them.
> > > > Can that be the direction to fix?
> > >
> > > Well, that's definitely one line of defense and it's enough for reads
> > > but for writes you don't want them to accumulate in the page cache
> > > (and thus consume memory) when you know you have no way to write
> > them
> > > out. So there needs to be some way for buffered writes to recognize
> > > the backing store is gone and stop them before dirtying pages.
> > > Currently that's achieved by reducing i_size, we can think of other
> > > mechanisms but reducing i_size is kind of elegant if we can synchronize that
> > properly...
> > >
> > The block device notifies the bio layer by calling
> > blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue); Maybe we can come
> > up with notification method that updates some flag to page cache layer to
> > drop buffered writes to floor.
> > 
> > Or other direction to explore, if the WAR_ON() is still valid, as it can change
> > anytime?
> > 
> > > 								Honza
> > > --
> > > Jan Kara <jack@xxxxxxxx>
> > > SUSE Labs, CR
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR




[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux