On Thu, Jun 26, 2025 at 08:48:56AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
>
> On shutdown when quotas are enabled, the shutdown can deadlock
> trying to unpin the dquot buffer buf_log_item like so:

Reviewed-by: Carlos Maiolino <cmaiolino@xxxxxxxxxx>

>
> [ 3319.483590] task:kworker/20:0H state:D stack:14360 pid:1962230 tgid:1962230 ppid:2 task_flags:0x4208060 flags:0x00004000
> [ 3319.493966] Workqueue: xfs-log/dm-6 xlog_ioend_work
> [ 3319.498458] Call Trace:
> [ 3319.500800]  <TASK>
> [ 3319.502809]  __schedule+0x699/0xb70
> [ 3319.512672]  schedule+0x64/0xd0
> [ 3319.515573]  schedule_timeout+0x30/0xf0
> [ 3319.528125]  __down_common+0xc3/0x200
> [ 3319.531488]  __down+0x1d/0x30
> [ 3319.534186]  down+0x48/0x50
> [ 3319.540501]  xfs_buf_lock+0x3d/0xe0
> [ 3319.543609]  xfs_buf_item_unpin+0x85/0x1b0
> [ 3319.547248]  xlog_cil_committed+0x289/0x570
> [ 3319.571411]  xlog_cil_process_committed+0x6d/0x90
> [ 3319.575590]  xlog_state_shutdown_callbacks+0x52/0x110
> [ 3319.580017]  xlog_force_shutdown+0x169/0x1a0
> [ 3319.583780]  xlog_ioend_work+0x7c/0xb0
> [ 3319.587049]  process_scheduled_works+0x1d6/0x400
> [ 3319.591127]  worker_thread+0x202/0x2e0
> [ 3319.594452]  kthread+0x20c/0x240
>
> The CIL push has seen the deadlock, so it has aborted the push and
> is running CIL checkpoint completion to abort all the items in the
> checkpoint. This calls ->iop_unpin(remove = true) to clean up the
> log items in the checkpoint.
>
> When a buffer log item is unpinned like this, it needs to lock the
> buffer to run IO completion to correctly fail the buffer and run all
> the required completions to fail attached log items as well. In this
> case, the attempt to lock the buffer on unpin is hanging because the
> buffer is already locked.
>
> I suspected a leaked XFS_BLI_HOLD state because of XFS_BLI_STALE
> handling changes I was testing, so I went looking for
> pin events on HOLD buffers and unpin events on locked buffers. That
> isolated this one buffer with these two events:
>
> xfs_buf_item_pin: dev 251:6 daddr 0xa910 bbcount 0x2 hold 2 pincount 0 lock 0 flags DONE|KMEM recur 0 refcount 1 bliflags HOLD|DIRTY|LOGGED liflags DIRTY
> ....
> xfs_buf_item_unpin: dev 251:6 daddr 0xa910 bbcount 0x2 hold 4 pincount 1 lock 0 flags DONE|KMEM recur 0 refcount 1 bliflags DIRTY liflags ABORTED
>
> Firstly, bbcount = 0x2, which means it is not a single sector
> structure. That rules out every xfs_trans_bhold() case except one:
> dquot buffers.
>
> Then hung task dumping gave this trace:
>
> [ 3197.312078] task:fsync-tester state:D stack:12080 pid:2051125 tgid:2051125 ppid:1643233 task_flags:0x400000 flags:0x00004002
> [ 3197.323007] Call Trace:
> [ 3197.325581]  <TASK>
> [ 3197.327727]  __schedule+0x699/0xb70
> [ 3197.334582]  schedule+0x64/0xd0
> [ 3197.337672]  schedule_timeout+0x30/0xf0
> [ 3197.350139]  wait_for_completion+0xbd/0x180
> [ 3197.354235]  __flush_workqueue+0xef/0x4e0
> [ 3197.362229]  xlog_cil_force_seq+0xa0/0x300
> [ 3197.374447]  xfs_log_force+0x77/0x230
> [ 3197.378015]  xfs_qm_dqunpin_wait+0x49/0xf0
> [ 3197.382010]  xfs_qm_dqflush+0x55/0x460
> [ 3197.385663]  xfs_qm_dquot_isolate+0x29e/0x4d0
> [ 3197.389977]  __list_lru_walk_one+0x141/0x220
> [ 3197.398867]  list_lru_walk_one+0x10/0x20
> [ 3197.402713]  xfs_qm_shrink_scan+0x6a/0x100
> [ 3197.406699]  do_shrink_slab+0x18a/0x350
> [ 3197.410512]  shrink_slab+0xf7/0x430
> [ 3197.413967]  drop_slab+0x97/0xf0
> [ 3197.417121]  drop_caches_sysctl_handler+0x59/0xc0
> [ 3197.421654]  proc_sys_call_handler+0x18b/0x280
> [ 3197.426050]  proc_sys_write+0x13/0x20
> [ 3197.429750]  vfs_write+0x2b8/0x3e0
> [ 3197.438532]  ksys_write+0x7e/0xf0
> [ 3197.441742]  __x64_sys_write+0x1b/0x30
> [ 3197.445363]  x64_sys_call+0x2c72/0x2f60
> [ 3197.449044]  do_syscall_64+0x6c/0x140
> [ 3197.456341]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> Yup, another test run by check-parallel is running drop_caches
> concurrently and the dquot shrinker for the hung filesystem is
> running. That's trying to flush a dirty dquot from reclaim context,
> and it is waiting on a log force to complete. xfs_qm_dqflush is
> called with the dquot buffer held locked, and so we've called
> xfs_log_force() with that buffer locked.
>
> Now the log force is waiting for a workqueue flush to complete, and
> that workqueue flush is waiting on CIL checkpoint processing to
> finish.
>
> The CIL checkpoint processing is aborting all the log items it has,
> and that requires locking aborted buffers to cancel them.
>
> Now, normally this isn't a problem if we are issuing a log force
> to unpin an object, because the ->iop_unpin() method wakes pin
> waiters first. That results in the pin waiter finishing off whatever
> it was doing, dropping the lock and then xfs_buf_item_unpin() can
> lock the buffer and fail it.
>
> However, xfs_qm_dqflush() is waiting on the -dquot- unpin event, not
> the dquot buffer unpin event, and so it never gets woken and so does
> not drop the buffer lock.
>
> Inodes do not have this problem, as they can only be written from
> one spot (->iop_push) whilst dquots can be written from multiple
> places (memory reclaim, ->iop_push, xfs_qm_dqpurge, and quotacheck).
>
> The reason that the dquot buffer has an attached buffer log item is
> that it has been recently allocated. Initialisation of the dquot
> buffer logs the buffer directly, thereby pinning it in memory. We
> then modify the dquot in a separate operation, and have memory
> reclaim racing with a shutdown and we trigger this deadlock.
>
> check-parallel reproduces this reliably on 1kB FSB filesystems with
> quota enabled because it does all of these things concurrently
> without having to explicitly write tests to exercise these corner
> case conditions.
>
> xfs_qm_dquot_logitem_push() doesn't have this deadlock because it
> checks if the dquot is pinned before locking the dquot buffer and
> skips it if it is pinned. This means the xfs_qm_dqunpin_wait()
> log force in xfs_qm_dqflush() never triggers and we unlock the
> buffer safely, allowing a concurrent shutdown to fail the buffer
> appropriately.
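For readers who don't have the dquot log item code in front of them, the check being referred to is roughly the fragment below. This is a simplified, illustrative sketch of the pinned-dquot bail-out in the ->iop_push path, not a verbatim copy of xfs_qm_dquot_logitem_push(); the field and constant names are taken from the existing kernel code.

	/*
	 * Sketch: the AIL push backs off a pinned dquot rather than trying
	 * to flush it, so it never issues a log force while holding the
	 * dquot buffer lock. The AIL forces the log and retries later.
	 */
	if (atomic_read(&dqp->q_pincount) > 0)
		return XFS_ITEM_PINNED;

The rest of the patch makes the other dquot writeback paths either back off in the same way (memory reclaim) or explicitly wait for the pin count to drain before taking the flush lock (purge and quotacheck).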
>
> xfs_qm_dqpurge() could have this problem as it is called from
> quotacheck and we might have allocated dquot buffers when recording
> the quota updates. This can be fixed by calling
> xfs_qm_dqunpin_wait() before we lock the dquot buffer. Because we
> hold the dquot locked, nothing will be able to add to the pin count
> between the unpin_wait and the dqflush callout, so this now makes
> xfs_qm_dqpurge() safe against this race.
>
> xfs_qm_dquot_isolate() can also be fixed this same way but, quite
> frankly, we shouldn't be doing IO in memory reclaim context. If the
> dquot is pinned or dirty, simply rotate it and let memory reclaim
> come back to it later, same as we do for inodes.
>
> This then gets rid of the nasty issue in xfs_qm_flush_one() where
> quotacheck writeback races with memory reclaim flushing the dquots.
> We can lift xfs_qm_dqunpin_wait() up into this code, then get rid of
> the "can't get the dqflush lock" buffer write to cycle the dqflush
> lock and enable it to be flushed again, instead checking if the
> dquot is pinned and returning -EAGAIN so that the dquot walk will
> revisit the dquot again later.
>
> Finally, with xfs_qm_dqunpin_wait() lifted into all the callers,
> we can remove it from the xfs_qm_dqflush() code.
>
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> ---
>  fs/xfs/xfs_buf.c | 38 --------------------
>  fs/xfs/xfs_buf.h | 1 -
>  fs/xfs/xfs_dquot.c | 4 +--
>  fs/xfs/xfs_qm.c | 86 ++++++++++------------------------------------
>  fs/xfs/xfs_trace.h | 1 -
>  5 files changed, 20 insertions(+), 110 deletions(-)
>
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 8af83bd161f9..ba5bd6031ece 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -2082,44 +2082,6 @@ xfs_buf_delwri_submit(
>  	return error;
>  }
>  
> -/*
> - * Push a single buffer on a delwri queue.
> - *
> - * The purpose of this function is to submit a single buffer of a delwri queue
> - * and return with the buffer still on the original queue.
> - *
> - * The buffer locking and queue management logic between _delwri_pushbuf() and
> - * _delwri_queue() guarantee that the buffer cannot be queued to another list
> - * before returning.
> - */
> -int
> -xfs_buf_delwri_pushbuf(
> -	struct xfs_buf		*bp,
> -	struct list_head	*buffer_list)
> -{
> -	int			error;
> -
> -	ASSERT(bp->b_flags & _XBF_DELWRI_Q);
> -
> -	trace_xfs_buf_delwri_pushbuf(bp, _RET_IP_);
> -
> -	xfs_buf_lock(bp);
> -	bp->b_flags &= ~(_XBF_DELWRI_Q | XBF_ASYNC);
> -	bp->b_flags |= XBF_WRITE;
> -	xfs_buf_submit(bp);
> -
> -	/*
> -	 * The buffer is now locked, under I/O but still on the original delwri
> -	 * queue. Wait for I/O completion, restore the DELWRI_Q flag and
> -	 * return with the buffer unlocked and still on the original queue.
> -	 */
> -	error = xfs_buf_iowait(bp);
> -	bp->b_flags |= _XBF_DELWRI_Q;
> -	xfs_buf_unlock(bp);
> -
> -	return error;
> -}
> -
>  void xfs_buf_set_ref(struct xfs_buf *bp, int lru_ref)
>  {
>  	/*
> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> index 9d2ab567cf81..15fc56948346 100644
> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -326,7 +326,6 @@ extern bool xfs_buf_delwri_queue(struct xfs_buf *, struct list_head *);
>  void xfs_buf_delwri_queue_here(struct xfs_buf *bp, struct list_head *bl);
>  extern int xfs_buf_delwri_submit(struct list_head *);
>  extern int xfs_buf_delwri_submit_nowait(struct list_head *);
> -extern int xfs_buf_delwri_pushbuf(struct xfs_buf *, struct list_head *);
>  
>  static inline xfs_daddr_t xfs_buf_daddr(struct xfs_buf *bp)
>  {
> diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
> index b4e32f0860b7..0bd8022e47b4 100644
> --- a/fs/xfs/xfs_dquot.c
> +++ b/fs/xfs/xfs_dquot.c
> @@ -1398,11 +1398,9 @@ xfs_qm_dqflush(
>  
>  	ASSERT(XFS_DQ_IS_LOCKED(dqp));
>  	ASSERT(!completion_done(&dqp->q_flush));
> +	ASSERT(atomic_read(&dqp->q_pincount) == 0);
>  
>  	trace_xfs_dqflush(dqp);
> -
> -	xfs_qm_dqunpin_wait(dqp);
> -
>  	fa = xfs_qm_dqflush_check(dqp);
>  	if (fa) {
>  		xfs_alert(mp, "corrupt dquot ID 0x%x in memory at %pS",
> diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> index 417439b58785..fa135ac26471 100644
> --- a/fs/xfs/xfs_qm.c
> +++ b/fs/xfs/xfs_qm.c
> @@ -134,6 +134,7 @@ xfs_qm_dqpurge(
>  
>  	dqp->q_flags |= XFS_DQFLAG_FREEING;
>  
> +	xfs_qm_dqunpin_wait(dqp);
>  	xfs_dqflock(dqp);
>  
>  	/*
> @@ -465,6 +466,7 @@ xfs_qm_dquot_isolate(
>  	struct xfs_dquot	*dqp = container_of(item,
>  					struct xfs_dquot, q_lru);
>  	struct xfs_qm_isolate	*isol = arg;
> +	enum lru_status		ret = LRU_SKIP;
>  
>  	if (!xfs_dqlock_nowait(dqp))
>  		goto out_miss_busy;
> @@ -477,6 +479,16 @@ xfs_qm_dquot_isolate(
>  	if (dqp->q_flags & XFS_DQFLAG_FREEING)
>  		goto out_miss_unlock;
>  
> +	/*
> +	 * If the dquot is pinned or dirty, rotate it to the end of the LRU to
> +	 * give some time for it to be cleaned before we try to isolate it
> +	 * again.
> +	 */
> +	ret = LRU_ROTATE;
> +	if (XFS_DQ_IS_DIRTY(dqp) || atomic_read(&dqp->q_pincount) > 0) {
> +		goto out_miss_unlock;
> +	}
> +
>  	/*
>  	 * This dquot has acquired a reference in the meantime remove it from
>  	 * the freelist and try again.
> @@ -492,41 +504,14 @@ xfs_qm_dquot_isolate(
>  	}
>  
>  	/*
> -	 * If the dquot is dirty, flush it. If it's already being flushed, just
> -	 * skip it so there is time for the IO to complete before we try to
> -	 * reclaim it again on the next LRU pass.
> +	 * The dquot may still be under IO, in which case the flush lock will be
> +	 * held. If we can't get the flush lock now, just skip over the dquot as
> +	 * if it was dirty.
>  	 */
>  	if (!xfs_dqflock_nowait(dqp))
>  		goto out_miss_unlock;
>  
> -	if (XFS_DQ_IS_DIRTY(dqp)) {
> -		struct xfs_buf	*bp = NULL;
> -		int		error;
> -
> -		trace_xfs_dqreclaim_dirty(dqp);
> -
> -		/* we have to drop the LRU lock to flush the dquot */
> -		spin_unlock(&lru->lock);
> -
> -		error = xfs_dquot_use_attached_buf(dqp, &bp);
> -		if (!bp || error == -EAGAIN) {
> -			xfs_dqfunlock(dqp);
> -			goto out_unlock_dirty;
> -		}
> -
> -		/*
> -		 * dqflush completes dqflock on error, and the delwri ioend
> -		 * does it on success.
> -		 */
> -		error = xfs_qm_dqflush(dqp, bp);
> -		if (error)
> -			goto out_unlock_dirty;
> -
> -		xfs_buf_delwri_queue(bp, &isol->buffers);
> -		xfs_buf_relse(bp);
> -		goto out_unlock_dirty;
> -	}
> -
> +	ASSERT(!XFS_DQ_IS_DIRTY(dqp));
>  	xfs_dquot_detach_buf(dqp);
>  	xfs_dqfunlock(dqp);
>  
> @@ -548,13 +533,7 @@ xfs_qm_dquot_isolate(
>  out_miss_busy:
>  	trace_xfs_dqreclaim_busy(dqp);
>  	XFS_STATS_INC(dqp->q_mount, xs_qm_dqreclaim_misses);
> -	return LRU_SKIP;
> -
> -out_unlock_dirty:
> -	trace_xfs_dqreclaim_busy(dqp);
> -	XFS_STATS_INC(dqp->q_mount, xs_qm_dqreclaim_misses);
> -	xfs_dqunlock(dqp);
> -	return LRU_RETRY;
> +	return ret;
>  }
>  
>  static unsigned long
> @@ -1486,7 +1465,6 @@ xfs_qm_flush_one(
>  	struct xfs_dquot	*dqp,
>  	void			*data)
>  {
> -	struct xfs_mount	*mp = dqp->q_mount;
>  	struct list_head	*buffer_list = data;
>  	struct xfs_buf		*bp = NULL;
>  	int			error = 0;
> @@ -1497,34 +1475,8 @@ xfs_qm_flush_one(
>  	if (!XFS_DQ_IS_DIRTY(dqp))
>  		goto out_unlock;
>  
> -	/*
> -	 * The only way the dquot is already flush locked by the time quotacheck
> -	 * gets here is if reclaim flushed it before the dqadjust walk dirtied
> -	 * it for the final time. Quotacheck collects all dquot bufs in the
> -	 * local delwri queue before dquots are dirtied, so reclaim can't have
> -	 * possibly queued it for I/O. The only way out is to push the buffer to
> -	 * cycle the flush lock.
> -	 */
> -	if (!xfs_dqflock_nowait(dqp)) {
> -		/* buf is pinned in-core by delwri list */
> -		error = xfs_buf_incore(mp->m_ddev_targp, dqp->q_blkno,
> -				mp->m_quotainfo->qi_dqchunklen, 0, &bp);
> -		if (error)
> -			goto out_unlock;
> -
> -		if (!(bp->b_flags & _XBF_DELWRI_Q)) {
> -			error = -EAGAIN;
> -			xfs_buf_relse(bp);
> -			goto out_unlock;
> -		}
> -		xfs_buf_unlock(bp);
> -
> -		xfs_buf_delwri_pushbuf(bp, buffer_list);
> -		xfs_buf_rele(bp);
> -
> -		error = -EAGAIN;
> -		goto out_unlock;
> -	}
> +	xfs_qm_dqunpin_wait(dqp);
> +	xfs_dqflock(dqp);
>  
>  	error = xfs_dquot_use_attached_buf(dqp, &bp);
>  	if (error)
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 01d284a1c759..9f0d6bc966b7 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -778,7 +778,6 @@ DEFINE_BUF_EVENT(xfs_buf_iowait_done);
>  DEFINE_BUF_EVENT(xfs_buf_delwri_queue);
>  DEFINE_BUF_EVENT(xfs_buf_delwri_queued);
>  DEFINE_BUF_EVENT(xfs_buf_delwri_split);
> -DEFINE_BUF_EVENT(xfs_buf_delwri_pushbuf);
>  DEFINE_BUF_EVENT(xfs_buf_get_uncached);
>  DEFINE_BUF_EVENT(xfs_buf_item_relse);
>  DEFINE_BUF_EVENT(xfs_buf_iodone_async);
> --
> 2.45.2
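As a side note for anyone reading the change without the full tree handy, the ordering this establishes in xfs_qm_dqpurge() and xfs_qm_flush_one() boils down to the sketch below. It is schematic only: the error handling, FREEING flag management and delwri queueing of the real functions are elided, and the function names are the ones used in the quoted diff.

	/*
	 * Schematic of the fixed ordering. The dquot lock is already held,
	 * so nothing can add to the pin count once xfs_qm_dqunpin_wait()
	 * returns, and the flush lock is only taken on an unpinned dquot.
	 * xfs_qm_dqflush() therefore no longer needs to issue a log force
	 * itself and can assert a zero pin count instead.
	 */
	struct xfs_buf	*bp = NULL;
	int		error;

	xfs_qm_dqunpin_wait(dqp);	/* wait for the dquot pin count to drain */
	xfs_dqflock(dqp);		/* safe: pin count cannot grow under the dquot lock */
	error = xfs_dquot_use_attached_buf(dqp, &bp);
	if (!error && bp)
		error = xfs_qm_dqflush(dqp, bp);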