On Thu, Jun 26, 2025 at 08:49:00AM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@xxxxxxxxxx> > > Unmount of a shutdown filesystem can hang with stale inode cluster > buffers in the AIL like so: > > [95964.140623] Call Trace: > [95964.144641] __schedule+0x699/0xb70 > [95964.154003] schedule+0x64/0xd0 > [95964.156851] xfs_ail_push_all_sync+0x9b/0xf0 > [95964.164816] xfs_unmount_flush_inodes+0x41/0x70 > [95964.168698] xfs_unmountfs+0x7f/0x170 > [95964.171846] xfs_fs_put_super+0x3b/0x90 > [95964.175216] generic_shutdown_super+0x77/0x160 > [95964.178060] kill_block_super+0x1b/0x40 > [95964.180553] xfs_kill_sb+0x12/0x30 > [95964.182796] deactivate_locked_super+0x38/0x100 > [95964.185735] deactivate_super+0x41/0x50 > [95964.188245] cleanup_mnt+0x9f/0x160 > [95964.190519] __cleanup_mnt+0x12/0x20 > [95964.192899] task_work_run+0x89/0xb0 > [95964.195221] resume_user_mode_work+0x4f/0x60 > [95964.197931] syscall_exit_to_user_mode+0x76/0xb0 > [95964.201003] do_syscall_64+0x74/0x130 > > Tracing on a hung system shows the AIL repeating every 50ms a log > force followed by an attempt to push pinned, aborted inodes from the > AIL (trimmed for brevity): > > xfs_log_force: lsn 0x1c caller xfsaild+0x18e > xfs_log_force: lsn 0x0 caller xlog_cil_flush+0xbd > xfs_log_force: lsn 0x1c caller xfs_log_force+0x77 > xfs_ail_pinned: lip 0xffff88826014afa0 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED > xfs_ail_pinned: lip 0xffff88814000a708 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED > xfs_ail_pinned: lip 0xffff88810b850c80 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED > xfs_ail_pinned: lip 0xffff88810b850af0 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED > xfs_ail_pinned: lip 0xffff888165cf0a28 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED > xfs_ail_pinned: lip 0xffff88810b850bb8 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED > .... > > The inode log items are marked as aborted, which means that either: > > a) a transaction commit has occurred, seen an error or shutdown, and > called xfs_trans_free_items() to abort the items. This should happen > before any pinning of log items occurs. > > or > > b) a dirty transaction has been cancelled. This should also happen > before any pinning of log items occurs. > > or > > c) AIL insertion at journal IO completion is marked as aborted. In > this case, the log item is pinned by the CIL until journal IO > completes and hence needs to be unpinned. This is then done after > the ->iop_committed() callback is run, so the pin count should be > balanced correctly. > > Yet none of these seemed to be occurring. Further tracing indicated > this: > > d) Shutdown during CIL pushing resulting in log item completion > being called from checkpoint abort processing. Items are unpinned > and released without serialisation against each other, journal IO > completion or transaction commit completion. > > In this case, we may still have a transaction commit in flight that > holds a reference to a xfs_buf_log_item (BLI) after CIL insertion. > e.g. a synchronous transaction will flush the CIL before the > transaction is torn down. The concurrent CIL push then aborts > insertion it and drops the commit/AIL reference to the BLI. This can > leave the transaction commit context with the last reference to the > BLI which is dropped here: > > xfs_trans_free_items() > ->iop_release > xfs_buf_item_release > xfs_buf_item_put > if (XFS_LI_ABORTED) > xfs_trans_ail_delete > xfs_buf_item_relse() > > Unlike the journal completion ->iop_unpin path, this path does not > run stale buffer completion process when it drops the last > reference, hence leaving the stale inodes attached to the buffer > sitting the AIL. There are no other references to those inodes, so > there is no other mechanism to remove them from the AIL. Hence > unmount hangs. > > The buffer lock context for stale buffers is passed to the last BLI > reference. This is normally the last BLI unpin on journal IO > completion. The unpin then processes the stale buffer completion and > releases the buffer lock. However, if the final unpin from journal > IO completion (or CIL push abort) does not hold the last reference > to the BLI, there -must- still be a transaction context that > references the BLI, and so that context must perform the stale > buffer completion processing before the buffer is unlocked and the > BLI torn down. > > The fix for this is to rework the xfs_buf_item_relse() path to run > stale buffer completion processing if it drops the last reference to > the BLI. We still hold the buffer locked, so the buffer owner and > lock context is the same as if we passed the BLI and buffer to the > ->iop_unpin() context to finish stale process on journal commit. > > However, we have to be careful here. In a shutdown state, we can be > freeing dirty BLIs from xfs_buf_item_put() via xfs_trans_brelse() > and xfs_trans_bdetach(). The existing code handles this case by > considering shutdown state as "aborted", but in doing so > largely masks the failure to clean up stale BLI state from the > xfs_buf_item_relse() path. i.e regardless of the shutdown state and > whether the item is in the AIL, we must finish the stale buffer > cleanup if we are are dropping the last BLI reference from the > ->iop_relse path in transaction commit context. > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> > --- > fs/xfs/xfs_buf_item.c | 121 +++++++++++++++++++++++++++++------------- > fs/xfs/xfs_buf_item.h | 2 +- > 2 files changed, 86 insertions(+), 37 deletions(-) > > diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c > index c95826863c82..7fc54725c5f6 100644 > --- a/fs/xfs/xfs_buf_item.c > +++ b/fs/xfs/xfs_buf_item.c > @@ -612,43 +612,42 @@ xfs_buf_item_push( > * Drop the buffer log item refcount and take appropriate action. This helper > * determines whether the bli must be freed or not, since a decrement to zero > * does not necessarily mean the bli is unused. > - * > - * Return true if the bli is freed, false otherwise. > */ > -bool > +void > xfs_buf_item_put( > struct xfs_buf_log_item *bip) > { > - struct xfs_log_item *lip = &bip->bli_item; > - bool aborted; > - bool dirty; > + > + ASSERT(xfs_buf_islocked(bip->bli_buf)); > > /* drop the bli ref and return if it wasn't the last one */ > if (!atomic_dec_and_test(&bip->bli_refcount)) > - return false; > + return; > > - /* > - * We dropped the last ref and must free the item if clean or aborted. > - * If the bli is dirty and non-aborted, the buffer was clean in the > - * transaction but still awaiting writeback from previous changes. In > - * that case, the bli is freed on buffer writeback completion. > - */ > - aborted = test_bit(XFS_LI_ABORTED, &lip->li_flags) || > - xlog_is_shutdown(lip->li_log); > - dirty = bip->bli_flags & XFS_BLI_DIRTY; > - if (dirty && !aborted) > - return false; > + /* If the BLI is in the AIL, then it is still dirty and in use */ > + if (test_bit(XFS_LI_IN_AIL, &bip->bli_item.li_flags)) { > + ASSERT(bip->bli_flags & XFS_BLI_DIRTY); > + return; > + } > > /* > - * The bli is aborted or clean. An aborted item may be in the AIL > - * regardless of dirty state. For example, consider an aborted > - * transaction that invalidated a dirty bli and cleared the dirty > - * state. > + * In shutdown conditions, we can be asked to free a dirty BLI that > + * isn't in the AIL. This can occur due to a checkpoint aborting a BLI > + * instead of inserting it into the AIL at checkpoint IO completion. If > + * there's another bli reference (e.g. a btree cursor holds a clean > + * reference) and it is released via xfs_trans_brelse(), we can get here > + * with that aborted, dirty BLI. In this case, it is safe to free the > + * dirty BLI immediately, as it is not in the AIL and there are no > + * other references to it. > + * > + * We should never get here with a stale BLI via that path as > + * xfs_trans_brelse() specifically holds onto stale buffers rather than > + * releasing them. > */ > - if (aborted) > - xfs_trans_ail_delete(lip, 0); > + ASSERT(!(bip->bli_flags & XFS_BLI_DIRTY) || > + test_bit(XFS_LI_ABORTED, &bip->bli_item.li_flags)); > + ASSERT(!(bip->bli_flags & XFS_BLI_STALE)); > xfs_buf_item_relse(bip); > - return true; > } > > /* > @@ -669,6 +668,15 @@ xfs_buf_item_put( > * if necessary but do not unlock the buffer. This is for support of > * xfs_trans_bhold(). Make sure the XFS_BLI_HOLD field is cleared if we don't > * free the item. > + * > + * If the XFS_BLI_STALE flag is set, the last reference to the BLI *must* > + * perform a completion abort of any objects attached to the buffer for IO > + * tracking purposes. This generally only happens in shutdown situations, > + * normally xfs_buf_item_unpin() will drop the last BLI reference and perform > + * completion processing. However, because transaction completion can race with > + * checkpoint completion during a shutdown, this release context may end up > + * being the last active reference to the BLI and so needs to perform this > + * cleanup. > */ > STATIC void > xfs_buf_item_release( > @@ -676,18 +684,19 @@ xfs_buf_item_release( > { > struct xfs_buf_log_item *bip = BUF_ITEM(lip); > struct xfs_buf *bp = bip->bli_buf; > - bool released; > bool hold = bip->bli_flags & XFS_BLI_HOLD; > bool stale = bip->bli_flags & XFS_BLI_STALE; > -#if defined(DEBUG) || defined(XFS_WARN) > - bool ordered = bip->bli_flags & XFS_BLI_ORDERED; > - bool dirty = bip->bli_flags & XFS_BLI_DIRTY; > bool aborted = test_bit(XFS_LI_ABORTED, > &lip->li_flags); > + bool dirty = bip->bli_flags & XFS_BLI_DIRTY; > +#if defined(DEBUG) || defined(XFS_WARN) > + bool ordered = bip->bli_flags & XFS_BLI_ORDERED; > #endif > > trace_xfs_buf_item_release(bip); > > + ASSERT(xfs_buf_islocked(bp)); > + > /* > * The bli dirty state should match whether the blf has logged segments > * except for ordered buffers, where only the bli should be dirty. > @@ -703,16 +712,56 @@ xfs_buf_item_release( > bp->b_transp = NULL; > bip->bli_flags &= ~(XFS_BLI_LOGGED | XFS_BLI_HOLD | XFS_BLI_ORDERED); > > + /* If there are other references, then we have nothing to do. */ > + if (!atomic_dec_and_test(&bip->bli_refcount)) > + goto out_release; > + > /* > - * Unref the item and unlock the buffer unless held or stale. Stale > - * buffers remain locked until final unpin unless the bli is freed by > - * the unref call. The latter implies shutdown because buffer > - * invalidation dirties the bli and transaction. > + * Stale buffer completion frees the BLI, unlocks and releases the > + * buffer. Neither the BLI or buffer are safe to reference after this > + * call, so there's nothing more we need to do here. > + * > + * If we get here with a stale buffer and references to the BLI remain, > + * we must not unlock the buffer as the last BLI reference owns lock > + * context, not us. > */ > - released = xfs_buf_item_put(bip); > - if (hold || (stale && !released)) > + if (stale) { > + xfs_buf_item_finish_stale(bip); > + xfs_buf_relse(bp); > + ASSERT(!hold); > + return; > + } > + > + /* > + * Dirty or clean, aborted items are done and need to be removed from > + * the AIL and released. This frees the BLI, but leaves the buffer > + * locked and referenced. > + */ > + if (aborted || xlog_is_shutdown(lip->li_log)) { > + ASSERT(list_empty(&bip->bli_buf->b_li_list)); > + xfs_buf_item_done(bp); > + goto out_release; > + } > + > + /* > + * Clean, unreferenced BLIs can be immediately freed, leaving the buffer > + * locked and referenced. > + * > + * Dirty, unreferenced BLIs *must* be in the AIL awaiting writeback. > + */ > + if (!dirty) > + xfs_buf_item_relse(bip); > + else > + ASSERT(test_bit(XFS_LI_IN_AIL, &lip->li_flags)); > + > + /* Not safe to reference the BLI from here */ > +out_release: > + /* > + * If we get here with a stale buffer, we must not unlock the > + * buffer as the last BLI reference owns lock context, not us. > + */ > + if (stale || hold) > return; > - ASSERT(!stale || aborted); > xfs_buf_relse(bp); > } > > diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h > index 50dd79b59cf5..416890b84f8c 100644 > --- a/fs/xfs/xfs_buf_item.h > +++ b/fs/xfs/xfs_buf_item.h > @@ -49,7 +49,7 @@ struct xfs_buf_log_item { > > int xfs_buf_item_init(struct xfs_buf *, struct xfs_mount *); > void xfs_buf_item_done(struct xfs_buf *bp); > -bool xfs_buf_item_put(struct xfs_buf_log_item *); > +void xfs_buf_item_put(struct xfs_buf_log_item *bip); > void xfs_buf_item_log(struct xfs_buf_log_item *, uint, uint); > bool xfs_buf_item_dirty_format(struct xfs_buf_log_item *); > void xfs_buf_inode_iodone(struct xfs_buf *); > -- > 2.45.2 > > Reviewed-by: Carlos Maiolino <cmaiolino@xxxxxxxxxx>