On Thu, Jun 26, 2025 at 08:48:54AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>

Reviewed-by: Carlos Maiolino <cmaiolino@xxxxxxxxxx>

>
> Lock order of xfs_ifree_cluster() is cluster buffer -> try ILOCK
> -> IFLUSHING, except for the last inode in the cluster that is
> triggering the free. In that case, the lock order is ILOCK ->
> cluster buffer -> IFLUSHING.
>
> xfs_iflush_cluster() uses cluster buffer -> try ILOCK -> IFLUSHING,
> so this can safely run concurrently with xfs_ifree_cluster().
>
> xfs_inode_item_precommit() uses ILOCK -> cluster buffer, but this
> cannot race with xfs_ifree_cluster() so being in a different order
> will not trigger a deadlock.
>
> xfs_reclaim_inode() during a filesystem shutdown uses ILOCK ->
> IFLUSHING -> cluster buffer via xfs_iflush_shutdown_abort(), and
> this deadlocks against xfs_ifree_cluster() like so:
>
> sysrq: Show Blocked State
> task:kworker/10:37 state:D stack:12560 pid:276182 tgid:276182 ppid:2 flags:0x00004000
> Workqueue: xfs-inodegc/dm-3 xfs_inodegc_worker
> Call Trace:
>  <TASK>
>  __schedule+0x650/0xb10
>  schedule+0x6d/0xf0
>  schedule_timeout+0x8b/0x180
>  schedule_timeout_uninterruptible+0x1e/0x30
>  xfs_ifree+0x326/0x730
>  xfs_inactive_ifree+0xcb/0x230
>  xfs_inactive+0x2c8/0x380
>  xfs_inodegc_worker+0xaa/0x180
>  process_scheduled_works+0x1d4/0x400
>  worker_thread+0x234/0x2e0
>  kthread+0x147/0x170
>  ret_from_fork+0x3e/0x50
>  ret_from_fork_asm+0x1a/0x30
>  </TASK>
> task:fsync-tester state:D stack:12160 pid:2255943 tgid:2255943 ppid:3988702 flags:0x00004006
> Call Trace:
>  <TASK>
>  __schedule+0x650/0xb10
>  schedule+0x6d/0xf0
>  schedule_timeout+0x31/0x180
>  __down_common+0xbe/0x1f0
>  __down+0x1d/0x30
>  down+0x48/0x50
>  xfs_buf_lock+0x3d/0xe0
>  xfs_iflush_shutdown_abort+0x51/0x1e0
>  xfs_icwalk_ag+0x386/0x690
>  xfs_reclaim_inodes_nr+0x114/0x160
>  xfs_fs_free_cached_objects+0x19/0x20
>  super_cache_scan+0x17b/0x1a0
>  do_shrink_slab+0x180/0x350
>  shrink_slab+0xf8/0x430
>  drop_slab+0x97/0xf0
>  drop_caches_sysctl_handler+0x59/0xc0
>  proc_sys_call_handler+0x189/0x280
>  proc_sys_write+0x13/0x20
>  vfs_write+0x33d/0x3f0
>  ksys_write+0x7c/0xf0
>  __x64_sys_write+0x1b/0x30
>  x64_sys_call+0x271d/0x2ee0
>  do_syscall_64+0x68/0x130
>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> We can't change the lock order of xfs_ifree_cluster() - XFS_ISTALE
> and XFS_IFLUSHING are serialised through to journal IO completion
> by the cluster buffer lock being held.
>
> There's quite a few asserts in the code that check that XFS_ISTALE
> does not occur out of sync with buffer locking (e.g. in
> xfs_iflush_cluster). There's also a dependency on the inode log item
> being removed from the buffer before XFS_IFLUSHING is cleared, also
> with asserts that trigger on this.
>
> Further, we don't have a requirement for the inode to be locked when
> completing or aborting inode flushing because all the inode state
> updates are serialised by holding the cluster buffer lock across the
> IO to completion.
>
> We can't check for XFS_IRECLAIM in xfs_ifree_mark_inode_stale() and
> skip the inode, because there is no guarantee that the inode will be
> reclaimed. Hence it *must* be marked XFS_ISTALE regardless of
> whether reclaim is preparing to free that inode. Similarly, we can't
> check for IFLUSHING before locking the inode because that would
> result in dirty inodes not being marked with ISTALE in the event of
> racing with XFS_IRECLAIM.
>
> Hence we have to address this issue from the xfs_reclaim_inode()
> side. It is clear that we cannot hold the inode locked here when
> calling xfs_iflush_shutdown_abort() because it is the inode->buffer
> lock order that causes the deadlock against xfs_ifree_cluster().
>
> Hence we need to drop the ILOCK before aborting the inode in the
> shutdown case. Once we've aborted the inode, we can grab the ILOCK
> again and then immediately reclaim it as it is now guaranteed to be
> clean.
>
> Note that dropping the ILOCK in xfs_reclaim_inode() means that it
> can now be locked by xfs_ifree_mark_inode_stale() and seen whilst in
> this state. This is safe because we have left the XFS_IFLUSHING flag
> on the inode and so xfs_ifree_mark_inode_stale() will simply set
> XFS_ISTALE and move to the next inode. An ASSERT check in this path
> needs to be tweaked to take into account this new shutdown
> interaction.
>
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> Reviewed-by: Christoph Hellwig <hch@xxxxxx>
> ---
>  fs/xfs/xfs_icache.c | 8 ++++++++
>  fs/xfs/xfs_inode.c  | 2 +-
>  2 files changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 726e29b837e6..bbc2f2973dcc 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -979,7 +979,15 @@ xfs_reclaim_inode(
>  	 */
>  	if (xlog_is_shutdown(ip->i_mount->m_log)) {
>  		xfs_iunpin_wait(ip);
> +		/*
> +		 * Avoid a ABBA deadlock on the inode cluster buffer vs
> +		 * concurrent xfs_ifree_cluster() trying to mark the inode
> +		 * stale. We don't need the inode locked to run the flush abort
> +		 * code, but the flush abort needs to lock the cluster buffer.
> +		 */
> +		xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  		xfs_iflush_shutdown_abort(ip);
> +		xfs_ilock(ip, XFS_ILOCK_EXCL);
>  		goto reclaim;
>  	}
>  	if (xfs_ipincount(ip))
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index ee3e0f284287..761a996a857c 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1635,7 +1635,7 @@ xfs_ifree_mark_inode_stale(
>  	iip = ip->i_itemp;
>  	if (__xfs_iflags_test(ip, XFS_IFLUSHING)) {
>  		ASSERT(!list_empty(&iip->ili_item.li_bio_list));
> -		ASSERT(iip->ili_last_fields);
> +		ASSERT(iip->ili_last_fields || xlog_is_shutdown(mp->m_log));
>  		goto out_iunlock;
>  	}
>
> --
> 2.45.2
>
>