On Mon, Sep 08, 2025 at 05:12:49PM +0200, Jan Kara wrote:
> Holding XFS_ILOCK_SHARED over log force in xfs_fsync_flush_log()
> significantly increases contention on ILOCK for O_DSYNC | O_DIRECT
> writes to file preallocated with fallocate (thus DIO happens to
> unwritten extents and we need ILOCK in exclusive mode for timestamp
> modifications and extent conversions). But holding ILOCK over the log
> force doesn't seem strictly necessary for correctness.

That was introduced a long while back in 2015 when the ili_fsync_fields
flags were added to optimise O_DSYNC and avoid timestamp updates causing
log forces. That was commit fc0561cefc04 ("xfs: optimise away log forces
on timestamp updates for fdatasync").

> We are just using
> it for a mechanism to make sure parallel fsyncs all wait for log force
> to complete but that can be also achieved without holding ILOCK.

Not exactly. It was used to ensure that we correctly serialised the
setting of newly dirty fsync flags against the log force that allows us
to clear the existing fsync flags. It requires ILOCK_EXCL to set new
flags, hence holding ILOCK_SHARED was sufficient to achieve this.

At the time, the ili_lock did not exist (it was introduced in 2020), so
the only way to serialise inode log item flag updates was to use the
ILOCK. But now with the ili_lock, we can update these fields safely
without holding the ILOCK...

> With this patch DB2 database restore operation speeds up by a factor of
> about 2.5x in a VM with 4 CPUs, 16GB of RAM and NVME SSD as a backing
> store.

*nod*

> Signed-off-by: Jan Kara <jack@xxxxxxx>
> ---
>  fs/xfs/xfs_file.c       | 33 ++++++++++++++++++++++-----------
>  fs/xfs/xfs_inode_item.c |  1 +
>  fs/xfs/xfs_inode_item.h |  1 +
>  3 files changed, 24 insertions(+), 11 deletions(-)
>
> I've chosen adding ili_fsync_flushing_fields to xfs_inode_log_item since that
> seemed in line with how the code is structured. Arguably that is unnecessarily
> wasteful since in practice we use just one bit of information from
> ili_fsync_fields and one bit from ili_fsync_flushing_fields. If people prefer
> more space efficient solution, I can certainly do that.

So you are trying to mirror the ili_fields -> ili_last_fields behaviour
w.r.t. inode flushing?

> This is marked as RFC because I'm not quite sure I didn't miss some subtlety
> in XFS logging mechanisms and this will crash & burn badly in some corner
> case (but fstests seem to be passing fine for me).

Yeah, that's the issue, isn't it?

> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index b04c59d87378..2bb793c8c179 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -80,9 +80,13 @@ xfs_fsync_seq(
>  	struct xfs_inode	*ip,
>  	bool			datasync)
>  {
> +	unsigned int		sync_fields;
> +
>  	if (!xfs_ipincount(ip))
>  		return 0;
> -	if (datasync && !(ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> +	sync_fields = ip->i_itemp->ili_fsync_fields |
> +		      ip->i_itemp->ili_fsync_flushing_fields;

Racy read - the new ili_fsync_flushing_fields field is protected by
ili_lock, not ILOCK....

> +	if (datasync && !(sync_fields & ~XFS_ILOG_TIMESTAMP))
>  		return 0;
>  	return ip->i_itemp->ili_commit_seq;
>  }

These xfs_fsync_seq() checks are the real reason we need to hold the
ILOCK for fsync correctness. The ili_fsync_fields, pin count and the
ili_commit_seq are updated as part of the transaction commit processing.
The only lock they share across this context is the ILOCK_EXCL, and the
update order is this:

__xfs_trans_commit
  xfs_trans_run_precommits
    ->iop_precommit
      xfs_inode_item_precommit
        >>>> updates ili_fsync_fields
  xlog_cil_commit
    xlog_cil_insert_items
      xlog_cil_insert_format_items
        xfs_cil_prepare_item
          ->iop_pin
            xfs_inode_item_pin()
              >>>> pin count increment
    ->iop_committing(seq)
      xfs_inode_item_committing(seq)
        >>>> writes ili_commit_seq
  xfs_inode_item_release()
    xfs_iunlock(ILOCK_EXCL)
      >>>> inode now unlocked

Hence in xfs_file_fsync(), we have to hold the ILOCK_SHARED to determine
if the log force needs to be done.

The ILOCK was extended to cover the log force in commit fc0561cefc04
because we needed to clear the ili_fsync_fields after the log force, but
it had to be done in a way that avoided racing with setting new fsync
bits in the transaction commit. Using the ILOCK was the only way to do
that, so the lock was then held across the log force...

Ok, so it looks like it is safe to drop the ILOCK across the log force
as long as we have some other way to ensure we don't drop newly dirtied
fsync fields on the ground.

> @@ -92,13 +96,14 @@ xfs_fsync_seq(
>   * log up to the latest LSN that touched the inode.
>   *
>   * If we have concurrent fsync/fdatasync() calls, we need them to all block on
> - * the log force before we clear the ili_fsync_fields field. This ensures that
> - * we don't get a racing sync operation that does not wait for the metadata to
> - * hit the journal before returning. If we race with clearing ili_fsync_fields,
> - * then all that will happen is the log force will do nothing as the lsn will
> - * already be on disk. We can't race with setting ili_fsync_fields because that
> - * is done under XFS_ILOCK_EXCL, and that can't happen because we hold the lock
> - * shared until after the ili_fsync_fields is cleared.
> + * the log force until it is finished. Thus we clear ili_fsync_fields so that
> + * new modifications since starting log force can accumulate there and just
> + * save old ili_fsync_fields value to ili_fsync_flushing_fields so that
> + * concurrent fsyncs can use that to determine whether they need to wait for
> + * running log force or not. This ensures that we don't get a racing sync
> + * operation that does not wait for the metadata to hit the journal before
> + * returning. If we race with clearing ili_fsync_fields, then all that will
> + * happen is the log force will do nothing as the lsn will already be on disk.
>   */
>  static int
>  xfs_fsync_flush_log(
> @@ -112,14 +117,20 @@ xfs_fsync_flush_log(
>  	xfs_ilock(ip, XFS_ILOCK_SHARED);
>  	seq = xfs_fsync_seq(ip, datasync);
>  	if (seq) {
> +		spin_lock(&ip->i_itemp->ili_lock);
> +		ip->i_itemp->ili_fsync_flushing_fields =
> +					ip->i_itemp->ili_fsync_fields;
> +		ip->i_itemp->ili_fsync_fields = 0;
> +		spin_unlock(&ip->i_itemp->ili_lock);

This is racy. If we get three fdatasync()s at the same time, they can
do:

t0			t1			t2
fsync = ILOG_CORE
flushing = ILOG_CORE
fsync = 0
<log force>
			xfs_fsync_seq
			fields = ILOG_CORE
			fsync = 0
			flushing = 0
			fsync = 0
			<log force>
						xfs_fsync_seq
						fields = 0
						<skips datasync>
......
<log force completes>

In this case t2 should have waited on the log force like t0 and t1, but
the lack of dirty fsync fields has allowed it to skip the log force.

We avoid these problems with ili_fields/ili_last_fields by always ORing
the ili_last_fields back into the ili_fields whenever we update it. This
means we don't lose bits that were set by previous operations that are
still in flight.
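To make that concrete, here's a throwaway userspace sketch that replays
the t0/t1/t2 interleaving above. All the names in it (pending,
in_flight, handoff_*) are invented for illustration - "pending" stands
in for ili_fsync_fields and "in_flight" for the proposed flushing field
- and it only models the field handoff, not the locking or the log
force itself:

	/* Toy replay of the handoff race; compile with any C compiler. */
	#include <stdio.h>
	#include <stdbool.h>

	#define ILOG_CORE	0x1	/* pretend "inode core dirty" bit */

	struct toy_item {
		unsigned int	pending;	/* dirtied since last handoff */
		unsigned int	in_flight;	/* handed to a running log force */
	};

	/* Handoff as in the RFC patch: plain assignment. */
	static void handoff_assign(struct toy_item *ip)
	{
		ip->in_flight = ip->pending;	/* clobbers older in-flight bits */
		ip->pending = 0;
	}

	/* Handoff mirroring the ili_last_fields pattern: OR the bits in. */
	static void handoff_or(struct toy_item *ip)
	{
		ip->in_flight |= ip->pending;
		ip->pending = 0;
	}

	/* What a later fsync caller checks to decide whether to wait. */
	static bool must_wait(struct toy_item *ip)
	{
		return (ip->pending | ip->in_flight) != 0;
	}

	static void replay(const char *name, void (*handoff)(struct toy_item *))
	{
		struct toy_item ip = { .pending = ILOG_CORE };

		handoff(&ip);	/* t0: starts a log force, not yet complete */
		handoff(&ip);	/* t1: nothing newly dirty, hands off again */

		/* t2 arrives while t0's log force is still running. */
		printf("%s: t2 %s\n", name,
		       must_wait(&ip) ? "waits for the log force (correct)"
				      : "skips it (missed data integrity flush)");
	}

	int main(void)
	{
		replay("assign", handoff_assign);
		replay("or    ", handoff_or);
		return 0;
	}

The assignment version ends up with both fields zero after t1's handoff,
which is exactly the hole t2 falls through; the OR version keeps the
in-flight bit visible until the force that covers it has completed.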
> +		xfs_iunlock(ip, XFS_ILOCK_SHARED);
>  		error = xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC,
>  				log_flushed);
> -
>  		spin_lock(&ip->i_itemp->ili_lock);
> -		ip->i_itemp->ili_fsync_fields = 0;
> +		ip->i_itemp->ili_fsync_flushing_fields = 0;
>  		spin_unlock(&ip->i_itemp->ili_lock);
> +	} else {
> +		xfs_iunlock(ip, XFS_ILOCK_SHARED);
>  	}
> -	xfs_iunlock(ip, XFS_ILOCK_SHARED);
>  	return error;
>  }

Hence it seems better to reintegrate xfs_fsync_seq() because it cleans
up the locking and logic. I'd also call it "ili_last_fsync_fields" to
match the naming that the inode flushing code uses (i.e.
ili_last_fields):

	struct xfs_inode_log_item *iip = ip->i_itemp;
	unsigned int		sync_fields;
	xfs_csn_t		seq = 0;

	xfs_ilock(ip, XFS_ILOCK_SHARED);
	if (!xfs_ipincount(ip)) {
		xfs_iunlock(ip, XFS_ILOCK_SHARED);
		return 0;
	}

	spin_lock(&iip->ili_lock);
	sync_fields = iip->ili_fsync_fields | iip->ili_last_fsync_fields;

	/*
	 * Don't force the log for O_DSYNC operations on inodes that
	 * only have dirty timestamps. Timestamps are not necessary
	 * for data integrity so we can skip them in this case.
	 */
	if (!datasync || (sync_fields & ~XFS_ILOG_TIMESTAMP))
		seq = iip->ili_commit_seq;

	iip->ili_last_fsync_fields |= iip->ili_fsync_fields;
	iip->ili_fsync_fields = 0;
	spin_unlock(&iip->ili_lock);
	xfs_iunlock(ip, XFS_ILOCK_SHARED);

	if (!seq)
		return 0;

	error = xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC,
			log_flushed);

	spin_lock(&iip->ili_lock);
	iip->ili_last_fsync_fields = 0;
	spin_unlock(&iip->ili_lock);
	return error;

>
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 829675700fcd..39d15eb9311d 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -1056,6 +1056,7 @@ xfs_iflush_abort_clean(
>  	iip->ili_last_fields = 0;
>  	iip->ili_fields = 0;
>  	iip->ili_fsync_fields = 0;
> +	iip->ili_fsync_flushing_fields = 0;
>  	iip->ili_flush_lsn = 0;
>  	iip->ili_item.li_buf = NULL;
>  	list_del_init(&iip->ili_item.li_bio_list);

I also suspect that ili_last_fsync_fields needs to be cleared in
xfs_iflush() at the same time ili_fsync_fields is cleared. i.e.
flushing the inode for writeback absolutely clears the fsync status of
the inode because, by definition, the inode is not pinned in the journal
and so is not dirty in memory. Hence a fsync at that point should result
in a no-op....

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx