Re: [PATCH RFC] xfs: Don't hold XFS_ILOCK_SHARED over log force during fsync

Jan Kara <jack@xxxxxxx> · Wed, 10 Sep 2025 11:05:18 +0200

On Tue 09-09-25 10:18:59, Dave Chinner wrote:
> On Mon, Sep 08, 2025 at 05:12:49PM +0200, Jan Kara wrote:
> > Holding XFS_ILOCK_SHARED over log force in xfs_fsync_flush_log()
> > significantly increases contention on ILOCK for O_DSYNC | O_DIRECT
> > writes to file preallocated with fallocate (thus DIO happens to
> > unwritten extents and we need ILOCK in exclusive mode for timestamp
> > modifications and extent conversions). But holding ILOCK over the log
> > force doesn't seem strictly necessary for correctness.
> 
> That was introduced a long while back in 2015 when the
> ili_fsync_fields flags were introduced to optimise O_DSYNC to avoid
> timestamp updates from causing log forces. That was commit
> fc0561cefc04 ("xfs: optimise away log forces on timestamp updates for
> fdatasync").
> 
> > We are just using
> > it for a mechanism to make sure parallel fsyncs all wait for log force
> > to complete but that can be also achieved without holding ILOCK.
> 
> Not exactly. It was used to ensure that we correctly serialised
> the setting of newly dirty fsync flags against the log force that
> allows us to clearing the existing fsync flags. It requires the
> ILOCK_EXCL to set new flags, hence holding the ILOCK_SHARED was
> sufficient to acheive this.
> 
> At the time, the ili_lock did not exist (introduced in 2020), so the
> only way to serialise inode log item flags updates was to use the
> ILOCK. But now with the ili_lock, we can update fields safely
> without holding the ILOCK...
> 
> > With this patch DB2 database restore operation speeds up by a factor of
> > about 2.5x in a VM with 4 CPUs, 16GB of RAM and NVME SSD as a backing
> > store.
> 
> *nod*
> 
> > Signed-off-by: Jan Kara <jack@xxxxxxx>
> > ---
> >  fs/xfs/xfs_file.c       | 33 ++++++++++++++++++++++-----------
> >  fs/xfs/xfs_inode_item.c |  1 +
> >  fs/xfs/xfs_inode_item.h |  1 +
> >  3 files changed, 24 insertions(+), 11 deletions(-)
> > 
> > I've chosen adding ili_fsync_flushing_fields to xfs_inode_log_item since that
> > seemed in line with how the code is structured. Arguably that is unnecessarily
> > wasteful since in practice we use just one bit of information from
> > ili_fsync_fields and one bit from ili_fsync_flushing_fields. If people prefer
> > more space efficient solution, I can certainly do that.
> 
> So you are trying to mirror the ili_fields -> ili_last_fields
> behaviour w.r.t. inode flushing?

Kind of yes.

> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index b04c59d87378..2bb793c8c179 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -80,9 +80,13 @@ xfs_fsync_seq(
> >  	struct xfs_inode	*ip,
> >  	bool			datasync)
> >  {
> > +	unsigned int sync_fields;
> > +
> >  	if (!xfs_ipincount(ip))
> >  		return 0;
> > -	if (datasync && !(ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> > +	sync_fields = ip->i_itemp->ili_fsync_fields |
> > +		      ip->i_itemp->ili_fsync_flushing_fields;
> 
> Racy read - the new ili_fsync_flushing_fields field is protected by
> ili_lock, not ILOCK....

Yup, luckily easy to fix.

> > +	if (datasync && !(sync_fields & ~XFS_ILOG_TIMESTAMP))
> >  		return 0;
> >  	return ip->i_itemp->ili_commit_seq;
> >  }
> 
> These xfs_fsync_seq() checks are the real reason we need to hold the
> ILOCK for fsync correctness.
> 
> The ili_fsync_fields, pin count and the ili_commit_seq are updated
> as part of the transaction commit processing. The only lock they
> share across this context is the ILOCK_EXCL, and the update order is
> this:
> 
> __xfs_trans_commit
>   xfs_trans_run_precommits
>     ->iop_precommit
>       xfs_inode_item_precommit		>>>> updates ili_fsync_fields
>   xlog_cil_commit
>     xlog_cil_insert_items
>       xlog_cil_insert_format_items
>         xfs_cil_prepare_item
>           ->iop_pin
> 	    xfs_inode_item_pin()	>>>> pin count increment
>     ->iop_committing(seq)
>       xfs_inode_item_committing(seq)    >>>> writes ili_commit_seq
>         xfs_inode_item_release()
> 	  xfs_iunlock(ILOCK_EXCL)	>>>> inode now unlocked
> 
> 
> Hence in xfs_file_fsync(), we have to hold the ILOCK_SHARED to
> determine if the log force needs to be done.
> 
> The ILOCK was extended to cover the log force in commit fc0561cefc04
> because we needed to clear the ili_fsync_fields after the log force,
> but it had to be done in a way that avoided racing with setting new
> fsync bits in the transaction commit. Using the ILOCK was the only
> way to do that, so the lock was then held across the log force...
> 
> Ok, so it look slike it is safe to drop the ILOCK across the log
> force as long as we have some other way to ensure we don't drop
> newly dirtied fsync fields on the ground.

Thanks for the details!

> > @@ -112,14 +117,20 @@ xfs_fsync_flush_log(
> >  	xfs_ilock(ip, XFS_ILOCK_SHARED);
> >  	seq = xfs_fsync_seq(ip, datasync);
> >  	if (seq) {
> > +		spin_lock(&ip->i_itemp->ili_lock);
> > +		ip->i_itemp->ili_fsync_flushing_fields =
> > +						ip->i_itemp->ili_fsync_fields;
> > +		ip->i_itemp->ili_fsync_fields = 0;
> > +		spin_unlock(&ip->i_itemp->ili_lock);
> 
> This is racy. if we get three fdatasync()s at the same time, they
> can do:
> 
> t0			t1			t2
> 
> fsync = ILOG_CORE
> flushing = ILOG_CORE
> fsync = 0
> <log force>
> 			xfs_fsync_seq
> 			  fields = ILOG_CORE
> 			fsync = 0
> 			flushing = 0
> 			fsync = 0
> 			<log force>
> 						xfs_fsync_seq
> 						  fields = 0
> 						<skips datasync>
> ......
> <log force completes>
> 
> In this case t2 should have waited on the log force like t0 and t1,
> but the lack of dirty fsync fields has allowed it to skip them.
> 
> We avoid these problems with ili_fields/ili_last_fields by always
> ORing the ili_last_fields back into the ili_fields whenever we
> update it. This means we don't lose bits that were set by previous
> operations that are still in flight.

Good spotting!

> > +		xfs_iunlock(ip, XFS_ILOCK_SHARED);
> >  		error = xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC,
> >  					  log_flushed);
> > -
> >  		spin_lock(&ip->i_itemp->ili_lock);
> > -		ip->i_itemp->ili_fsync_fields = 0;
> > +		ip->i_itemp->ili_fsync_flushing_fields = 0;
> >  		spin_unlock(&ip->i_itemp->ili_lock);
> > +	} else {
> > +		xfs_iunlock(ip, XFS_ILOCK_SHARED);
> >  	}
> > -	xfs_iunlock(ip, XFS_ILOCK_SHARED);
> >  	return error;
> >  }
> 
> Hence it seems better to reintegrate xfs_fsync_seq() because it
> cleans up the locking and logic. I'd also call it
> "ili_last_fsync_fields" to match the naming that the inode flushing
> code uses (i.e. ili_last_fields):
> 
> 	struct xfs_inode_log_item *iip = ip->i_itemp;
> 	unsigned int		sync_fields;
> 	xfs_csn_t		seq = 0;
> 
> 	xfs_ilock(ip, XFS_ILOCK_SHARED);
> 	if (!xfs_ipincount(ip)) {
> 		xfs_iunlock(ip, XFS_ILOCK_SHARED);
> 		return 0;
> 	}
> 
> 	spin_lock(&iip->ili_lock);
> 	sync_fields = iip->ili_fsync_fields | iip->ili_last_fsync_fields;
> 
> 	/*
> 	 * Don't force the log for O_DSYNC operations on inodes that
> 	 * only have dirty timestamps. Timestamps are not necessary
> 	 * for data integrity so we can skip them in this case.
> 	 */
> 	if (!datasync || (sync_fields & ~XFS_ILOG_TIMESTAMP))
> 		seq = iip->ili_commit_seq;
> 		iip->ili_last_fsync_fields |= iip->ili_fsync_fields;
> 		iip->ili_fsync_fields = 0;
> 	}
> 	spin_unlock(&iip->ili_lock);
> 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
> 
> 	if (!seq)
> 		return 0;
> 
> 	error = xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC,
> 				  log_flushed);
> 	spin_lock(&iip->ili_lock);
> 	iip->ili_last_fsync_fields = 0;
> 	spin_unlock(&iip->ili_lock);
> 
> 	return error;

After some thinking... But isn't this still racy in a different way? Like:

t0			t1			t2

fsync = ILOG_CORE
last_fsync = ILOG_CORE
fsync = 0
<log force lsn1>
			modifies file - this is now possible because we
			  don't hold ILOCK anymore
			calls fsync
			so:
			fsync = ILOG_CORE
			last_fsync = ILOG_CORE
			fsync = 0
			<log force lsn2>
......
<log force lsn1 completes>
last_fsync = 0
						fsync
						fsync == 0 && last_fsync == 0 =>
						  <skips fsync>

Hence t2 didn't wait until log force of lsn2 completed which looks like a
bug.

The problem is last_fsync was cleared too early. We'd need to clear it only
once the latest log force for the inode has finished (which presumably
doable by remembering the latest forced seq in xfs_inode_log_item).

But I'm wondering whether we aren't overcomplicating this. Cannot we just
instead of ili_fsync_fields maintain the latest seq when the inode has been
modified in a way relevant for fdatasync? Let's call that
ili_datasync_commit_seq. Then in case of datasync we'll call
xfs_log_force_seq() for ili_datasync_commit_seq and in case of fsync for
ili_commit_seq. No need for complex clearing of ili_fsync_fields... The
only thing I'm not sure about is how to best handle the ili_fsync_fields
check in xfs_bmbt_to_iomap() - I haven't found a function to check whether a
particular seq has been already committed or not.

Tentative patch attached.

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR