On Mon, Sep 08, 2025 at 05:12:49PM +0200, Jan Kara wrote:
> Holding XFS_ILOCK_SHARED over log force in xfs_fsync_flush_log()
> significantly increases contention on ILOCK for O_DSYNC | O_DIRECT
> writes to file preallocated with fallocate (thus DIO happens to
> unwritten extents and we need ILOCK in exclusive mode for timestamp
> modifications and extent conversions). But holding ILOCK over the log
> force doesn't seem strictly necessary for correctness.

That was introduced a long while back in 2015 when the ili_fsync_fields
flags were added to optimise O_DSYNC and avoid timestamp updates causing
log forces. That was commit fc0561cefc04 ("xfs: optimise away log forces
on timestamp updates for fdatasync").

> We are just using
> it for a mechanism to make sure parallel fsyncs all wait for log force
> to complete but that can be also achieved without holding ILOCK.

Not exactly. It was used to ensure that we correctly serialised the
setting of newly dirty fsync flags against the log force that allows us
to clear the existing fsync flags. It requires ILOCK_EXCL to set new
flags, hence holding ILOCK_SHARED was sufficient to achieve this.

At the time, the ili_lock did not exist (it was introduced in 2020), so
the only way to serialise inode log item flag updates was to use the
ILOCK. But now with the ili_lock, we can update these fields safely
without holding the ILOCK...

> With this patch DB2 database restore operation speeds up by a factor of
> about 2.5x in a VM with 4 CPUs, 16GB of RAM and NVME SSD as a backing
> store.

*nod*

> Signed-off-by: Jan Kara <jack@xxxxxxx>
> ---
>  fs/xfs/xfs_file.c       | 33 ++++++++++++++++++++++-----------
>  fs/xfs/xfs_inode_item.c |  1 +
>  fs/xfs/xfs_inode_item.h |  1 +
>  3 files changed, 24 insertions(+), 11 deletions(-)
>
> I've chosen adding ili_fsync_flushing_fields to xfs_inode_log_item since that
> seemed in line with how the code is structured. Arguably that is unnecessarily
> wasteful since in practice we use just one bit of information from
> ili_fsync_fields and one bit from ili_fsync_flushing_fields. If people prefer
> more space efficient solution, I can certainly do that.

So you are trying to mirror the ili_fields -> ili_last_fields behaviour
w.r.t. inode flushing?

> This is marked as RFC because I'm not quite sure I didn't miss some subtlety
> in XFS logging mechanisms and this will crash & burn badly in some corner
> case (but fstests seem to be passing fine for me).

Yeah, that's the issue, isn't it?

> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index b04c59d87378..2bb793c8c179 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -80,9 +80,13 @@ xfs_fsync_seq(
>  	struct xfs_inode	*ip,
>  	bool			datasync)
>  {
> +	unsigned int		sync_fields;
> +
>  	if (!xfs_ipincount(ip))
>  		return 0;
> -	if (datasync && !(ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> +	sync_fields = ip->i_itemp->ili_fsync_fields |
> +		      ip->i_itemp->ili_fsync_flushing_fields;

Racy read - the new ili_fsync_flushing_fields field is protected by
ili_lock, not ILOCK....

> +	if (datasync && !(sync_fields & ~XFS_ILOG_TIMESTAMP))
>  		return 0;
>  	return ip->i_itemp->ili_commit_seq;
>  }

These xfs_fsync_seq() checks are the real reason we need to hold the
ILOCK for fsync correctness. The ili_fsync_fields, pin count and the
ili_commit_seq are updated as part of the transaction commit processing.
The only lock they share across this context is the ILOCK_EXCL, and the
update order is this:

__xfs_trans_commit
  xfs_trans_run_precommits
    ->iop_precommit
      xfs_inode_item_precommit
        >>>> updates ili_fsync_fields
  xlog_cil_commit
    xlog_cil_insert_items
      xlog_cil_insert_format_items
        xfs_cil_prepare_item
          ->iop_pin
            xfs_inode_item_pin()
              >>>> pin count increment
    ->iop_committing(seq)
      xfs_inode_item_committing(seq)
        >>>> writes ili_commit_seq
  xfs_inode_item_release()
    xfs_iunlock(ILOCK_EXCL)
      >>>> inode now unlocked

Hence in xfs_file_fsync(), we have to hold the ILOCK_SHARED to determine
if the log force needs to be done.

The ILOCK was extended to cover the log force in commit fc0561cefc04
because we needed to clear the ili_fsync_fields after the log force, but
it had to be done in a way that avoided racing with setting new fsync
bits in the transaction commit. Using the ILOCK was the only way to do
that, so the lock was then held across the log force...

Ok, so it looks like it is safe to drop the ILOCK across the log force
as long as we have some other way to ensure we don't drop newly dirtied
fsync fields on the ground.

> @@ -92,13 +96,14 @@ xfs_fsync_seq(
>   * log up to the latest LSN that touched the inode.
>   *
>   * If we have concurrent fsync/fdatasync() calls, we need them to all block on
> - * the log force before we clear the ili_fsync_fields field. This ensures that
> - * we don't get a racing sync operation that does not wait for the metadata to
> - * hit the journal before returning. If we race with clearing ili_fsync_fields,
> - * then all that will happen is the log force will do nothing as the lsn will
> - * already be on disk. We can't race with setting ili_fsync_fields because that
> - * is done under XFS_ILOCK_EXCL, and that can't happen because we hold the lock
> - * shared until after the ili_fsync_fields is cleared.
> + * the log force until it is finished. Thus we clear ili_fsync_fields so that
> + * new modifications since starting log force can accumulate there and just
> + * save old ili_fsync_fields value to ili_fsync_flushing_fields so that
> + * concurrent fsyncs can use that to determine whether they need to wait for
> + * running log force or not. This ensures that we don't get a racing sync
> + * operation that does not wait for the metadata to hit the journal before
> + * returning. If we race with clearing ili_fsync_fields, then all that will
> + * happen is the log force will do nothing as the lsn will already be on disk.
>   */
>  static int
>  xfs_fsync_flush_log(
> @@ -112,14 +117,20 @@ xfs_fsync_flush_log(
>  	xfs_ilock(ip, XFS_ILOCK_SHARED);
>  	seq = xfs_fsync_seq(ip, datasync);
>  	if (seq) {
> +		spin_lock(&ip->i_itemp->ili_lock);
> +		ip->i_itemp->ili_fsync_flushing_fields =
> +					ip->i_itemp->ili_fsync_fields;
> +		ip->i_itemp->ili_fsync_fields = 0;
> +		spin_unlock(&ip->i_itemp->ili_lock);

This is racy. If we get three fdatasync()s at the same time, they can
do:

t0			t1			t2
fsync = ILOG_CORE
flushing = ILOG_CORE
fsync = 0
<log force>
			xfs_fsync_seq
			fields = ILOG_CORE
			fsync = 0
			flushing = 0
			fsync = 0
			<log force>
						xfs_fsync_seq
						fields = 0
						<skips datasync>
......
<log force completes>

In this case t2 should have waited on the log force like t0 and t1, but
the lack of dirty fsync fields has allowed it to skip the log force.

We avoid these problems with ili_fields/ili_last_fields by always ORing
the ili_last_fields back into the ili_fields whenever we update it. This
means we don't lose bits that were set by previous operations that are
still in flight.
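To make that concrete, here's a throwaway userspace sketch that replays
the t0/t1/t2 interleaving above. All the names in it (pending,
in_flight, handoff_*) are invented for illustration - "pending" stands
in for ili_fsync_fields and "in_flight" for the proposed flushing field
- and it only models the field handoff, not the locking or the log
force itself:

	/* Toy replay of the handoff race; compile with any C compiler. */
	#include <stdio.h>
	#include <stdbool.h>

	#define ILOG_CORE	0x1	/* pretend "inode core dirty" bit */

	struct toy_item {
		unsigned int	pending;	/* dirtied since last handoff */
		unsigned int	in_flight;	/* handed to a running log force */
	};

	/* Handoff as in the RFC patch: plain assignment. */
	static void handoff_assign(struct toy_item *ip)
	{
		ip->in_flight = ip->pending;	/* clobbers older in-flight bits */
		ip->pending = 0;
	}

	/* Handoff mirroring the ili_last_fields pattern: OR the bits in. */
	static void handoff_or(struct toy_item *ip)
	{
		ip->in_flight |= ip->pending;
		ip->pending = 0;
	}

	/* What a later fsync caller checks to decide whether to wait. */
	static bool must_wait(struct toy_item *ip)
	{
		return (ip->pending | ip->in_flight) != 0;
	}

	static void replay(const char *name, void (*handoff)(struct toy_item *))
	{
		struct toy_item ip = { .pending = ILOG_CORE };

		handoff(&ip);	/* t0: starts a log force, not yet complete */
		handoff(&ip);	/* t1: nothing newly dirty, hands off again */

		/* t2 arrives while t0's log force is still running. */
		printf("%s: t2 %s\n", name,
		       must_wait(&ip) ? "waits for the log force (correct)"
				      : "skips it (missed data integrity flush)");
	}

	int main(void)
	{
		replay("assign", handoff_assign);
		replay("or    ", handoff_or);
		return 0;
	}

The assignment version ends up with both fields zero after t1's handoff,
which is exactly the hole t2 falls through; the OR version keeps the
in-flight bit visible until the force that covers it has completed.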
> +		xfs_iunlock(ip, XFS_ILOCK_SHARED);
>  		error = xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC,
>  				log_flushed);
> -
>  		spin_lock(&ip->i_itemp->ili_lock);
> -		ip->i_itemp->ili_fsync_fields = 0;
> +		ip->i_itemp->ili_fsync_flushing_fields = 0;
>  		spin_unlock(&ip->i_itemp->ili_lock);
> +	} else {
> +		xfs_iunlock(ip, XFS_ILOCK_SHARED);
>  	}
> -	xfs_iunlock(ip, XFS_ILOCK_SHARED);
>  	return error;
>  }

Hence it seems better to reintegrate xfs_fsync_seq() because it cleans
up the locking and logic. I'd also call it "ili_last_fsync_fields" to
match the naming that the inode flushing code uses (i.e.
ili_last_fields):

	struct xfs_inode_log_item *iip = ip->i_itemp;
	unsigned int		sync_fields;
	xfs_csn_t		seq = 0;

	xfs_ilock(ip, XFS_ILOCK_SHARED);
	if (!xfs_ipincount(ip)) {
		xfs_iunlock(ip, XFS_ILOCK_SHARED);
		return 0;
	}

	spin_lock(&iip->ili_lock);
	sync_fields = iip->ili_fsync_fields | iip->ili_last_fsync_fields;

	/*
	 * Don't force the log for O_DSYNC operations on inodes that
	 * only have dirty timestamps. Timestamps are not necessary
	 * for data integrity so we can skip them in this case.
	 */
	if (!datasync || (sync_fields & ~XFS_ILOG_TIMESTAMP))
		seq = iip->ili_commit_seq;

	iip->ili_last_fsync_fields |= iip->ili_fsync_fields;
	iip->ili_fsync_fields = 0;
	spin_unlock(&iip->ili_lock);
	xfs_iunlock(ip, XFS_ILOCK_SHARED);

	if (!seq)
		return 0;

	error = xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC,
			log_flushed);

	spin_lock(&iip->ili_lock);
	iip->ili_last_fsync_fields = 0;
	spin_unlock(&iip->ili_lock);
	return error;

>
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 829675700fcd..39d15eb9311d 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -1056,6 +1056,7 @@ xfs_iflush_abort_clean(
>  	iip->ili_last_fields = 0;
>  	iip->ili_fields = 0;
>  	iip->ili_fsync_fields = 0;
> +	iip->ili_fsync_flushing_fields = 0;
>  	iip->ili_flush_lsn = 0;
>  	iip->ili_item.li_buf = NULL;
>  	list_del_init(&iip->ili_item.li_bio_list);

I also suspect that ili_last_fsync_fields needs to be cleared in
xfs_iflush() at the same time ili_fsync_fields is cleared. i.e.
flushing the inode for writeback absolutely clears the fsync status of
the inode because, by definition, the inode is not pinned in the journal
and so is not dirty in memory. Hence a fsync at that point should result
in a no-op....

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx