On Tue, Jun 24, 2025 at 12:16:32PM +1000, Dave Chinner wrote:
> On Fri, Jun 20, 2025 at 09:07:59AM +0200, cem@xxxxxxxxxx wrote:
> > From: Carlos Maiolino <cem@xxxxxxxxxx>
> >
> > Instead of using ic_{next,prev}, replace it with the list_head
> > framework to simplify its use.
> >
> > This has a small logic change:
> >
> > So far log->l_iclog holds the current iclog pointer and moves this
> > pointer sequentially to the next iclog in the ring.
> >
> > Instead of keeping a separate iclog pointer as the 'current' one,
> > make the first list element the current iclog.
> > Once we mark the iclog as WANT_SYNC, just move it to the list tail,
> > making the next iclog the 'current' one.
>
> Hmmmm. I don't see a problem with using a list head for the ring,
> but I do see a problem with making the ring mutable.
>
> The current code sets up the iclog list once at mount, and it is
> read-only from then on. i.e. we never modify the iclog ring pointers
> after that, and the only thing that changes is the pointer to the
> first iclog.
>
> This means it is always safe to walk the iclog ring, regardless of
> whether the icloglock is held or not. I know there are asserts that
> walk the ring without the icloglock held; I'm not sure about the
> rest of the code.

This makes sense, thanks for the explanation. I also had another idea:
keep a separate pointer for the current active iclog instead of
modifying the ring, i.e. keep the ring immutable and just update an
active pointer to point at the current iclog.

Does it make sense to attempt that, or should I just scratch this idea?

Cheers,
Carlos

> It also means that shared cacheline access is all that is needed to
> walk the ring, and because the ring is not mutable, updating the
> first iclog in the ring (i.e. writing to log->l_iclog) doesn't
> change the shared CPU cache state of the iclog ring pointers.
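[Editor's note: a minimal userspace sketch of the immutable-ring design Dave describes, using illustrative names rather than the real XFS structures. The ic_next-style links are set up once and never change, so the ring can be walked without a lock, and switching to a new iclog rewrites only the single head pointer.]

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for xlog_in_core / xlog; not the real structs. */
struct iclog {
	struct iclog	*next;	/* never modified after setup */
	int		id;
};

struct logring {
	struct iclog	*head;	/* the only mutable pointer */
};

/* Link 'n' iclogs into a circular ring; done once, then read-only. */
static void ring_init(struct logring *log, struct iclog *iclogs, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		iclogs[i].id = i;
		iclogs[i].next = &iclogs[(i + 1) % n];
	}
	log->head = &iclogs[0];
}

/* Switching iclogs dirties exactly one pointer: log->head. */
static void ring_advance(struct logring *log)
{
	log->head = log->head->next;
}

/* Walking the ring needs no lock because the links never change. */
static int ring_count(const struct logring *log)
{
	const struct iclog *ic = log->head;
	int count = 0;

	do {
		count++;
		ic = ic->next;
	} while (ic != log->head);
	return count;
}
```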
> Converting it to a list head and making the iclog list mutable by
> moving items from head to tail instead of just changing which item
> log->l_iclog points to means it is no longer safe to walk the iclog
> ring without holding the icloglock.
>
> Further, the list_move_tail() call to update the first iclog in the
> ring now has to modify the list head cacheline (i.e. log->l_iclog),
> the list pointers of the iclog we are moving, the second iclog in
> the list that now becomes the head, and the old tail of the list we
> are inserting behind.
>
> IOWs, every time we switch to a new iclog, we now dirty 4 cachelines
> instead of just 1 (log->l_iclog). When the log is running hard (e.g.
> at 600MB/s on 32kB iclogs) we are switching iclogs around 20,000
> times a second. Hence this change results in a *lot* more cacheline
> dirtying in a fast path than we currently do, and that will likely
> have a measurable performance impact.
>
> Further, we generally touch those cachelines next in interrupt
> context, so journal IO completion will now have to flush those
> cachelines from the cache of a different CPU before they can be
> accessed whilst walking the iclog ring to complete iclogs in order.
> This will likely have a measurable impact on journal IO completion
> as well.
>
> Hence I think that the ring should remain immutable and the
> log->l_iclog pointer retained to index the first object in the ring.
> This means we don't need a list head in struct xlog for the iclog
> ring; the ring can simply contain just the iclogs, as it currently
> does.
>
> > @@ -476,8 +476,7 @@ xlog_state_shutdown_callbacks(
> >  	struct xlog_in_core	*iclog;
> >  	LIST_HEAD(cb_list);
> >
> > -	iclog = log->l_iclog;
> > -	do {
> > +	list_for_each_entry(iclog, &log->l_iclogs, ic_list) {
> >  		if (atomic_read(&iclog->ic_refcnt)) {
> >  			/* Reference holder will re-run iclog callbacks.
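[Editor's note: a toy model of Dave's "4 cachelines vs 1" argument. A list_move_tail()-style rotation writes the list head, the node being moved, the node that becomes the new first entry, and the old tail -- four distinct structures, hence plausibly four cachelines. This is an instrumented hand-rolled list, not the kernel's list.h implementation.]

```c
#include <assert.h>
#include <stddef.h>

struct node {
	struct node	*prev, *next;
	int		dirtied;	/* set when this node is written */
};

static void node_link(struct node *n, struct node *prev, struct node *next)
{
	n->prev = prev;
	n->next = next;
	prev->next = n;
	next->prev = n;
	n->dirtied = prev->dirtied = next->dirtied = 1;
}

static void node_del(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev->dirtied = n->next->dirtied = 1;
}

/* Move the first entry after 'head' to the tail, list_move_tail()-style. */
static void move_first_to_tail(struct node *head)
{
	struct node *first = head->next;

	node_del(first);
	node_link(first, head->prev, head);
}

/* Count how many structures a single rotation had to write. */
static int count_dirtied(struct node *nodes, int n)
{
	int i, count = 0;

	for (i = 0; i < n; i++)
		count += nodes[i].dirtied;
	return count;
}
```

With five nodes (a head plus four entries), one rotation dirties four of them; the immutable ring's equivalent operation writes a single pointer.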
> >  			 */
> >  			continue;
> >
> > @@ -490,7 +489,7 @@ xlog_state_shutdown_callbacks(
> >  		spin_lock(&log->l_icloglock);
> >  		wake_up_all(&iclog->ic_write_wait);
> >  		wake_up_all(&iclog->ic_force_wait);
> > -	} while ((iclog = iclog->ic_next) != log->l_iclog);
> > +	}
>
> This is likely broken by the ring being made mutable. The
> l_icloglock is dropped in the middle of the list traversal, meaning
> the ring order can change whilst callbacks are running. It is
> critical that this operation occurs in ascending LSN order.
>
> This is why the ring is immutable; we can walk around the ring
> multiple times here whilst submission and completion are occurring
> concurrently with callback processing.
>
> The same goes for xlog_state_do_callback ->
> xlog_state_do_iclog_callbacks(), especially the bit about always
> iterating iclogs in ascending LSN order.
>
> >  	wake_up_all(&log->l_flush_wait);
> >  }
> >
> > @@ -810,13 +809,11 @@ xlog_force_iclog(
> >  static void
> >  xlog_wait_iclog_completion(struct xlog *log)
> >  {
> > -	int			i;
> > -	struct xlog_in_core	*iclog = log->l_iclog;
> > +	struct xlog_in_core	*iclog;
> >
> > -	for (i = 0; i < log->l_iclog_bufs; i++) {
> > +	list_for_each_entry(iclog, &log->l_iclogs, ic_list) {
> >  		down(&iclog->ic_sema);
> >  		up(&iclog->ic_sema);
> > -		iclog = iclog->ic_next;
> >  	}
> >  }
>
> This is called without the l_icloglock held, so if the list is
> mutable this can go wrong....
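[Editor's note: a sketch of why the immutable ring tolerates dropping the lock mid-traversal, the point Dave makes about ascending-LSN ordering. The walk is anchored on a snapshot of the first iclog and follows links that never change, so nothing done while the lock is dropped can reorder the remainder of the walk. Illustrative names only.]

```c
#include <assert.h>
#include <stddef.h>

struct iclog {
	struct iclog	*next;	/* immutable after setup */
	long		lsn;
};

/*
 * Walk the whole ring once from 'first' and return 1 if the LSNs were
 * visited in strictly ascending order, 0 otherwise.
 */
static int walk_is_ascending(struct iclog *first)
{
	struct iclog *ic = first;
	long prev_lsn = -1;

	do {
		if (ic->lsn <= prev_lsn)
			return 0;
		prev_lsn = ic->lsn;
		/*
		 * In the real code the icloglock would be dropped here
		 * to run callbacks; with immutable links that cannot
		 * change which iclog the walk visits next.
		 */
		ic = ic->next;
	} while (ic != first);
	return 1;
}
```

A mutable list, by contrast, can have entries moved while the lock is dropped, which is exactly the reordering the second assertion below simulates.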
> > @@ -2486,19 +2471,17 @@ xlog_state_do_iclog_callbacks(
> >  		__releases(&log->l_icloglock)
> >  		__acquires(&log->l_icloglock)
> >  {
> > -	struct xlog_in_core	*first_iclog = log->l_iclog;
> > -	struct xlog_in_core	*iclog = first_iclog;
> > +	struct xlog_in_core	*iclog;
> >  	bool			ran_callback = false;
> >
> > -	do {
> > +	list_for_each_entry(iclog, &log->l_iclogs, ic_list) {
> >  		LIST_HEAD(cb_list);
> >
> >  		if (xlog_state_iodone_process_iclog(log, iclog))
> >  			break;
> > -		if (iclog->ic_state != XLOG_STATE_CALLBACK) {
> > -			iclog = iclog->ic_next;
> > +		if (iclog->ic_state != XLOG_STATE_CALLBACK)
> >  			continue;
> > -		}
> > +
> >  		list_splice_init(&iclog->ic_callbacks, &cb_list);
> >  		spin_unlock(&log->l_icloglock);
> >
> > @@ -2509,8 +2492,7 @@ xlog_state_do_iclog_callbacks(
> >
> >  		spin_lock(&log->l_icloglock);
> >  		xlog_state_clean_iclog(log, iclog);
> > -		iclog = iclog->ic_next;
> > -	} while (iclog != first_iclog);
> > +	}
>
> As per above, the icloglock is dropped during iteration here...
>
> > @@ -2913,7 +2898,7 @@ xfs_log_force(
> >  	 * is nothing to sync out. Otherwise, we attach ourselves to the
> >  	 * previous iclog and go to sleep.
> >  	 */
> > -	iclog = iclog->ic_prev;
> > +	iclog = list_prev_entry_circular(iclog, &log->l_iclogs, ic_list);
>
> That's not really an improvement. :/
>
> But if we just make the iclogs a circular list without the
> log->l_iclogs head, then it's just list_prev_entry().
>
> Still not sure this is better than the current code....
>
> > @@ -3333,12 +3319,8 @@ xlog_verify_iclog(
> >
> >  	/* check validity of iclog pointers */
> >  	spin_lock(&log->l_icloglock);
> > -	icptr = log->l_iclog;
> > -	for (i = 0; i < log->l_iclog_bufs; i++, icptr = icptr->ic_next)
> > +	list_for_each_entry(icptr, &log->l_iclogs, ic_list)
> >  		ASSERT(icptr);
>
> This needs to count the number of iclogs in the list, check it
> against log->l_iclog_bufs...
>
> > -
> > -	if (icptr != log->l_iclog)
> > -		xfs_emerg(log->l_mp, "%s: corrupt iclog ring", __func__);
>
> ....
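[Editor's note: a sketch of Dave's point about ic_prev versus list_prev_entry_circular(). With a separate list head, stepping to the previous entry must skip the head sentinel when the current entry is the first one; on a headless circular ring, the previous entry is simply node->prev. Illustrative structures, not the kernel's <linux/list.h>.]

```c
#include <assert.h>
#include <stddef.h>

struct node {
	struct node	*prev, *next;
	int		id;	/* -1 marks the sentinel head */
};

/* list_prev_entry_circular()-style: skip the sentinel head node. */
static struct node *prev_with_head(struct node *n, struct node *head)
{
	struct node *p = n->prev;

	return (p == head) ? head->prev : p;
}

/* Headless circular ring: the previous entry is just the prev pointer. */
static struct node *prev_headless(struct node *n)
{
	return n->prev;
}
```

The headless form is what the current ic_prev pointer already gives XFS, which is why the circular helper reads as a step backwards.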
> because that is what this checks.
>
> >  	spin_unlock(&log->l_icloglock);
> >
> >  	/* check log magic numbers */
> > @@ -3531,17 +3513,15 @@ STATIC int
> >  xlog_iclogs_empty(
> >  	struct xlog		*log)
> >  {
> > -	xlog_in_core_t		*iclog;
> > +	struct xlog_in_core	*iclog;
> >
> > -	iclog = log->l_iclog;
> > -	do {
> > +	list_for_each_entry(iclog, &log->l_iclogs, ic_list) {
> >  		/* endianness does not matter here, zero is zero in
> >  		 * any language.
> >  		 */
> >  		if (iclog->ic_header.h_num_logops)
> >  			return 0;
> > -		iclog = iclog->ic_next;
> > -	} while (iclog != log->l_iclog);
> > +	}
> >  	return 1;
>
> Called without the icloglock held from debug code.
>
> -Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
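[Editor's note: a sketch of the replacement check Dave asks for in xlog_verify_iclog(): count the entries in the iclog ring and compare against the expected number of iclog buffers, which is what the removed "corrupt iclog ring" test verified. Illustrative names only; the real code would use log->l_iclog_bufs.]

```c
#include <assert.h>
#include <stddef.h>

struct iclog {
	struct iclog	*next;
};

/*
 * Walk at most 'expected' + 1 steps so a corrupt ring cannot loop
 * forever; return 1 if the ring closes back at 'head' after exactly
 * 'expected' entries, 0 otherwise.
 */
static int ring_is_valid(struct iclog *head, int expected)
{
	struct iclog *ic = head;
	int count = 0;

	do {
		if (!ic || ++count > expected)
			return 0;
		ic = ic->next;
	} while (ic != head);
	return count == expected;
}
```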