Re: [PATCH 1/2] xfs: replace iclogs circular list with a list_head

On Fri, Jun 20, 2025 at 09:07:59AM +0200, cem@xxxxxxxxxx wrote:
> From: Carlos Maiolino <cem@xxxxxxxxxx>
> 
> Instead of using ic_{next,prev}, replace them with the list_head
> infrastructure to simplify their use.
> 
> This has a small logic change:
> 
> So far log->l_iclog holds a pointer to the current iclog, and this
> pointer is moved sequentially to the next iclog in the ring.
> 
> Instead of keeping a separate iclog pointer as the 'current' one, make
> the first list element the current iclog.
> Once we mark the iclog as WANT_SYNC, just move it to the list tail,
> making the next iclog the 'current' one.

Hmmmm. I don't see a problem with using a list head for the ring,
but I do see a problem with making the ring mutable.

The current code sets up the iclog ring once at mount, and it is
read-only from then on. i.e. we never modify the iclog ring
pointers again; the only thing that changes is the pointer to the
first iclog.

This means it is always safe to walk the iclog ring, regardless of
whether the icloglock is held or not. I know there are asserts that
walk the ring without the icloglock held; I'm not sure about the
rest of the code.
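
That is, the pattern the current code relies on looks like this
(a sketch of the existing ic_next walk, not a new API):

	struct xlog_in_core	*iclog = log->l_iclog;

	do {
		/* the ring pointers never change after mount, so
		 * this walk needs no locking to stay coherent */
		iclog = iclog->ic_next;
	} while (iclog != log->l_iclog);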

It also means that shared cacheline access is all that is needed to
walk the ring, and because the ring is not mutable, updating the
first iclog in the ring (i.e. writing to log->l_iclog) doesn't
change the shared CPU cache state of the iclog ring pointers.

Converting it to a list head and making the iclog list mutable by
moving items from head to tail instead of just changing which item
log->l_iclog points to means it is no longer safe to walk the iclog
ring without holding the icloglock.

Further, the list_move_tail() call to update the first iclog in the
ring now has to modify the list head cache line (i.e. log->l_iclogs)
and the list pointers for the iclog we are moving, the second iclog
in the list that now becomes the head, and the old tail of the list
we are inserting behind.

IOWs, every time we switch to a new iclog, we now dirty 4 cachelines
instead of just 1 (log->l_iclog). When the log is running hard (e.g.
at 600MB/s on 32kB iclogs) we are switching iclogs around 20,000
times a second. Hence this change results in a *lot* more cacheline
dirtying in a fast path than we currently do, and that will likely
have measurable performance impact.
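
In code terms, the iclog switch goes from dirtying one cacheline to
four (illustrative; the assignment is the current code, the
list_move_tail() is what the patch description implies):

	/* current: a single store to the struct xlog */
	log->l_iclog = iclog->ic_next;

	/* proposed: rewrites the list head, the moved iclog, the
	 * new first iclog and the old list tail */
	list_move_tail(&iclog->ic_list, &log->l_iclogs);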

Further, we generally touch those cachelines next in interrupt
context, so journal IO completion will now have to flush those
cachelines from the cache of a different CPU before they can be
accessed whilst walking the iclog ring to complete iclogs in order.
This will likely have a measurable impact on journal IO completion
as well.

Hence I think that the ring should remain immutable and the
log->l_iclog pointer retained to index the first object in the ring.
This means we don't need a list head in the struct xlog for the
iclog ring; the ring can simply contain just the iclogs, as it
does now.
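For example, mount-time setup could still link the iclogs to each
other with no head in the struct xlog (untested sketch; iclogp[]
stands in for the allocation loop in xlog_alloc_log()):

	INIT_LIST_HEAD(&iclogp[0]->ic_list);
	for (i = 1; i < log->l_iclog_bufs; i++)
		list_add_tail(&iclogp[i]->ic_list, &iclogp[0]->ic_list);
	log->l_iclog = iclogp[0];

	/* every node is a real iclog, so walking never has to
	 * skip over a list head */
	iclog = list_next_entry(iclog, ic_list);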


> @@ -476,8 +476,7 @@ xlog_state_shutdown_callbacks(
>  	struct xlog_in_core	*iclog;
>  	LIST_HEAD(cb_list);
>  
> -	iclog = log->l_iclog;
> -	do {
> +	list_for_each_entry(iclog, &log->l_iclogs, ic_list) {
>  		if (atomic_read(&iclog->ic_refcnt)) {
>  			/* Reference holder will re-run iclog callbacks. */
>  			continue;
> @@ -490,7 +489,7 @@ xlog_state_shutdown_callbacks(
>  		spin_lock(&log->l_icloglock);
>  		wake_up_all(&iclog->ic_write_wait);
>  		wake_up_all(&iclog->ic_force_wait);
> -	} while ((iclog = iclog->ic_next) != log->l_iclog);
> +	}

This is likely broken by the ring being made mutable. The
l_icloglock is dropped in the middle of the list traversal, meaning
the ring order can change whilst callbacks are running. It is
critical that this operation occurs in ascending LSN order.

This is why the ring is immutable; we can walk around the ring
multiple times here whilst submission and completion is occurring
concurrently with callback processing.

Same goes for xlog_state_do_callback ->
xlog_state_do_iclog_callbacks(), especially the bit about always
iterating iclogs in ascending LSN order.
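
The underlying problem is the iteration cursor itself (illustrative):

	list_for_each_entry(iclog, &log->l_iclogs, ic_list) {
		spin_unlock(&log->l_icloglock);
		/* another CPU can list_move_tail() this iclog here,
		 * so ic_list.next now points at the list head and
		 * the walk terminates early, skipping iclogs and
		 * breaking ascending LSN ordering */
		spin_lock(&log->l_icloglock);
	}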

>  	wake_up_all(&log->l_flush_wait);
>  }
> @@ -810,13 +809,11 @@ xlog_force_iclog(
>  static void
>  xlog_wait_iclog_completion(struct xlog *log)
>  {
> -	int		i;
> -	struct xlog_in_core	*iclog = log->l_iclog;
> +	struct xlog_in_core	*iclog;
>  
> -	for (i = 0; i < log->l_iclog_bufs; i++) {
> +	list_for_each_entry(iclog, &log->l_iclogs, ic_list) {
>  		down(&iclog->ic_sema);
>  		up(&iclog->ic_sema);
> -		iclog = iclog->ic_next;
>  	}
>  }

This is called without the l_icloglock held, so if the list is
mutable this can go wrong....

> @@ -2486,19 +2471,17 @@ xlog_state_do_iclog_callbacks(
>  		__releases(&log->l_icloglock)
>  		__acquires(&log->l_icloglock)
>  {
> -	struct xlog_in_core	*first_iclog = log->l_iclog;
> -	struct xlog_in_core	*iclog = first_iclog;
> +	struct xlog_in_core	*iclog;
>  	bool			ran_callback = false;
>  
> -	do {
> +	list_for_each_entry(iclog, &log->l_iclogs, ic_list) {
>  		LIST_HEAD(cb_list);
>  
>  		if (xlog_state_iodone_process_iclog(log, iclog))
>  			break;
> -		if (iclog->ic_state != XLOG_STATE_CALLBACK) {
> -			iclog = iclog->ic_next;
> +		if (iclog->ic_state != XLOG_STATE_CALLBACK)
>  			continue;
> -		}
> +
>  		list_splice_init(&iclog->ic_callbacks, &cb_list);
>  		spin_unlock(&log->l_icloglock);
>  
> @@ -2509,8 +2492,7 @@ xlog_state_do_iclog_callbacks(
>  
>  		spin_lock(&log->l_icloglock);
>  		xlog_state_clean_iclog(log, iclog);
> -		iclog = iclog->ic_next;
> -	} while (iclog != first_iclog);
> +	}

As per above, the icloglock is dropped during iteration here...

> @@ -2913,7 +2898,7 @@ xfs_log_force(
>  		 * is nothing to sync out. Otherwise, we attach ourselves to the
>  		 * previous iclog and go to sleep.
>  		 */
> -		iclog = iclog->ic_prev;
> +		iclog = list_prev_entry_circular(iclog, &log->l_iclogs, ic_list);

That's not really an improvement. :/

But if we just make the iclogs a circular list without the
log->l_iclogs head, then it's just list_prev_entry().

Still not sure this is better than the current code....
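
i.e. with a headless ring this would simply be (untested):

	iclog = list_prev_entry(iclog, ic_list);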

> @@ -3333,12 +3319,8 @@ xlog_verify_iclog(
>  
>  	/* check validity of iclog pointers */
>  	spin_lock(&log->l_icloglock);
> -	icptr = log->l_iclog;
> -	for (i = 0; i < log->l_iclog_bufs; i++, icptr = icptr->ic_next)
> +	list_for_each_entry(icptr, &log->l_iclogs, ic_list)
>  		ASSERT(icptr);

This needs to count the number of iclogs in the list and check it
against log->l_iclog_bufs...

> -
> -	if (icptr != log->l_iclog)
> -		xfs_emerg(log->l_mp, "%s: corrupt iclog ring", __func__);

.... because that is what this checks.
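
Something like this would preserve that check (untested sketch):

	int	count = 0;

	list_for_each_entry(icptr, &log->l_iclogs, ic_list)
		count++;
	if (count != log->l_iclog_bufs)
		xfs_emerg(log->l_mp, "%s: corrupt iclog ring", __func__);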

>  	spin_unlock(&log->l_icloglock);
>  
>  	/* check log magic numbers */
> @@ -3531,17 +3513,15 @@ STATIC int
>  xlog_iclogs_empty(
>  	struct xlog	*log)
>  {
> -	xlog_in_core_t	*iclog;
> +	struct xlog_in_core	*iclog;
>  
> -	iclog = log->l_iclog;
> -	do {
> +	list_for_each_entry(iclog, &log->l_iclogs, ic_list) {
>  		/* endianness does not matter here, zero is zero in
>  		 * any language.
>  		 */
>  		if (iclog->ic_header.h_num_logops)
>  			return 0;
> -		iclog = iclog->ic_next;
> -	} while (iclog != log->l_iclog);
> +	}
>  	return 1;

Called without icloglock held from debug code.

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx