Re: [report] Unixbench shell1 performance regression

Dave Chinner <david@xxxxxxxxxxxxx> · Mon, 17 Mar 2025 08:25:54 +1100

On Sat, Mar 15, 2025 at 01:19:31AM +0800, Gao Xiang wrote:
> Hi folks,
> 
> Days ago, I received a XFS Unixbench[1] shell1 (high-concurrency)
> performance regression during a benchmark comparison between XFS and
> EXT4:  The XFS result was lower than EXT4 by 15% on Linux 6.6.y with
> 144-core aarch64 (64K page size).  Since Unixbench is somewhat important
> to indicate overall system performance for many end users, it's not
> a good result.

Unixbench isn't really that indicative of typical workloads on large
core-count machines these days. It's an ancient benchmark, and it's
exceedingly rare that a modern machine is fully loaded with shell
scripts like those the shell1 test runs, because it's highly
inefficient to do large scale concurrent processing of data in this
way....

Indeed, look at the file copy "benchmarks" it runs - they use buffer
sizes of 256, 1024 and 4096 bytes to tell you how well the
filesystem performs. Using sub-page size buffers might have been
common for 1983-era CPUs to get the highest possible file copy
throughput, but these days these are slow paths that we largely
don't optimise for highest throughput. Measuring modern system
scalability via how such operations perform is largely meaningless
because applications don't behave this way anymore....

> shell1 test[2] basically runs in a loop that it executes commands
> to generate files (sort.$$, od.$$, grep.$$, wc.$$) and then remove
> them.  The testcase lasts for one minute and then show the total number
> of iterations.
> 
> While no difference was observed in single-threaded results, it showed
> a noticeable difference above if  `./Run shell1 -c 144 -i 1`  is used.

I'm betting that the XFS filesystem is small and only has 4 AGs,
and so has very limited concurrency in allocation.

i.e. you're trying to run a massively concurrent workload on a
filesystem that only has - at best - the ability to do 4 allocations
or frees at a time. Of course it is going to contend on the
allocation group locks....
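
As a back-of-the-envelope sketch of why that matters (a toy model with
made-up function names, not kernel code, and the 4-AG figure is only my
guess above): if each AG serialises its allocations and frees, the AG
count caps how many can run at once, and the remaining tasks queue up
behind the AG locks.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model: one allocation or free can make progress per AG
 * at a time, so concurrency is capped by the smaller of the task count
 * and the AG count.
 */
static uint32_t max_parallel_allocs(uint32_t ntasks, uint32_t agcount)
{
	assert(agcount > 0);
	return ntasks < agcount ? ntasks : agcount;
}

/*
 * Tasks competing for the busiest AG lock, assuming the tasks spread
 * as evenly as possible across AGs (ceiling division).
 */
static uint32_t tasks_per_ag(uint32_t ntasks, uint32_t agcount)
{
	assert(agcount > 0);
	return (ntasks + agcount - 1) / agcount;
}
```

With 144 tasks and 4 AGs this model gives at most 4 operations in
flight and 36 tasks competing for each AG lock; with at least as many
AGs as tasks, each task could in principle get an AG to itself.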

> The original report was on aarch64, but I could still reproduce some
> difference on Linux 6.13 with a X86 physical machine:
> 
> Intel(R) Xeon(R) Platinum 8331C CPU @ 2.50GHz * 96 cores
> 512 GiB memory
> 
> XFS (35649.6) is still lower than EXT4 (37146.0) by 4% and
> the kconfig is attached.
> 
> However, I don't observe much difference on 5.10.y kernels.  After
> collecting some off-CPU trace, I found there are many new agi buf
> lock waits compared with the corresponding 5.10.y trace, as below:

Yes, because background inactivation can increase the contention on
AGF/AGI buffer locks when there is insufficient concurrency in the
filesystem layout. It is rare, however, that any workload other than
benchmarks generates enough load and/or concurrency to reach the
thresholds where such lock breakdown occurs.

> I tried to do some hack to disable defer inode inactivation as below,
> the shell1 benchmark then recovered: XFS (35649.6 -> 37810.9):
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 7b6c026d01a1..d9fb2ef3686a 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -2059,6 +2059,7 @@ void
>  xfs_inodegc_start(
>  	struct xfs_mount	*mp)
>  {
> +	return;
>  	if (xfs_set_inodegc_enabled(mp))
>  		return;
> 
> @@ -2180,6 +2181,12 @@ xfs_inodegc_queue(
>  	ip->i_flags |= XFS_NEED_INACTIVE;
>  	spin_unlock(&ip->i_flags_lock);
> 
> +	if (1) {
> +		xfs_iflags_set(ip, XFS_INACTIVATING);
> +		xfs_inodegc_inactivate(ip);
> +		return;
> +	}

That reintroduces potential deadlock vectors by running blocking
transactions directly from iput() and/or memory reclaim. That's one
of the main reasons we moved inactivation to a background thread -
it gets rid of an entire class of potential deadlock problems....
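
To illustrate the pattern only (a minimal userspace sketch with
hypothetical names, not the actual xfs_inodegc code): the release path
just links the inode onto a queue and returns, and the work that may
block is done later by a worker running in a context where blocking is
safe.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for illustration; these are not XFS structures. */
struct toy_inode {
	struct toy_inode	*next;
	int			inactivated;
};

struct toy_gc_queue {
	struct toy_inode	*head;
	unsigned int		queued;
};

/* Release path: never blocks, only links the inode onto the queue. */
static void toy_gc_queue_inode(struct toy_gc_queue *q, struct toy_inode *ip)
{
	assert(q != NULL && ip != NULL);
	ip->inactivated = 0;
	ip->next = q->head;
	q->head = ip;
	q->queued++;
}

/*
 * Background worker: detaches the whole list, then processes it.
 * Setting ->inactivated stands in for the blocking transaction work.
 * Returns the number of inodes processed.
 */
static unsigned int toy_gc_worker(struct toy_gc_queue *q)
{
	struct toy_inode *ip, *next;
	unsigned int done = 0;

	assert(q != NULL);
	ip = q->head;
	q->head = NULL;
	q->queued = 0;

	for (; ip != NULL; ip = next) {
		next = ip->next;
		ip->next = NULL;
		ip->inactivated = 1;
		done++;
	}
	return done;
}
```

The hack quoted above effectively calls the worker's body straight from
the queueing function, which puts the blocking work back into the
release path.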

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx