On Mon, Mar 17, 2025 at 08:25:16AM +0800, Gao Xiang wrote:
> Hi Dave,
> 
> On 2025/3/17 05:25, Dave Chinner wrote:
> > On Sat, Mar 15, 2025 at 01:19:31AM +0800, Gao Xiang wrote:
> > > Hi folks,
> > > 
> > > Days ago, I received an XFS Unixbench[1] shell1 (high-concurrency)
> > > performance regression during a benchmark comparison between XFS and
> > > EXT4: the XFS result was lower than EXT4 by 15% on Linux 6.6.y with
> > > a 144-core aarch64 machine (64K page size).  Since Unixbench is
> > > somewhat important as an indicator of overall system performance for
> > > many end users, it's not a good result.
> > 
> > Unixbench isn't really that indicative of typical workloads on large
> > core-count machines these days.  It's an ancient benchmark, and it's
> > exceedingly rare that a modern machine is fully loaded with shell
> > scripts like the ones the shell1 test runs, because it's highly
> > inefficient to do large scale concurrent processing of data in this
> > way....
> > 
> > Indeed, look at the file copy "benchmarks" it runs - they use buffer
> > sizes of 256, 1024 and 4096 bytes to tell you how well the
> > filesystem performs.  Using sub-page size buffers might have been
> > common for 1983-era CPUs to get the highest possible file copy
> > throughput, but these days those are slow paths that we largely
> > don't optimise for highest throughput.  Measuring modern system
> > scalability via how such operations perform is largely meaningless
> > because applications don't behave this way anymore....
> 
> Thanks for your reply!
> 
> Sigh.  Many customers really care, and they select their whole
> software stack based on this benchmark.

People have been using benchmarks that have no relevance to their
software/application stack behaviour as the basis of their purchase
decisions for decades.

> If they think the results are not good, they might ask us to move
> away from XFS.  It's not something I can do anything about, you know.

If they think there is a filesystem better suited to their
requirements than XFS, then they are free to make that decision
themselves.  We can point out that their selection metrics are
irrelevant to their actual workload, but in my experience this just
makes the people running the selection trial more convinced they are
right, and they still make a poor decision....

> > > The shell1 test[2] basically runs in a loop that executes commands
> > > to generate files (sort.$$, od.$$, grep.$$, wc.$$) and then removes
> > > them.  The testcase lasts for one minute and then shows the total
> > > number of iterations.
> > > 
> > > While no difference was observed in single-threaded results, it
> > > showed the noticeable difference above if `./Run shell1 -c 144 -i 1`
> > > is used.
> > 
> > I'm betting that the XFS filesystem is small and only has 4 AGs,
> > and so has very limited concurrency in allocation.
> > 
> > i.e. you're trying to run a massively concurrent workload on a
> > filesystem that only has - at best - the ability to do 4 allocations
> > or frees at a time.  Of course it is going to contend on the
> > allocation group locks....
> 
> I've adjusted this, from 4 AGs to 20 AGs.  No real impact.

Yup, still very limited concurrency considering that you are running
144 instances of that workload (which, AFAICT, are all doing
independent work).  This implies that a couple of hundred AGs would
be needed to provide sufficient allocation concurrency for this sort
of workload.
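As an aside, AG count is fixed at mkfs time, so experimenting with a
much higher count means reformatting the test filesystem.  A minimal
sketch of what that looks like (the device, mount point and agcount
value below are illustrative assumptions, not the reported setup, and
mkfs.xfs will refuse an agcount that would make each AG smaller than
the minimum AG size):

    # Reformat with a few hundred allocation groups (example value only).
    mkfs.xfs -f -d agcount=192 /dev/sdX
    mount /dev/sdX /mnt/test

    # Confirm the resulting geometry; agcount/agsize appear in the
    # meta-data lines of the output.
    xfs_info /mnt/test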
> > > I tried a hack to disable deferred inode inactivation, as below,
> > > and the shell1 benchmark then recovered: XFS (35649.6 -> 37810.9):
> > > 
> > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > index 7b6c026d01a1..d9fb2ef3686a 100644
> > > --- a/fs/xfs/xfs_icache.c
> > > +++ b/fs/xfs/xfs_icache.c
> > > @@ -2059,6 +2059,7 @@ void
> > >  xfs_inodegc_start(
> > >  	struct xfs_mount	*mp)
> > >  {
> > > +	return;
> > >  	if (xfs_set_inodegc_enabled(mp))
> > >  		return;
> > > 
> > > @@ -2180,6 +2181,12 @@ xfs_inodegc_queue(
> > >  	ip->i_flags |= XFS_NEED_INACTIVE;
> > >  	spin_unlock(&ip->i_flags_lock);
> > > 
> > > +	if (1) {
> > > +		xfs_iflags_set(ip, XFS_INACTIVATING);
> > > +		xfs_inodegc_inactivate(ip);
> > > +		return;
> > > +	}
> > 
> > That reintroduces potential deadlock vectors by running blocking
> > transactions directly from iput() and/or memory reclaim.  That's one
> > of the main reasons we moved inactivation to a background thread -
> > it gets rid of an entire class of potential deadlock problems....
> 
> Yeah, I noticed that too, mainly
> commit 68b957f64fca ("xfs: load uncached unlinked inodes into memory
> on demand").

That is not related to the class of deadlocks and issues I'm
referring to.

Running a transaction in memory reclaim context (i.e. the shrinker
evicting the inode from memory) means that memory reclaim now blocks
on journal space, IO, buffer locks, etc.

The sort of deadlock that this can cause is a non-transactional
operation above memory reclaim holding a buffer lock (e.g. bulkstat
reading the AGI btree), then requiring memory allocation (e.g.
pulling an AGI btree block into memory) which triggers direct memory
reclaim, which then tries to inactivate an inode, which then (for
whatever reason) requires taking an AGI btree block lock....

That is the class of potential deadlock that background inode
inactivation avoids completely.  It also avoids excessive inode
eviction latency (important, as shrinkers run from direct reclaim
are supposed to be non-blocking) and other sub-optimal inode
eviction behaviours...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx