On Mon, 2025-05-05 at 13:39 -0400, Laurence Oberman wrote:
> On Mon, 2025-05-05 at 09:21 -0400, Laurence Oberman wrote:
> > On Mon, 2025-05-05 at 08:29 -0400, Laurence Oberman wrote:
> > > On Mon, 2025-05-05 at 07:50 +1000, Dave Chinner wrote:
> > > > [cc linux-block]
> > > >
> > > > [original bug report:
> > > > https://lore.kernel.org/linux-xfs/CAAiJnjoo0--yp47UKZhbu8sNSZN6DZ-QzmZBMmtr1oC=fOOgAQ@xxxxxxxxxxxxxx/
> > > > ]
> > > >
> > > > On Sun, May 04, 2025 at 10:22:58AM +0300, Anton Gavriliuk wrote:
> > > > > > What's the comparitive performance of an identical read
> > > > > > profile directly on the raw MD raid0 device?
> > > > >
> > > > > Rocky 9.5 (5.14.0-503.40.1.el9_5.x86_64)
> > > > >
> > > > > [root@localhost ~]# df -mh /mnt
> > > > > Filesystem      Size  Used Avail Use% Mounted on
> > > > > /dev/md127       35T  1.3T   34T   4% /mnt
> > > > >
> > > > > [root@localhost ~]# fio --name=test --rw=read --bs=256k
> > > > > --filename=/dev/md127 --direct=1 --numjobs=1 --iodepth=64
> > > > > --exitall --group_reporting --ioengine=libaio --runtime=30
> > > > > --time_based
> > > > > test: (g=0): rw=read, bs=(R) 256KiB-256KiB, (W) 256KiB-256KiB,
> > > > > (T) 256KiB-256KiB, ioengine=libaio, iodepth=64
> > > > > fio-3.39-44-g19d9
> > > > > Starting 1 process
> > > > > Jobs: 1 (f=1): [R(1)][100.0%][r=81.4GiB/s][r=334k IOPS][eta 00m:00s]
> > > > > test: (groupid=0, jobs=1): err= 0: pid=43189: Sun May 4 08:22:12 2025
> > > > >   read: IOPS=363k, BW=88.5GiB/s (95.1GB/s)(2656GiB/30001msec)
> > > > >     slat (nsec): min=971, max=312380, avg=1817.92, stdev=1367.75
> > > > >     clat (usec): min=78, max=1351, avg=174.46, stdev=28.86
> > > > >      lat (usec): min=80, max=1352, avg=176.27, stdev=28.81
> > > > >
> > > > > Fedora 42 (6.14.5-300.fc42.x86_64)
> > > > >
> > > > > [root@localhost anton]# df -mh /mnt
> > > > > Filesystem      Size  Used Avail Use% Mounted on
> > > > > /dev/md127       35T  1.3T   34T   4% /mnt
> > > > >
> > > > > [root@localhost ~]# fio --name=test --rw=read --bs=256k
> > > > > --filename=/dev/md127 --direct=1 --numjobs=1 --iodepth=64
> > > > > --exitall --group_reporting --ioengine=libaio --runtime=30
> > > > > --time_based
> > > > > test: (g=0): rw=read, bs=(R) 256KiB-256KiB, (W) 256KiB-256KiB,
> > > > > (T) 256KiB-256KiB, ioengine=libaio, iodepth=64
> > > > > fio-3.39-44-g19d9
> > > > > Starting 1 process
> > > > > Jobs: 1 (f=1): [R(1)][100.0%][r=41.0GiB/s][r=168k IOPS][eta 00m:00s]
> > > > > test: (groupid=0, jobs=1): err= 0: pid=5685: Sun May 4 10:14:00 2025
> > > > >   read: IOPS=168k, BW=41.0GiB/s (44.1GB/s)(1231GiB/30001msec)
> > > > >     slat (usec): min=3, max=273, avg= 5.63, stdev= 1.48
> > > > >     clat (usec): min=67, max=2800, avg=374.99, stdev=29.90
> > > > >      lat (usec): min=72, max=2914, avg=380.62, stdev=30.22
> > > >
> > > > So the MD block device shows the same read performance as the
> > > > filesystem on top of it. That means this is a regression at the
> > > > MD device layer or in the block/driver layers below it. i.e. it
> > > > is not an XFS of filesystem issue at all.
> > > >
> > > > -Dave.
> > >
> > > I have a lab setup, let me see if I can also reproduce and then
> > > trace this to see where it is spending the time
> >
> > Not seeing 1/2 the bandwidth but also significantly slower on
> > Fedora42 kernel.
> > I will trace it
> >
> > 9.5 kernel - 5.14.0-503.40.1.el9_5.x86_64
> >
> > Run status group 0 (all jobs):
> >    READ: bw=14.7GiB/s (15.8GB/s), 14.7GiB/s-14.7GiB/s (15.8GB/s-15.8GB/s), io=441GiB (473GB), run=30003-30003msec
> >
> > Fedora42 kernel - 6.14.5-300.fc42.x86_64
> >
> > Run status group 0 (all jobs):
> >    READ: bw=10.4GiB/s (11.2GB/s), 10.4GiB/s-10.4GiB/s (11.2GB/s-11.2GB/s), io=313GiB (336GB), run=30001-30001msec
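
For anyone wanting to follow along with the "I will trace it" step above: that is just a block-layer trace of the md device while the fio job is running. A minimal sketch, assuming the usual blktrace/blkparse tools; the device name, runtime and output name here are only examples, adjust to your array:

  # capture 30 seconds of block-layer events on the md device
  blktrace -d /dev/md1 -w 30 -o mdtrace
  # merge and decode the per-cpu trace files for reading
  blkparse -i mdtrace | less

The per-request timestamps in the blkparse output should show where the extra latency is being added between the two kernels.
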
>
> Fedora42 kernel issue
>
> While my difference is not as severe we do see a consistently lower
> performance on the Fedora kernel. (6.14.5-300.fc42.x86_64)
>
> When I remove the software raid and run against a single NVME we
> converge to be much closer.
> Also latest upstream does not show this regression either.
>
> Not sure yet what is in our Fedora kernel causing this.
> We will work it via the Bugzilla
>
> Regards
> Laurence
>
> TLDR
>
> Fedora Kernel
> -------------
> [root@penguin9 blktracefedora]# uname -a
> Linux penguin9.2 6.14.5-300.fc42.x86_64 #1 SMP PREEMPT_DYNAMIC Fri May 2 14:16:46 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
>
> 5 runs of the fio against /dev/md1
>
> [root@penguin9 ~]# for i in 1 2 3 4 5
> > do
> > ./run_fio.sh | grep -A1 "Run status group"
> > done
> Run status group 0 (all jobs):
>    READ: bw=11.3GiB/s (12.2GB/s), 11.3GiB/s-11.3GiB/s (12.2GB/s-12.2GB/s), io=679GiB (729GB), run=60001-60001msec
> Run status group 0 (all jobs):
>    READ: bw=11.2GiB/s (12.0GB/s), 11.2GiB/s-11.2GiB/s (12.0GB/s-12.0GB/s), io=669GiB (718GB), run=60001-60001msec
> Run status group 0 (all jobs):
>    READ: bw=11.4GiB/s (12.2GB/s), 11.4GiB/s-11.4GiB/s (12.2GB/s-12.2GB/s), io=682GiB (733GB), run=60001-60001msec
> Run status group 0 (all jobs):
>    READ: bw=11.1GiB/s (11.9GB/s), 11.1GiB/s-11.1GiB/s (11.9GB/s-11.9GB/s), io=664GiB (713GB), run=60001-60001msec
> Run status group 0 (all jobs):
>    READ: bw=11.3GiB/s (12.1GB/s), 11.3GiB/s-11.3GiB/s (12.1GB/s-12.1GB/s), io=678GiB (728GB), run=60001-60001msec
>
> RHEL9.5
> -------
> Linux penguin9.2 5.14.0-503.40.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Apr 24 08:27:29 EDT 2025 x86_64 x86_64 x86_64 GNU/Linux
>
> [root@penguin9 ~]# for i in 1 2 3 4 5; do ./run_fio.sh | grep -A1 "Run status group"; done
> Run status group 0 (all jobs):
>    READ: bw=14.9GiB/s (16.0GB/s), 14.9GiB/s-14.9GiB/s (16.0GB/s-16.0GB/s), io=894GiB (960GB), run=60003-60003msec
> Run status group 0 (all jobs):
>    READ: bw=14.6GiB/s (15.6GB/s), 14.6GiB/s-14.6GiB/s (15.6GB/s-15.6GB/s), io=873GiB (938GB), run=60003-60003msec
> Run status group 0 (all jobs):
>    READ: bw=14.9GiB/s (16.0GB/s), 14.9GiB/s-14.9GiB/s (16.0GB/s-16.0GB/s), io=892GiB (958GB), run=60003-60003msec
> Run status group 0 (all jobs):
>    READ: bw=14.5GiB/s (15.6GB/s), 14.5GiB/s-14.5GiB/s (15.6GB/s-15.6GB/s), io=872GiB (936GB), run=60003-60003msec
> Run status group 0 (all jobs):
>    READ: bw=14.7GiB/s (15.8GB/s), 14.7GiB/s-14.7GiB/s (15.8GB/s-15.8GB/s), io=884GiB (950GB), run=60003-60003msec
>
> Remove software raid from the layers and test just on a single nvme
> -------------------------------------------------------------------
>
> fio --name=test --rw=read --bs=256k --filename=/dev/nvme23n1
> --direct=1 --numjobs=1 --iodepth=64 --exitall --group_reporting
> --ioengine=libaio --runtime=60 --time_based
>
> Linux penguin9.2 5.14.0-503.40.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Apr 24 08:27:29 EDT 2025 x86_64 x86_64 x86_64 GNU/Linux
>
> [root@penguin9 ~]# ./run_nvme_fio.sh
>
> Run status group 0 (all jobs):
>    READ: bw=3207MiB/s (3363MB/s), 3207MiB/s-3207MiB/s (3363MB/s-3363MB/s), io=188GiB (202GB), run=60005-60005msec
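
For reference, run_fio.sh runs the fio job against /dev/md1 and run_nvme_fio.sh runs the same job against the single nvme device. The exact scripts are not shown here; a rough sketch, assuming they reuse the same job parameters as the fio command above:

  #!/bin/bash
  # run_fio.sh (sketch): direct-I/O sequential 256k reads for 60s
  # against the md raid0 device; swap /dev/md1 for /dev/nvme23n1
  # to get the run_nvme_fio.sh variant.
  fio --name=test --rw=read --bs=256k --filename=/dev/md1 \
      --direct=1 --numjobs=1 --iodepth=64 --exitall --group_reporting \
      --ioengine=libaio --runtime=60 --time_based
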
>
> Back to fedora kernel
>
> [root@penguin9 ~]# uname -a
> Linux penguin9.2 6.14.5-300.fc42.x86_64 #1 SMP PREEMPT_DYNAMIC Fri May 2 14:16:46 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
>
> Within the margin of error
>
> Run status group 0 (all jobs):
>    READ: bw=3061MiB/s (3210MB/s), 3061MiB/s-3061MiB/s (3210MB/s-3210MB/s), io=179GiB (193GB), run=60006-60006msec
>
> Try recent upstream kernel
> --------------------------
> [root@penguin9 ~]# uname -a
> Linux penguin9.2 6.13.0-rc7+ #2 SMP PREEMPT_DYNAMIC Mon May 5 10:59:12 EDT 2025 x86_64 x86_64 x86_64 GNU/Linux
>
> [root@penguin9 ~]# for i in 1 2 3 4 5; do ./run_fio.sh | grep -A1 "Run status group"; done
> Run status group 0 (all jobs):
>    READ: bw=14.6GiB/s (15.7GB/s), 14.6GiB/s-14.6GiB/s (15.7GB/s-15.7GB/s), io=876GiB (941GB), run=60003-60003msec
> Run status group 0 (all jobs):
>    READ: bw=14.8GiB/s (15.9GB/s), 14.8GiB/s-14.8GiB/s (15.9GB/s-15.9GB/s), io=891GiB (957GB), run=60003-60003msec
> Run status group 0 (all jobs):
>    READ: bw=14.8GiB/s (15.9GB/s), 14.8GiB/s-14.8GiB/s (15.9GB/s-15.9GB/s), io=890GiB (956GB), run=60003-60003msec
> Run status group 0 (all jobs):
>    READ: bw=14.5GiB/s (15.6GB/s), 14.5GiB/s-14.5GiB/s (15.6GB/s-15.6GB/s), io=871GiB (935GB), run=60003-60003msec
>
> Update to latest upstream
> -------------------------
>
> [root@penguin9 ~]# uname -a
> Linux penguin9.2 6.15.0-rc5 #1 SMP PREEMPT_DYNAMIC Mon May 5 12:18:22 EDT 2025 x86_64 x86_64 x86_64 GNU/Linux
>
> Single nvme device is once again fine
>
> Run status group 0 (all jobs):
>    READ: bw=3061MiB/s (3210MB/s), 3061MiB/s-3061MiB/s (3210MB/s-3210MB/s), io=179GiB (193GB), run=60006-60006msec
>
> [root@penguin9 ~]# for i in 1 2 3 4 5; do ./run_fio.sh | grep -A1 "Run status group"; done
> Run status group 0 (all jobs):
>    READ: bw=14.7GiB/s (15.7GB/s), 14.7GiB/s-14.7GiB/s (15.7GB/s-15.7GB/s), io=880GiB (945GB), run=60003-60003msec
> Run status group 0 (all jobs):
>    READ: bw=18.1GiB/s (19.4GB/s), 18.1GiB/s-18.1GiB/s (19.4GB/s-19.4GB/s), io=1087GiB (1167GB), run=60003-60003msec
> Run status group 0 (all jobs):
>    READ: bw=18.0GiB/s (19.4GB/s), 18.0GiB/s-18.0GiB/s (19.4GB/s-19.4GB/s), io=1082GiB (1162GB), run=60003-60003msec
> Run status group 0 (all jobs):
>    READ: bw=18.2GiB/s (19.5GB/s), 18.2GiB/s-18.2GiB/s (19.5GB/s-19.5GB/s), io=1090GiB (1170GB), run=60005-60005msec

This fell off my radar, I apologize; I was on PTO last week. Here is the Fedora kernel to install, as mentioned:

https://people.redhat.com/loberman/customer/.fedora/

tar hxvf fedora_kernel.tar.xz
rpm -ivh --force --nodeps *.rpm
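
If it is easier to script, the same steps look roughly like this (assumes the tarball from the URL above has already been downloaded into an empty working directory; --force --nodeps skips the usual dependency checks, so treat this as test-box only):

  #!/bin/bash
  set -e
  # unpack the kernel RPMs from the tarball linked above
  tar hxvf fedora_kernel.tar.xz
  # force-install the test kernel packages alongside the running kernel
  rpm -ivh --force --nodeps *.rpm
  # reboot into the new kernel, then confirm with "uname -r" before
  # re-running the fio comparisons
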