On Mon, 2025-05-05 at 13:39 -0400, Laurence Oberman wrote:
> On Mon, 2025-05-05 at 09:21 -0400, Laurence Oberman wrote:
> > On Mon, 2025-05-05 at 08:29 -0400, Laurence Oberman wrote:
> > > On Mon, 2025-05-05 at 07:50 +1000, Dave Chinner wrote:
> > > > [cc linux-block]
> > > >
> > > > [original bug report:
> > > > https://lore.kernel.org/linux-xfs/CAAiJnjoo0--yp47UKZhbu8sNSZN6DZ-QzmZBMmtr1oC=fOOgAQ@xxxxxxxxxxxxxx/
> > > > ]
> > > >
> > > > On Sun, May 04, 2025 at 10:22:58AM +0300, Anton Gavriliuk wrote:
> > > > > > What's the comparitive performance of an identical read
> > > > > > profile directly on the raw MD raid0 device?
> > > > >
> > > > > Rocky 9.5 (5.14.0-503.40.1.el9_5.x86_64)
> > > > >
> > > > > [root@localhost ~]# df -mh /mnt
> > > > > Filesystem      Size  Used Avail Use% Mounted on
> > > > > /dev/md127       35T  1.3T   34T   4% /mnt
> > > > >
> > > > > [root@localhost ~]# fio --name=test --rw=read --bs=256k
> > > > > --filename=/dev/md127 --direct=1 --numjobs=1 --iodepth=64
> > > > > --exitall --group_reporting --ioengine=libaio --runtime=30
> > > > > --time_based
> > > > > test: (g=0): rw=read, bs=(R) 256KiB-256KiB, (W) 256KiB-256KiB,
> > > > > (T) 256KiB-256KiB, ioengine=libaio, iodepth=64
> > > > > fio-3.39-44-g19d9
> > > > > Starting 1 process
> > > > > Jobs: 1 (f=1): [R(1)][100.0%][r=81.4GiB/s][r=334k IOPS][eta 00m:00s]
> > > > > test: (groupid=0, jobs=1): err= 0: pid=43189: Sun May 4 08:22:12 2025
> > > > >   read: IOPS=363k, BW=88.5GiB/s (95.1GB/s)(2656GiB/30001msec)
> > > > >     slat (nsec): min=971, max=312380, avg=1817.92, stdev=1367.75
> > > > >     clat (usec): min=78, max=1351, avg=174.46, stdev=28.86
> > > > >      lat (usec): min=80, max=1352, avg=176.27, stdev=28.81
> > > > >
> > > > > Fedora 42 (6.14.5-300.fc42.x86_64)
> > > > >
> > > > > [root@localhost anton]# df -mh /mnt
> > > > > Filesystem      Size  Used Avail Use% Mounted on
> > > > > /dev/md127       35T  1.3T   34T   4% /mnt
> > > > >
> > > > > [root@localhost ~]# fio --name=test --rw=read --bs=256k
> > > > > --filename=/dev/md127 --direct=1 --numjobs=1 --iodepth=64
> > > > > --exitall --group_reporting --ioengine=libaio --runtime=30
> > > > > --time_based
> > > > > test: (g=0): rw=read, bs=(R) 256KiB-256KiB, (W) 256KiB-256KiB,
> > > > > (T) 256KiB-256KiB, ioengine=libaio, iodepth=64
> > > > > fio-3.39-44-g19d9
> > > > > Starting 1 process
> > > > > Jobs: 1 (f=1): [R(1)][100.0%][r=41.0GiB/s][r=168k IOPS][eta 00m:00s]
> > > > > test: (groupid=0, jobs=1): err= 0: pid=5685: Sun May 4 10:14:00 2025
> > > > >   read: IOPS=168k, BW=41.0GiB/s (44.1GB/s)(1231GiB/30001msec)
> > > > >     slat (usec): min=3, max=273, avg= 5.63, stdev= 1.48
> > > > >     clat (usec): min=67, max=2800, avg=374.99, stdev=29.90
> > > > >      lat (usec): min=72, max=2914, avg=380.62, stdev=30.22
> > > >
> > > > So the MD block device shows the same read performance as the
> > > > filesystem on top of it. That means this is a regression at the
> > > > MD device layer or in the block/driver layers below it. i.e. it
> > > > is not an XFS of filesystem issue at all.
> > > >
> > > > -Dave.
> > >
> > > I have a lab setup, let me see if I can also reproduce and then
> > > trace this to see where it is spending the time
> >
> > Not seeing 1/2 the bandwidth but also significantly slower on
> > Fedora42 kernel.
> > I will trace it
> >
> > 9.5 kernel - 5.14.0-503.40.1.el9_5.x86_64
> >
> > Run status group 0 (all jobs):
> >    READ: bw=14.7GiB/s (15.8GB/s), 14.7GiB/s-14.7GiB/s (15.8GB/s-15.8GB/s), io=441GiB (473GB), run=30003-30003msec
> >
> > Fedora42 kernel - 6.14.5-300.fc42.x86_64
> >
> > Run status group 0 (all jobs):
> >    READ: bw=10.4GiB/s (11.2GB/s), 10.4GiB/s-10.4GiB/s (11.2GB/s-11.2GB/s), io=313GiB (336GB), run=30001-30001msec
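
For anyone wanting to follow along with the "I will trace it" step above: that is just a block-layer trace of the md device while the fio job is running. A minimal sketch, assuming the usual blktrace/blkparse tools; the device name, runtime and output name here are only examples, adjust to your array:

  # capture 30 seconds of block-layer events on the md device
  blktrace -d /dev/md1 -w 30 -o mdtrace
  # merge and decode the per-cpu trace files for reading
  blkparse -i mdtrace | less

The per-request timestamps in the blkparse output should show where the extra latency is being added between the two kernels.
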
>
> Fedora42 kernel issue
>
> While my difference is not as severe we do see a consistently lower
> performance on the Fedora kernel. (6.14.5-300.fc42.x86_64)
>
> When I remove the software raid and run against a single NVME we
> converge to be much closer.
> Also latest upstream does not show this regression either.
>
> Not sure yet what is in our Fedora kernel causing this.
> We will work it via the Bugzilla
>
> Regards
> Laurence
>
> TLDR
>
> Fedora Kernel
> -------------
> [root@penguin9 blktracefedora]# uname -a
> Linux penguin9.2 6.14.5-300.fc42.x86_64 #1 SMP PREEMPT_DYNAMIC Fri May 2 14:16:46 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
>
> 5 runs of the fio against /dev/md1
>
> [root@penguin9 ~]# for i in 1 2 3 4 5
> > do
> > ./run_fio.sh | grep -A1 "Run status group"
> > done
> Run status group 0 (all jobs):
>    READ: bw=11.3GiB/s (12.2GB/s), 11.3GiB/s-11.3GiB/s (12.2GB/s-12.2GB/s), io=679GiB (729GB), run=60001-60001msec
> Run status group 0 (all jobs):
>    READ: bw=11.2GiB/s (12.0GB/s), 11.2GiB/s-11.2GiB/s (12.0GB/s-12.0GB/s), io=669GiB (718GB), run=60001-60001msec
> Run status group 0 (all jobs):
>    READ: bw=11.4GiB/s (12.2GB/s), 11.4GiB/s-11.4GiB/s (12.2GB/s-12.2GB/s), io=682GiB (733GB), run=60001-60001msec
> Run status group 0 (all jobs):
>    READ: bw=11.1GiB/s (11.9GB/s), 11.1GiB/s-11.1GiB/s (11.9GB/s-11.9GB/s), io=664GiB (713GB), run=60001-60001msec
> Run status group 0 (all jobs):
>    READ: bw=11.3GiB/s (12.1GB/s), 11.3GiB/s-11.3GiB/s (12.1GB/s-12.1GB/s), io=678GiB (728GB), run=60001-60001msec
>
> RHEL9.5
> -------
> Linux penguin9.2 5.14.0-503.40.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Apr 24 08:27:29 EDT 2025 x86_64 x86_64 x86_64 GNU/Linux
>
> [root@penguin9 ~]# for i in 1 2 3 4 5; do ./run_fio.sh | grep -A1 "Run status group"; done
> Run status group 0 (all jobs):
>    READ: bw=14.9GiB/s (16.0GB/s), 14.9GiB/s-14.9GiB/s (16.0GB/s-16.0GB/s), io=894GiB (960GB), run=60003-60003msec
> Run status group 0 (all jobs):
>    READ: bw=14.6GiB/s (15.6GB/s), 14.6GiB/s-14.6GiB/s (15.6GB/s-15.6GB/s), io=873GiB (938GB), run=60003-60003msec
> Run status group 0 (all jobs):
>    READ: bw=14.9GiB/s (16.0GB/s), 14.9GiB/s-14.9GiB/s (16.0GB/s-16.0GB/s), io=892GiB (958GB), run=60003-60003msec
> Run status group 0 (all jobs):
>    READ: bw=14.5GiB/s (15.6GB/s), 14.5GiB/s-14.5GiB/s (15.6GB/s-15.6GB/s), io=872GiB (936GB), run=60003-60003msec
> Run status group 0 (all jobs):
>    READ: bw=14.7GiB/s (15.8GB/s), 14.7GiB/s-14.7GiB/s (15.8GB/s-15.8GB/s), io=884GiB (950GB), run=60003-60003msec
>
> Remove software raid from the layers and test just on a single nvme
> -------------------------------------------------------------------
>
> fio --name=test --rw=read --bs=256k --filename=/dev/nvme23n1
> --direct=1 --numjobs=1 --iodepth=64 --exitall --group_reporting
> --ioengine=libaio --runtime=60 --time_based
>
> Linux penguin9.2 5.14.0-503.40.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Apr 24 08:27:29 EDT 2025 x86_64 x86_64 x86_64 GNU/Linux
>
> [root@penguin9 ~]# ./run_nvme_fio.sh
>
> Run status group 0 (all jobs):
>    READ: bw=3207MiB/s (3363MB/s), 3207MiB/s-3207MiB/s (3363MB/s-3363MB/s), io=188GiB (202GB), run=60005-60005msec
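
For reference, run_fio.sh runs the fio job against /dev/md1 and run_nvme_fio.sh runs the same job against the single nvme device. The exact scripts are not shown here; a rough sketch, assuming they reuse the same job parameters as the fio command above:

  #!/bin/bash
  # run_fio.sh (sketch): direct-I/O sequential 256k reads for 60s
  # against the md raid0 device; swap /dev/md1 for /dev/nvme23n1
  # to get the run_nvme_fio.sh variant.
  fio --name=test --rw=read --bs=256k --filename=/dev/md1 \
      --direct=1 --numjobs=1 --iodepth=64 --exitall --group_reporting \
      --ioengine=libaio --runtime=60 --time_based
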
>
> Back to fedora kernel
>
> [root@penguin9 ~]# uname -a
> Linux penguin9.2 6.14.5-300.fc42.x86_64 #1 SMP PREEMPT_DYNAMIC Fri May 2 14:16:46 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
>
> Within the margin of error
>
> Run status group 0 (all jobs):
>    READ: bw=3061MiB/s (3210MB/s), 3061MiB/s-3061MiB/s (3210MB/s-3210MB/s), io=179GiB (193GB), run=60006-60006msec
>
> Try recent upstream kernel
> --------------------------
> [root@penguin9 ~]# uname -a
> Linux penguin9.2 6.13.0-rc7+ #2 SMP PREEMPT_DYNAMIC Mon May 5 10:59:12 EDT 2025 x86_64 x86_64 x86_64 GNU/Linux
>
> [root@penguin9 ~]# for i in 1 2 3 4 5; do ./run_fio.sh | grep -A1 "Run status group"; done
> Run status group 0 (all jobs):
>    READ: bw=14.6GiB/s (15.7GB/s), 14.6GiB/s-14.6GiB/s (15.7GB/s-15.7GB/s), io=876GiB (941GB), run=60003-60003msec
> Run status group 0 (all jobs):
>    READ: bw=14.8GiB/s (15.9GB/s), 14.8GiB/s-14.8GiB/s (15.9GB/s-15.9GB/s), io=891GiB (957GB), run=60003-60003msec
> Run status group 0 (all jobs):
>    READ: bw=14.8GiB/s (15.9GB/s), 14.8GiB/s-14.8GiB/s (15.9GB/s-15.9GB/s), io=890GiB (956GB), run=60003-60003msec
> Run status group 0 (all jobs):
>    READ: bw=14.5GiB/s (15.6GB/s), 14.5GiB/s-14.5GiB/s (15.6GB/s-15.6GB/s), io=871GiB (935GB), run=60003-60003msec
>
> Update to latest upstream
> -------------------------
>
> [root@penguin9 ~]# uname -a
> Linux penguin9.2 6.15.0-rc5 #1 SMP PREEMPT_DYNAMIC Mon May 5 12:18:22 EDT 2025 x86_64 x86_64 x86_64 GNU/Linux
>
> Single nvme device is once again fine
>
> Run status group 0 (all jobs):
>    READ: bw=3061MiB/s (3210MB/s), 3061MiB/s-3061MiB/s (3210MB/s-3210MB/s), io=179GiB (193GB), run=60006-60006msec
>
> [root@penguin9 ~]# for i in 1 2 3 4 5; do ./run_fio.sh | grep -A1 "Run status group"; done
> Run status group 0 (all jobs):
>    READ: bw=14.7GiB/s (15.7GB/s), 14.7GiB/s-14.7GiB/s (15.7GB/s-15.7GB/s), io=880GiB (945GB), run=60003-60003msec
> Run status group 0 (all jobs):
>    READ: bw=18.1GiB/s (19.4GB/s), 18.1GiB/s-18.1GiB/s (19.4GB/s-19.4GB/s), io=1087GiB (1167GB), run=60003-60003msec
> Run status group 0 (all jobs):
>    READ: bw=18.0GiB/s (19.4GB/s), 18.0GiB/s-18.0GiB/s (19.4GB/s-19.4GB/s), io=1082GiB (1162GB), run=60003-60003msec
> Run status group 0 (all jobs):
>    READ: bw=18.2GiB/s (19.5GB/s), 18.2GiB/s-18.2GiB/s (19.5GB/s-19.5GB/s), io=1090GiB (1170GB), run=60005-60005msec

This fell off my radar, I apologize; I was on PTO last week. Here is the Fedora kernel to install, as mentioned:

https://people.redhat.com/loberman/customer/.fedora/

tar hxvf fedora_kernel.tar.xz
rpm -ivh --force --nodeps *.rpm
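
If it is easier to script, the same steps look roughly like this (assumes the tarball from the URL above has already been downloaded into an empty working directory; --force --nodeps skips the usual dependency checks, so treat this as test-box only):

  #!/bin/bash
  set -e
  # unpack the kernel RPMs from the tarball linked above
  tar hxvf fedora_kernel.tar.xz
  # force-install the test kernel packages alongside the running kernel
  rpm -ivh --force --nodeps *.rpm
  # reboot into the new kernel, then confirm with "uname -r" before
  # re-running the fio comparisons
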