> So is this MD chunk size related? i.e. what is the chunk size of
> the MD device? Is it smaller than the IO size (256kB) or larger?
> Does the regression go away if the chunk size matches the IO size,
> or if the IO size vs chunk size relationship is reversed?

According to the output below, the chunk size is 512K.

[root@localhost anton]# mdadm -D /dev/md127
/dev/md127:
           Version : 1.2
     Creation Time : Thu Apr 17 14:58:23 2025
        Raid Level : raid0
        Array Size : 37505814528 (34.93 TiB 38.41 TB)
      Raid Devices : 12
     Total Devices : 12
       Persistence : Superblock is persistent

       Update Time : Thu Apr 17 14:58:23 2025
             State : clean
    Active Devices : 12
   Working Devices : 12
    Failed Devices : 0
     Spare Devices : 0

            Layout : original
        Chunk Size : 512K

Consistency Policy : none

              Name : localhost.localdomain:127  (local to host localhost.localdomain)
              UUID : 2fadc96b:f37753af:f3b528a0:067c320d
            Events : 0

    Number   Major   Minor   RaidDevice State
       0     259       15        0      active sync   /dev/nvme7n1
       1     259       27        1      active sync   /dev/nvme0n1
       2     259       10        2      active sync   /dev/nvme1n1
       3     259       28        3      active sync   /dev/nvme2n1
       4     259       13        4      active sync   /dev/nvme8n1
       5     259       22        5      active sync   /dev/nvme5n1
       6     259       26        6      active sync   /dev/nvme3n1
       7     259       16        7      active sync   /dev/nvme4n1
       8     259       24        8      active sync   /dev/nvme9n1
       9     259       14        9      active sync   /dev/nvme10n1
      10     259       25       10      active sync   /dev/nvme11n1
      11     259       12       11      active sync   /dev/nvme12n1

[root@localhost anton]# uname -r
6.14.5-300.fc42.x86_64

[root@localhost anton]# cat /proc/mdstat
Personalities : [raid0]
md127 : active raid0 nvme4n1[7] nvme1n1[2] nvme12n1[11] nvme7n1[0] nvme9n1[8] nvme11n1[10] nvme2n1[3] nvme8n1[4] nvme0n1[1] nvme5n1[5] nvme3n1[6] nvme10n1[9]
      37505814528 blocks super 1.2 512k chunks

unused devices: <none>
[root@localhost anton]#

When the I/O size is less than 512K:

[root@localhost ~]# fio --name=test --rw=read --bs=256k --filename=/dev/md127 --direct=1 --numjobs=1 --iodepth=64 --exitall --group_reporting --ioengine=libaio --runtime=30 --time_based
test: (g=0): rw=read, bs=(R) 256KiB-256KiB, (W) 256KiB-256KiB, (T) 256KiB-256KiB, ioengine=libaio, iodepth=64
fio-3.39-44-g19d9
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=48.1GiB/s][r=197k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=14340: Tue May  6 13:59:23 2025
  read: IOPS=197k, BW=48.0GiB/s (51.6GB/s)(1441GiB/30001msec)
    slat (usec): min=3, max=1041, avg= 4.74, stdev= 1.48
    clat (usec): min=76, max=2042, avg=320.30, stdev=26.82
     lat (usec): min=79, max=2160, avg=325.04, stdev=27.08

When the I/O size is greater than 512K:

[root@localhost ~]# fio --name=test --rw=read --bs=1024k --filename=/dev/md127 --direct=1 --numjobs=1 --iodepth=64 --exitall --group_reporting --ioengine=libaio --runtime=30 --time_based
test: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=64
fio-3.39-44-g19d9
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=63.7GiB/s][r=65.2k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=14395: Tue May  6 14:00:28 2025
  read: IOPS=64.6k, BW=63.0GiB/s (67.7GB/s)(1891GiB/30001msec)
    slat (usec): min=9, max=1045, avg=15.12, stdev= 3.84
    clat (usec): min=81, max=18494, avg=975.87, stdev=112.11
     lat (usec): min=96, max=18758, avg=990.99, stdev=113.49

But this is still much worse than with 256k on Rocky 9.5.

Anton
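One way to see exactly where the throughput drops relative to the 512K chunk would be a block-size sweep over the same device with the same job options; a rough sketch (the block-size list and the 10s runtime are illustrative choices, not taken from the runs above):

# Sweep read block sizes across the 512K chunk boundary on /dev/md127
# and print just the aggregate bandwidth line for each size.
for bs in 128k 256k 512k 1024k 2048k; do
    echo "=== bs=$bs ==="
    fio --name=sweep --rw=read --bs=$bs --filename=/dev/md127 \
        --direct=1 --numjobs=1 --iodepth=64 --ioengine=libaio \
        --runtime=10 --time_based --group_reporting | grep 'READ: bw='
done
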
Tue, 6 May 2025 at 01:56, Dave Chinner <david@xxxxxxxxxxxxx>:
>
> On Mon, May 05, 2025 at 09:21:19AM -0400, Laurence Oberman wrote:
> > On Mon, 2025-05-05 at 08:29 -0400, Laurence Oberman wrote:
> > > On Mon, 2025-05-05 at 07:50 +1000, Dave Chinner wrote:
> > > > So the MD block device shows the same read performance as the
> > > > filesystem on top of it. That means this is a regression at the MD
> > > > device layer or in the block/driver layers below it. i.e. it is not
> > > > an XFS or filesystem issue at all.
> > > >
> > > > -Dave.
> > >
> > > I have a lab setup, let me see if I can also reproduce and then trace
> > > this to see where it is spending the time
> > >
> >
> > Not seeing 1/2 the bandwidth but also significantly slower on Fedora42
> > kernel.
> > I will trace it
> >
> > 9.5 kernel - 5.14.0-503.40.1.el9_5.x86_64
> >
> > Run status group 0 (all jobs):
> >    READ: bw=14.7GiB/s (15.8GB/s), 14.7GiB/s-14.7GiB/s (15.8GB/s-
> > 15.8GB/s), io=441GiB (473GB), run=30003-30003msec
> >
> > Fedora42 kernel - 6.14.5-300.fc42.x86_64
> >
> > Run status group 0 (all jobs):
> >    READ: bw=10.4GiB/s (11.2GB/s), 10.4GiB/s-10.4GiB/s (11.2GB/s-
> > 11.2GB/s), io=313GiB (336GB), run=30001-30001msec
>
> So is this MD chunk size related? i.e. what is the chunk size of
> the MD device? Is it smaller than the IO size (256kB) or larger?
> Does the regression go away if the chunk size matches the IO size,
> or if the IO size vs chunk size relationship is reversed?
>
> -Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
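If the array can be torn down for testing, the chunk-size-matches-IO-size case could be checked directly by recreating it with a 256K chunk and rerunning the same 256k fio job; a rough sketch, destructive and assuming the same 12 NVMe members listed above:

# WARNING: this destroys the existing array and any data on it.
mdadm --stop /dev/md127
mdadm --create /dev/md127 --level=0 --raid-devices=12 --chunk=256 \
    /dev/nvme7n1 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme8n1 /dev/nvme5n1 \
    /dev/nvme3n1 /dev/nvme4n1 /dev/nvme9n1 /dev/nvme10n1 /dev/nvme11n1 /dev/nvme12n1
fio --name=test --rw=read --bs=256k --filename=/dev/md127 --direct=1 \
    --numjobs=1 --iodepth=64 --ioengine=libaio --runtime=30 --time_based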