> So is this MD chunk size related? i.e. what is the chunk size of
> the MD device? Is it smaller than the IO size (256kB) or larger?
> Does the regression go away if the chunk size matches the IO size,
> or if the IO size vs chunk size relationship is reversed?

According to the output below, the chunk size is 512K.

[root@localhost anton]# mdadm -D /dev/md127
/dev/md127:
           Version : 1.2
     Creation Time : Thu Apr 17 14:58:23 2025
        Raid Level : raid0
        Array Size : 37505814528 (34.93 TiB 38.41 TB)
      Raid Devices : 12
     Total Devices : 12
       Persistence : Superblock is persistent

       Update Time : Thu Apr 17 14:58:23 2025
             State : clean
    Active Devices : 12
   Working Devices : 12
    Failed Devices : 0
     Spare Devices : 0

            Layout : original
        Chunk Size : 512K

Consistency Policy : none

              Name : localhost.localdomain:127  (local to host localhost.localdomain)
              UUID : 2fadc96b:f37753af:f3b528a0:067c320d
            Events : 0

    Number   Major   Minor   RaidDevice State
       0     259       15        0      active sync   /dev/nvme7n1
       1     259       27        1      active sync   /dev/nvme0n1
       2     259       10        2      active sync   /dev/nvme1n1
       3     259       28        3      active sync   /dev/nvme2n1
       4     259       13        4      active sync   /dev/nvme8n1
       5     259       22        5      active sync   /dev/nvme5n1
       6     259       26        6      active sync   /dev/nvme3n1
       7     259       16        7      active sync   /dev/nvme4n1
       8     259       24        8      active sync   /dev/nvme9n1
       9     259       14        9      active sync   /dev/nvme10n1
      10     259       25       10      active sync   /dev/nvme11n1
      11     259       12       11      active sync   /dev/nvme12n1

[root@localhost anton]# uname -r
6.14.5-300.fc42.x86_64

[root@localhost anton]# cat /proc/mdstat
Personalities : [raid0]
md127 : active raid0 nvme4n1[7] nvme1n1[2] nvme12n1[11] nvme7n1[0] nvme9n1[8] nvme11n1[10] nvme2n1[3] nvme8n1[4] nvme0n1[1] nvme5n1[5] nvme3n1[6] nvme10n1[9]
      37505814528 blocks super 1.2 512k chunks

unused devices: <none>
[root@localhost anton]#

When the I/O size is less than 512K:

[root@localhost ~]# fio --name=test --rw=read --bs=256k --filename=/dev/md127 --direct=1 --numjobs=1 --iodepth=64 --exitall --group_reporting --ioengine=libaio --runtime=30 --time_based
test: (g=0): rw=read, bs=(R) 256KiB-256KiB, (W) 256KiB-256KiB, (T) 256KiB-256KiB, ioengine=libaio, iodepth=64
fio-3.39-44-g19d9
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=48.1GiB/s][r=197k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=14340: Tue May  6 13:59:23 2025
  read: IOPS=197k, BW=48.0GiB/s (51.6GB/s)(1441GiB/30001msec)
    slat (usec): min=3, max=1041, avg= 4.74, stdev= 1.48
    clat (usec): min=76, max=2042, avg=320.30, stdev=26.82
     lat (usec): min=79, max=2160, avg=325.04, stdev=27.08

When the I/O size is greater than 512K:

[root@localhost ~]# fio --name=test --rw=read --bs=1024k --filename=/dev/md127 --direct=1 --numjobs=1 --iodepth=64 --exitall --group_reporting --ioengine=libaio --runtime=30 --time_based
test: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=64
fio-3.39-44-g19d9
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=63.7GiB/s][r=65.2k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=14395: Tue May  6 14:00:28 2025
  read: IOPS=64.6k, BW=63.0GiB/s (67.7GB/s)(1891GiB/30001msec)
    slat (usec): min=9, max=1045, avg=15.12, stdev= 3.84
    clat (usec): min=81, max=18494, avg=975.87, stdev=112.11
     lat (usec): min=96, max=18758, avg=990.99, stdev=113.49

But this is still much worse than with 256k on Rocky 9.5.

Anton
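One way to see exactly where the throughput drops relative to the 512K chunk would be a block-size sweep over the same device with the same job options; a rough sketch (the block-size list and the 10s runtime are illustrative choices, not taken from the runs above):

# Sweep read block sizes across the 512K chunk boundary on /dev/md127
# and print just the aggregate bandwidth line for each size.
for bs in 128k 256k 512k 1024k 2048k; do
    echo "=== bs=$bs ==="
    fio --name=sweep --rw=read --bs=$bs --filename=/dev/md127 \
        --direct=1 --numjobs=1 --iodepth=64 --ioengine=libaio \
        --runtime=10 --time_based --group_reporting | grep 'READ: bw='
done
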
Tue, 6 May 2025 at 01:56, Dave Chinner <david@xxxxxxxxxxxxx>:
>
> On Mon, May 05, 2025 at 09:21:19AM -0400, Laurence Oberman wrote:
> > On Mon, 2025-05-05 at 08:29 -0400, Laurence Oberman wrote:
> > > On Mon, 2025-05-05 at 07:50 +1000, Dave Chinner wrote:
> > > > So the MD block device shows the same read performance as the
> > > > filesystem on top of it. That means this is a regression at the MD
> > > > device layer or in the block/driver layers below it. i.e. it is not
> > > > an XFS or filesystem issue at all.
> > > >
> > > > -Dave.
> > >
> > > I have a lab setup, let me see if I can also reproduce and then trace
> > > this to see where it is spending the time
> > >
> >
> > Not seeing 1/2 the bandwidth but also significantly slower on Fedora42
> > kernel.
> > I will trace it
> >
> > 9.5 kernel - 5.14.0-503.40.1.el9_5.x86_64
> >
> > Run status group 0 (all jobs):
> >    READ: bw=14.7GiB/s (15.8GB/s), 14.7GiB/s-14.7GiB/s (15.8GB/s-
> > 15.8GB/s), io=441GiB (473GB), run=30003-30003msec
> >
> > Fedora42 kernel - 6.14.5-300.fc42.x86_64
> >
> > Run status group 0 (all jobs):
> >    READ: bw=10.4GiB/s (11.2GB/s), 10.4GiB/s-10.4GiB/s (11.2GB/s-
> > 11.2GB/s), io=313GiB (336GB), run=30001-30001msec
>
> So is this MD chunk size related? i.e. what is the chunk size of
> the MD device? Is it smaller than the IO size (256kB) or larger?
> Does the regression go away if the chunk size matches the IO size,
> or if the IO size vs chunk size relationship is reversed?
>
> -Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
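If the array can be torn down for testing, the chunk-size-matches-IO-size case could be checked directly by recreating it with a 256K chunk and rerunning the same 256k fio job; a rough sketch, destructive and assuming the same 12 NVMe members listed above:

# WARNING: this destroys the existing array and any data on it.
mdadm --stop /dev/md127
mdadm --create /dev/md127 --level=0 --raid-devices=12 --chunk=256 \
    /dev/nvme7n1 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme8n1 /dev/nvme5n1 \
    /dev/nvme3n1 /dev/nvme4n1 /dev/nvme9n1 /dev/nvme10n1 /dev/nvme11n1 /dev/nvme12n1
fio --name=test --rw=read --bs=256k --filename=/dev/md127 --direct=1 \
    --numjobs=1 --iodepth=64 --ioengine=libaio --runtime=30 --time_based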