> `iostat -dxm 5` output during the fio run on both kernels will give
> us some indication of the differences in IO patterns, queue depths,
> etc.

iostat files attached.

fedora 42

[root@localhost ~]# fio --name=test --rw=read --bs=256k --filename=/mnt/testfile --direct=1 --numjobs=1 --iodepth=64 --exitall --group_reporting --ioengine=libaio --runtime=30 --time_based
test: (g=0): rw=read, bs=(R) 256KiB-256KiB, (W) 256KiB-256KiB, (T) 256KiB-256KiB, ioengine=libaio, iodepth=64
fio-3.39-44-g19d9
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=43.6GiB/s][r=179k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=18826: Wed May  7 13:44:38 2025
  read: IOPS=178k, BW=43.4GiB/s (46.7GB/s)(1303GiB/30001msec)
    slat (usec): min=3, max=267, avg= 5.29, stdev= 1.62
    clat (usec): min=147, max=2549, avg=354.18, stdev=28.87
     lat (usec): min=150, max=2657, avg=359.47, stdev=29.15

rocky 9.5

[root@localhost ~]# fio --name=test --rw=read --bs=256k --filename=/mnt/testfile --direct=1 --numjobs=1 --iodepth=64 --exitall --group_reporting --ioengine=libaio --runtime=30 --time_based
test: (g=0): rw=read, bs=(R) 256KiB-256KiB, (W) 256KiB-256KiB, (T) 256KiB-256KiB, ioengine=libaio, iodepth=64
fio-3.39-44-g19d9
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=98.3GiB/s][r=403k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=10500: Wed May  7 15:16:39 2025
  read: IOPS=403k, BW=98.4GiB/s (106GB/s)(2951GiB/30001msec)
    slat (nsec): min=1101, max=156185, avg=2087.89, stdev=1415.57
    clat (usec): min=82, max=951, avg=156.56, stdev=20.19
     lat (usec): min=83, max=1078, avg=158.65, stdev=20.25

> Silly question: if you use DM to create the same RAID 0 array
> with a dm table such as:
>
> 0 75011629056 striped 12 1024 /dev/nvme7n1 0 /dev/nvme0n1 0 .... /dev/nvme12n1 0
>
> to create a similar 38TB raid 0 array, do you see the same perf
> degradation?

Will check that tomorrow.

Anton

Wed, May 7, 2025 at 00:46, Dave Chinner <david@xxxxxxxxxxxxx>:
>
> On Tue, May 06, 2025 at 02:03:37PM +0300, Anton Gavriliuk wrote:
> > > So is this MD chunk size related? i.e. what is the chunk size
> > > the MD device? Is it smaller than the IO size (256kB) or larger?
> > > Does the regression go away if the chunk size matches the IO size,
> > > or if the IO size vs chunk size relationship is reversed?
> >
> > According to the output below, the chunk size is 512K,
>
> Ok.
>
> `iostat -dxm 5` output during the fio run on both kernels will give
> us some indication of the differences in IO patterns, queue depths,
> etc.
>
> Silly question: if you use DM to create the same RAID 0 array
> with a dm table such as:
>
> 0 75011629056 striped 12 1024 /dev/nvme7n1 0 /dev/nvme0n1 0 .... /dev/nvme12n1 0
>
> to create a similar 38TB raid 0 array, do you see the same perf
> degradation?
>
> -Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
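[For readers following along: the dm-striped table quoted above could be generated and loaded with `dmsetup` roughly as follows. This is a hedged sketch, not a command from the thread; the device list here is a hypothetical two-disk stand-in (extend it to all 12 NVMe devices), while the sector count and the 1024-sector (512KiB) chunk come from the table in the email.]

```shell
# Sketch: build a dm-striped table matching the 12-disk MD RAID 0 setup.
# Device names below are placeholders; adjust to your system.
SECTORS=75011629056                   # total array size in 512-byte sectors (~38TB)
CHUNK=1024                            # 512KiB stripe chunk, expressed in sectors
DEVICES="/dev/nvme0n1 /dev/nvme1n1"   # extend to the full 12-device list

NDEV=$(echo $DEVICES | wc -w)
TABLE="0 $SECTORS striped $NDEV $CHUNK"
for d in $DEVICES; do
  TABLE="$TABLE $d 0"                 # each member device starts at offset 0
done

echo "$TABLE"
# To actually create the device-mapper target (requires root):
#   echo "$TABLE" | dmsetup create dmtest0
```

The resulting `/dev/mapper/dmtest0` device could then be fed the same fio job as above to compare against the MD array.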
Attachment: rocky_95_iostat_dxm_5 (binary data)
Attachment: fedora_42_iostat_dxm_5 (binary data)