Re: Improper io_opt setting for md raid5

On 7/28/25 12:08 PM, Yu Kuai wrote:
> Hi,
> 
> 在 2025/07/28 10:41, Damien Le Moal 写道:
>> On 7/28/25 9:55 AM, Yu Kuai wrote:
>>> Hi,
>>>
>>> 在 2025/07/28 8:39, Damien Le Moal 写道:
>>>> md setting its io_opt to 64K * number of drives in the array is strange... It
>>>> does not have to be that large since io_opt is an upper bound, not an "issue
>>>> IOs of that size for optimal performance" hint. io_opt is simply a limit
>>>> saying: if you exceed that IO size, performance may suffer.
>>>>
>>>
>>> At least from the documentation, for RAID arrays a multiple of io_opt is the
>>> preferred IO size for optimal IO performance, and for raid5 this is
>>> chunksize * data disks.
>>>
>>>> So a default of stride size x number of drives for the io_opt may be OK, but
>>>> that should be bound to some reasonable value. Furthermore, this is likely
>>>> suboptimal. I would think that setting the md array io_opt initially to
>>>> min(all drives io_opt) x number of drives would be a better default.
>>>
>>> For raid5, this is not OK: the value has to be chunksize * data disks,
>>> regardless of the io_opt of the member disks, otherwise raid5 has to issue
>>> additional IOs to other disks to build the xor data.
>>>
>>> For example:
>>>
>>>   - a chunksize-aligned write to one disk actually means reading chunksize of
>>> old xor data, then writing chunksize of data and chunksize of new xor data.
>>>   - for a write aligned to chunksize * data disks, the new xor data can be
>>> built directly without reading the old xor data.
>>
>> I understand all of that. But you missed my point: io_opt simply indicates an
>> upper bound for an IO size. If exceeded, performance may be degraded. This has
>> *nothing* to do with the io granularity, which for a RAID array should ideally
>> be equal to stride size x number of data disks.
>>
>> This is the confusion here. md setting io_opt to stride x number of disks in
>> the array is simply not what io_opt is supposed to indicate.
> 
> OK, can I ask where this upper bound on IO size comes from?

SCSI SBC specifications, Block Limits VPD page (B0h):

3 values are important in there:

* OPTIMAL TRANSFER LENGTH GRANULARITY:

An OPTIMAL TRANSFER LENGTH GRANULARITY field set to a non-zero value indicates
the optimal transfer length granularity size in logical blocks for a single
command shown in the command column of table 33. If a device server receives
one of these commands with a transfer size that is not equal to a multiple of
this value, then the device server may incur delays in processing the command.
An OPTIMAL TRANSFER LENGTH GRANULARITY field set to 0000h indicates that the
device server does not report optimal transfer length granularity.

For a SCSI disk, sd.c uses this value for sdkp->min_xfer_blocks. Note that the
naming here is dubious since this is not a minimum. The minimum is the logical
block size. This is a "hint" for better performance. For a RAID array, this
should be the stripe size of the RAID volume (stride x number of data disks).
This value is used for queue->limits.io_min.
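
(A hypothetical worked example: a RAID volume exported as a single SCSI logical
unit, with a 64K stride and 4 data disks, would report an OPTIMAL TRANSFER
LENGTH GRANULARITY of 64K x 4 = 256 KiB, that is, 512 logical blocks for a
512 B logical block size, which sd.c would turn into io_min = 256 KiB.)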

* MAXIMUM TRANSFER LENGTH:

A MAXIMUM TRANSFER LENGTH field set to a non-zero value indicates the maximum
transfer length in logical blocks that the device server accepts for a single
command shown in table 33. If a device server receives one of these commands
with a transfer size greater than this value, then the device server shall
terminate the command with CHECK CONDITION status with the sense key set to
ILLEGAL REQUEST and the additional sense code set to the value shown in table
33. A MAXIMUM TRANSFER LENGTH field set to 0000_0000h indicates that the device
server does not report a limit on the transfer length.

For a SCSI disk, sd.c uses this value for sdkp->max_xfer_blocks. This is a hard
limit which will be reflected in queue->limits.max_dev_sectors
(max_hw_sectors_kb in sysfs).
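
(A hypothetical worked example: a MAXIMUM TRANSFER LENGTH of 0001_0000h =
65536 logical blocks, at 512 B per block, allows transfers of up to
65536 x 512 B = 32 MiB, i.e. a max_hw_sectors_kb of at most 32768; other
limits may lower it further.)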

* OPTIMAL TRANSFER LENGTH:

An OPTIMAL TRANSFER LENGTH field set to a non-zero value indicates the optimal
transfer size in logical blocks for a single command shown in table 33. If a
device server receives one of these commands with a transfer size greater than
this value, then the device server may incur delays in processing the command.
An OPTIMAL TRANSFER LENGTH field set to 0000_0000h indicates that the device
server does not report an optimal transfer size.

For a SCSI disk, sd.c uses this value for sdkp->opt_xfer_blocks. This value is
used for queue->limits.io_opt.
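
To make the mapping concrete, here is a rough sketch of how these three fields
sit in the Block Limits VPD page and where they end up. The field offsets are
the SBC ones that sd_read_block_limits() reads; the structure and helper names
below are mine, for illustration only, not the actual sd.c code:

#include <stdint.h>

/*
 * Relevant fields of the Block Limits VPD page (B0h). Values are in logical
 * blocks (0 means "not reported"); multiply by the logical block size to get
 * bytes.
 */
struct blk_limits_vpd {
	uint16_t opt_xfer_gran;	/* bytes 6-7:   OPTIMAL TRANSFER LENGTH GRANULARITY */
	uint32_t max_xfer_len;	/* bytes 8-11:  MAXIMUM TRANSFER LENGTH */
	uint32_t opt_xfer_len;	/* bytes 12-15: OPTIMAL TRANSFER LENGTH */
};

static inline uint16_t get_be16(const uint8_t *p)
{
	return (uint16_t)((uint16_t)p[0] << 8 | p[1]);
}

static inline uint32_t get_be32(const uint8_t *p)
{
	return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
	       (uint32_t)p[2] << 8 | p[3];
}

/* "vpd" points to the start of the B0h page returned by INQUIRY. */
static void parse_block_limits(const uint8_t *vpd, struct blk_limits_vpd *bl)
{
	bl->opt_xfer_gran = get_be16(&vpd[6]);	/* -> queue->limits.io_min */
	bl->max_xfer_len  = get_be32(&vpd[8]);	/* -> hard max transfer size */
	bl->opt_xfer_len  = get_be32(&vpd[12]);	/* -> queue->limits.io_opt */
}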

> With git log, starting from commit 7e5f5fb09e6f ("block: Update topology
> documentation"), the documentation started to contain a special explanation
> for RAID arrays, and the optimal_io_size entry says:
> 
> For RAID arrays it is usually the
> stripe width or the internal track size.  A properly aligned
> multiple of optimal_io_size is the preferred request size for
> workloads where sustained throughput is desired.
> 
> And this explanation is exactly what raid5 does: it is important that the IO
> size is an aligned multiple of io_opt.

Looking at the sysfs doc for the above fields, they are described as follows:

* /sys/block/<disk>/queue/minimum_io_size

[RO] Storage devices may report a granularity or preferred
minimum I/O size which is the smallest request the device can
perform without incurring a performance penalty.  For disk
drives this is often the physical block size.  For RAID arrays
it is often the stripe chunk size.  A properly aligned multiple
of minimum_io_size is the preferred request size for workloads
where a high number of I/O operations is desired.

So this matches the SCSI OPTIMAL TRANSFER LENGTH GRANULARITY limit, and for a
RAID array this indeed should be stride x number of data disks.
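
For reference, the values raid5 currently exports boil down to something like
the following (a minimal sketch with hypothetical names, paraphrasing the
array geometry handling rather than quoting the actual raid5.c code):

struct raid_geom {
	unsigned int chunk_bytes;	/* stride size in bytes, e.g. 64K */
	int raid_disks;			/* total number of disks in the array */
	int max_degraded;		/* 1 for raid5, 2 for raid6 */
};

static void raid_io_limits(const struct raid_geom *g,
			   unsigned int *io_min, unsigned int *io_opt)
{
	int data_disks = g->raid_disks - g->max_degraded;

	*io_min = g->chunk_bytes;		/* minimum_io_size: one chunk */
	*io_opt = g->chunk_bytes * data_disks;	/* optimal_io_size: full stripe */
}

With a 64K chunk and a 4-disk raid5 this gives io_min = 64K and io_opt = 192K,
which is the behavior being discussed in this thread.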

* /sys/block/<disk>/queue/max_hw_sectors_kb

[RO] This is the maximum number of kilobytes supported in a
single data transfer.

No problem here.

* /sys/block/<disk>/queue/optimal_io_size

Storage devices may report an optimal I/O size, which is
the device's preferred unit for sustained I/O.  This is rarely
reported for disk drives.  For RAID arrays it is usually the
stripe width or the internal track size.  A properly aligned
multiple of optimal_io_size is the preferred request size for
workloads where sustained throughput is desired.  If no optimal
I/O size is reported this file contains 0.

Well, I find this definition not correct *at all*. It repeats the definition of
minimum_io_size (limits->io_min) and completely disregards any optimal_io_size
limit of the drives in the array. For a RAID array, this value should obviously
be a multiple of minimum_io_size (the array stripe size), but it can be much
larger, since this should be an upper bound for the IO size. read_ahead_kb
being set using this value is thus not correct, I think. read_ahead_kb should
use max_sectors_kb, with alignment to minimum_io_size.
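
Concretely, the suggestion would be something along these lines (a sketch only,
with hypothetical names, not actual kernel code):

/*
 * Suggested read_ahead sizing: cap at the maximum request size and round
 * down to a multiple of the IO granularity (minimum_io_size).
 */
static unsigned int suggested_ra_kb(unsigned int max_sectors_kb,
				    unsigned int io_min_bytes)
{
	unsigned int io_min_kb = io_min_bytes / 1024;

	if (!io_min_kb)
		return max_sectors_kb;

	/* Round down to the io_min granularity. */
	return (max_sectors_kb / io_min_kb) * io_min_kb;
}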


-- 
Damien Le Moal
Western Digital Research



