On 7/28/25 4:14 PM, Yu Kuai wrote:
>>> With git log, starting from commit 7e5f5fb09e6f ("block: Update topology
>>> documentation"), the documentation starts to contain a special explanation
>>> for raid arrays, and optimal_io_size says:
>>>
>>>     For RAID arrays it is usually the
>>>     stripe width or the internal track size. A properly aligned
>>>     multiple of optimal_io_size is the preferred request size for
>>>     workloads where sustained throughput is desired.
>>>
>>> And this explanation is exactly what raid5 does; it is important that
>>> the IO size is an aligned multiple of io_opt.
>>
>> Looking at the sysfs doc for the above fields, they are described as follows:
>>
>> * /sys/block/<disk>/queue/minimum_io_size
>>
>>     [RO] Storage devices may report a granularity or preferred
>>     minimum I/O size which is the smallest request the device can
>>     perform without incurring a performance penalty. For disk
>>     drives this is often the physical block size. For RAID arrays
>>     it is often the stripe chunk size. A properly aligned multiple
>>     of minimum_io_size is the preferred request size for workloads
>>     where a high number of I/O operations is desired.
>>
>> So this matches the SCSI limit OPTIMAL TRANSFER LENGTH GRANULARITY, and for
>> a RAID array this indeed should be the stride x number of data disks.
>
> Do you mean stripe here? io_min for a raid array is always just one
> chunksize.

My bad, yes, that is the definition in sysfs. So io_min is the stride size,
where stride size x number of data disks == stripe size. Note that the
chunk_sectors limit is the *stripe* size, not the per-drive stride. Beware of
the wording here to avoid confusion (this is all already super confusing!).

Well, at least, that is how I interpret the definition of minimum_io_size in
Documentation/ABI/stable/sysfs-block. But the wording "For RAID arrays it is
often the stripe chunk size." is super confusing. I am not entirely sure if
stride or stripe was meant here...

>> * /sys/block/<disk>/queue/optimal_io_size
>>
>>     Storage devices may report an optimal I/O size, which is
>>     the device's preferred unit for sustained I/O. This is rarely
>>     reported for disk drives. For RAID arrays it is usually the
>>     stripe width or the internal track size. A properly aligned
>>     multiple of optimal_io_size is the preferred request size for
>>     workloads where sustained throughput is desired. If no optimal
>>     I/O size is reported this file contains 0.
>>
>> Well, I find this definition not correct *at all*. It repeats the definition
>> of minimum_io_size (limits->io_min) and completely disregards any
>> optimal_io_size limit reported by the drives in the array. For a raid array,
>> this value should obviously be a multiple of minimum_io_size (the array
>> stripe size), but it can be much larger, since this should be an upper bound
>> for the IO size. read_ahead_kb being set using this value is thus not
>> correct, I think. read_ahead_kb should use max_sectors_kb, with alignment to
>> minimum_io_size.
>
> I think this is actually different from io_min, and io_opt for the different
> levels is not the same. For raid0, raid10 and raid456 (raid1 doesn't have a
> chunksize):
>
> - lim.io_min = mddev->chunk_sectors << 9;

See above. Given how confusing the definition of minimum_io_size is, I am not
sure that is correct. This code assumes that io_min is the stripe size and not
the stride size.

> - lim.io_opt = lim.io_min * (number of data copies);

I do not understand what you mean with "number of data copies"... There is no
data copy in a RAID 5/6 array.
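To put concrete numbers on these two formulas, here is a quick standalone
sketch (userspace, not the actual md code) of what io_min and io_opt work out
to for an 8-disk raid5 with a 64k chunk, assuming "number of data copies" is
meant to be the number of data disks:

#include <stdio.h>

int main(void)
{
	unsigned int chunk_sectors = 128;             /* 64k chunk = per-drive stride */
	unsigned int nr_disks = 8;
	unsigned int nr_data_disks = nr_disks - 1;    /* raid5: one disk worth of parity */

	unsigned int io_min = chunk_sectors << 9;     /* 65536 bytes */
	unsigned int io_opt = io_min * nr_data_disks; /* 458752 bytes = 448k */

	printf("io_min = %u bytes, io_opt = %u bytes\n", io_min, io_opt);
	return 0;
}

That is, for this configuration io_min ends up being the per-drive stride and
io_opt the full stripe.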
> And I think they do match the definition above, specifically:
>
> - a properly aligned multiple of io_min to *prevent a performance penalty*;

Yes.

> - a properly aligned multiple of io_opt to *get optimal performance*, the
>   number of data copies times the performance of a single disk;

That is how this field is defined for RAID, but that is far from what it means
for a single disk. It is unfortunate that it was defined like that. For a
single disk, io_opt is NOT about getting optimal performance. It is about an
upper bound for the IO size to NOT get a performance penalty (e.g. due to a
DMA mapping that is too large for what the IOMMU can handle). And for a RAID
array, it means that we should always have io_min == io_opt, but it seems that
the scsi code and the limit stacking code try to make this limit an upper
bound on the IO size, aligned to the stripe size.

> The original problem is that scsi disks report an unusual io_opt of 32767,
> and raid5 sets io_opt to 64k * 7 (8 disks with a 64k chunksize). The
> lcm_not_zero() from blk_stack_limits() ends up with a huge value:
>
> blk_stack_limits()
>     t->io_min = max(t->io_min, b->io_min);
>     t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);

I understand the "problem" that was stated. There is an overflow that results
in a large io_opt and a ridiculously large read_ahead_kb. io_opt being large
should, in my opinion, not be an issue in itself, since it should be an upper
bound on the IO size and not the stripe size (io_min indicates that).

>> read_ahead_kb should use max_sectors_kb, with alignment to minimum_io_size.
>
> The io_opt is used in a raid array as the minimal aligned size to get optimal
> IO performance, not as an upper bound. With respect to this, using this value
> for ra_pages makes sense. However, if scsi is using this value as an IO upper
> bound, then it is right that this doesn't make sense.

Here is your issue. People misunderstood optimal_io_size and used it instead
of the minimum_io_size/io_min limit for the granularity/alignment of IOs.
Using optimal_io_size as the "granularity" for optimal IOs that do not require
a read-modify-write of RAID stripes is simply wrong in my opinion.
io_min/minimum_io_size is the attribute indicating that.

As for read_ahead_kb, it should be bounded by io_opt (upper bound) but should
be initialized to a smaller value aligned to io_min (if io_opt is unreasonably
large).

Given all of that and how misused io_opt seems to be, I am not sure how to fix
this, though.

--
Damien Le Moal
Western Digital Research
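For reference, here is a quick standalone sketch (userspace, not the kernel
code) of the stacking arithmetic quoted above; lcm_nz() is a local stand-in
for the kernel's lcm_not_zero(), and I am assuming both io_opt values are
expressed in bytes:

#include <stdio.h>

static unsigned long gcd(unsigned long a, unsigned long b)
{
	while (b) {
		unsigned long t = a % b;
		a = b;
		b = t;
	}
	return a;
}

/* Local stand-in for the kernel's lcm_not_zero() */
static unsigned long lcm_nz(unsigned long a, unsigned long b)
{
	if (!a || !b)
		return a | b;
	return (a / gcd(a, b)) * b;
}

int main(void)
{
	unsigned long raid5_io_opt = 64UL * 1024 * 7; /* 458752: 64k chunk, 7 data disks */
	unsigned long disk_io_opt  = 32767;           /* as reported by the scsi disks */
	unsigned long stacked = lcm_nz(raid5_io_opt, disk_io_opt);

	/* Mirrors t->io_opt = lcm_not_zero(t->io_opt, b->io_opt) in blk_stack_limits() */
	printf("stacked io_opt = %lu bytes (~%lu MiB)\n", stacked, stacked >> 20);
	return 0;
}

gcd(458752, 32767) is 7, so the stacked io_opt ends up at 65536 * 32767 =
2147418112 bytes, i.e. just under 2 GiB, which is where the ridiculously
large read_ahead_kb comes from.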