On 7/28/2025 11:02 AM, Yu Kuai wrote:
> Hi,
>
> On 2025/07/28 15:44, Damien Le Moal wrote:
>> On 7/28/25 4:14 PM, Yu Kuai wrote:
>>>>> Going by git log, starting from commit 7e5f5fb09e6f ("block: Update topology
>>>>> documentation"), the documentation contains a special explanation for
>>>>> raid arrays, and the optimal_io_size entry says:
>>>>>
>>>>> For RAID arrays it is usually the
>>>>> stripe width or the internal track size. A properly aligned
>>>>> multiple of optimal_io_size is the preferred request size for
>>>>> workloads where sustained throughput is desired.
>>>>>
>>>>> And this explanation is exactly what raid5 does; it's important that
>>>>> the IO size is an aligned multiple of io_opt.
>>>>
>>>> Looking at the sysfs doc for the above fields, they are described as follows:
>>>>
>>>> * /sys/block/<disk>/queue/minimum_io_size
>>>>
>>>> [RO] Storage devices may report a granularity or preferred
>>>> minimum I/O size which is the smallest request the device can
>>>> perform without incurring a performance penalty. For disk
>>>> drives this is often the physical block size. For RAID arrays
>>>> it is often the stripe chunk size. A properly aligned multiple
>>>> of minimum_io_size is the preferred request size for workloads
>>>> where a high number of I/O operations is desired.
>>>>
>>>> So this matches the SCSI limit OPTIMAL TRANSFER LENGTH GRANULARITY and for a
>>>> RAID array, this indeed should be the stride x number of data disks.
>>>
>>> Do you mean stripe here? io_min for a raid array is always just one
>>> chunk size.
>>
>> My bad, yes, that is the definition in sysfs. So io_min is the stride size, where:
>>
>> stride size x number of data disks == stripe_size.
>>
> Yes.
>
>> Note that the chunk_sectors limit is the *stripe* size, not the per-drive stride.
>> Beware of the wording here to avoid confusion (this is all already super
>> confusing!).
>
> This is something we're not on the same page about :( For example, an 8-disk
> raid5 with the default chunk size. Then the above calculation is:
>
> 64k * 7 = 448k
>
> The chunk size I mean is 64k...
>>
>> Well, at least, that is how I interpret the io_min definition of
>> minimum_io_size in Documentation/ABI/stable/sysfs-block. But the wording "For
>> RAID arrays it is often the stripe chunk size." is super confusing. Not
>> entirely sure if stride or stripe was meant here...
>>
>
> Hope it's clear now.
>>
>>>> * /sys/block/<disk>/queue/optimal_io_size
>>>>
>>>> Storage devices may report an optimal I/O size, which is
>>>> the device's preferred unit for sustained I/O. This is rarely
>>>> reported for disk drives. For RAID arrays it is usually the
>>>> stripe width or the internal track size. A properly aligned
>>>> multiple of optimal_io_size is the preferred request size for
>>>> workloads where sustained throughput is desired. If no optimal
>>>> I/O size is reported this file contains 0.
>>>>
>>>> Well, I find this definition not correct *at all*. This is repeating the
>>>> definition of minimum_io_size (limits->io_min) and completely disregards the
>>>> eventual optimal_io_size limit of the drives in the array. For a raid array,
>>>> this value should obviously be a multiple of minimum_io_size (the array stripe
>>>> size), but it can be much larger, since this should be an upper bound for IO
>>>> size. read_ahead_kb being set using this value is thus not correct, I think.
>>>> read_ahead_kb should use max_sectors_kb, with alignment to minimum_io_size.
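(As I understand the example above, just to keep the terminology straight for
the rest of this mail: the chunk, i.e. the per-disk stride, is 64 KiB; a stripe
has 8 - 1 = 7 data chunks, so the stripe width is 7 * 64 KiB = 448 KiB; and md
currently exports the chunk as minimum_io_size and the stripe width as
optimal_io_size.)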
>>>
>>> I think this is actually different from io_min, and io_opt for different
>>> levels is not the same; for raid0, raid10, raid456 (raid1 doesn't have a
>>> chunk size):
>>> - lim.io_min = mddev->chunk_sectors << 9;
>
> By the above example, io_min = 64k and io_opt = 448k. And to make sure
> we're on the same page, io_min is the *stride* and io_opt is the
> *stripe*.
>>
>> See above. Given how confusing the definition of minimum_io_size is, not sure
>> that is correct. This code assumes that io_min is the stripe size and not the
>> stride size.
>>
>>> - lim.io_opt = lim.io_min * (number of data copies);
>>
>> I do not understand what you mean by "number of data copies"... There is no
>> data copy in a RAID 5/6 array.
>
> Yes, this is my bad, *data disks* is the better word.
>>
>>> And I think they do match the definition above, specifically:
>>> - a properly aligned multiple of io_min to *prevent a performance penalty*;
>>
>> Yes.
>>
>>> - a properly aligned multiple of io_opt to *get optimal performance*, the
>>> number of data copies times the performance of a single disk;
>>
>> That is how this field is defined for RAID, but that is far from what it means
>> for a single disk. It is unfortunate that it was defined like that.
>>
>> For a single disk, io_opt is NOT about getting optimal performance. It is about
>> an upper bound for the IO size to NOT get a performance penalty (e.g. due to a
>> DMA mapping that is too large for what the IOMMU can handle).
>
> The name itself is misleading. :( I didn't know this definition until
> now.
>
>>
>> And for a RAID array, it means that we should always have io_min == io_opt, but
>> it seems that the scsi code and limit stacking code try to make this limit an
>> upper bound on the IO size, aligned to the stripe size.
>>
>>> The original problem is that scsi disks report an unusual io_opt of 32767,
>>> and raid5 sets io_opt to 64k * 7 (8 disks with a 64k chunk size). The
>>> lcm_not_zero() from blk_stack_limits() ends up with a huge value:
>>>
>>> blk_stack_limits()
>>> t->io_min = max(t->io_min, b->io_min);
>>> t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);
>>
>> I understand the "problem" that was stated. There is an overflow that results in
>> a large io_opt and a ridiculously large read_ahead_kb.
>> io_opt being large should in my opinion not be an issue in itself, since it
>> should be an upper bound on IO size and not the stripe size (io_min indicates
>> that).
>>
>>>> read_ahead_kb should use max_sectors_kb, with alignment to minimum_io_size.
>>>
>>> io_opt is used in raid arrays as the minimal aligned size to get optimal
>>> IO performance, not as an upper bound. With respect to this, using this
>>> value for ra_pages makes sense. However, if scsi is using this value as
>>> an IO upper bound, you're right that this doesn't make sense.
>>
>> Here is your issue. People misunderstood optimal_io_size and used that instead
>> of using the minimal_io_size/io_min limit for the granularity/alignment of IOs.
>> Using optimal_io_size as the "granularity" for optimal IOs that do not require
>> read-modify-write of RAID stripes is simply wrong in my opinion.
>> io_min/minimal_io_size is the attribute indicating that.

I'm not familiar enough with all the code using io_opt to be certain, but I
think it's a little bit the other way around. The problem definitely seems to
be that we want to use one variable for multiple different purposes. io_opt in
struct queue_limits is the "optimal I/O size", but "optimal" can mean many
things in different contexts.
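As a toy illustration of that "one variable, multiple purposes" problem (made-up
helper names, nothing like the actual kernel code), the very same io_opt number
asks for opposite behaviour depending on which reading a consumer assumes:

#include <stdio.h>

/* md-style reading: carve I/O into aligned multiples of io_opt
 * (how much of a request forms whole raid5 stripes). */
static unsigned int round_down_to_opt(unsigned int len, unsigned int io_opt)
{
	return len / io_opt * io_opt;
}

/* sd-style reading: io_opt is an upper bound, never exceed it per request. */
static unsigned int clamp_to_opt(unsigned int len, unsigned int io_opt)
{
	return len < io_opt ? len : io_opt;
}

int main(void)
{
	unsigned int io_opt = 7 * 64 * 1024;	/* 448 KiB raid5 stripe width */
	unsigned int len = 1024 * 1024;		/* a 1 MiB request */

	printf("aligned-multiple reading: issue %u bytes\n",
	       round_down_to_opt(len, io_opt));
	printf("upper-bound reading:      issue at most %u bytes\n",
	       clamp_to_opt(len, io_opt));
	return 0;
}

Both helpers "use the optimal I/O size", yet they pull in opposite directions;
that is basically the conflict the rest of this mail is about.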
In general, if I want to do some I/O, I can say that the optimal way to do it
is to make I/O requests that satisfy some condition. The condition can be many
things:

- The request size should be at least X (combining two small requests into a
  big one may be faster, or extending a small request into a bigger one may
  avoid having to do another request later).
- The request size should be at most X (for example, because DMA is
  inefficient on this system with request sizes that are too large -- this is
  the _only_ thing that shost->opt_sectors is currently set for).
- The request size should be a multiple of X, _and_ the request should have an
  alignment of X (this is what a striped md array wants).

And, of course, there can be many other "optimality" conditions. The ones
listed above all have a parameter (X), which can, in the context of that
condition, be called the "optimal I/O size". These parameters, however, can be
very different for each condition; the right parameter for one condition can
be completely inappropriate for another.

The SCSI standard may have a definition for "optimal transfer length", but
io_opt in struct queue_limits seems to have been used for a different purpose
since its introduction in 2009:

- It was introduced in commit c72758f33784 ("block: Export I/O topology for
  block devices and partitions"). The commit message specifically mentions its
  use by md:

      The io_opt characteristic indicates the optimal I/O size reported by
      the device. This is usually the stripe width for arrays.

- It actually started being set by md in commit 8f6c2e4b325a ("md: Use new
  topology calls to indicate alignment and I/O sizes"). The commit date is
  more than a month later than that of the previous commit, but as far as I
  can see, they were originally posted in the same patch series:
  https://lore.kernel.org/all/20090424123054.GA1926@xxxxxxxxxxxxxxxx/T/#u
  In the context of that patch series, md was literally the first user of
  io_opt.

- Using the lowest common multiple of the component devices and the array to
  calculate the final io_opt of the array happened in commit 70dd5bf3b999
  ("block: Stack optimal I/O size"), still in 2009.

It wasn't until commit a23634644afc ("block: take io_opt and io_min into
account for max_sectors") in 2024 that sd_revalidate_disk started to set
io_opt from what, in the context of the SCSI standard, should be called the
optimal I/O size. However, in doing so, it broke md arrays. With my setup,
this was hidden until commit 9c0ba14828d6 ("blk-settings: round down io_opt to
physical_block_size"), but nonetheless, it happened here.
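To make the breakage concrete, here is a rough user-space sketch (not the
kernel code, and assuming the 32767 reported by the disks means 32767 logical
blocks of 512 bytes) of what the lcm_not_zero() stacking quoted earlier does
with the numbers from this thread:

#include <stdio.h>

static unsigned long gcd(unsigned long a, unsigned long b)
{
	while (b) {
		unsigned long t = a % b;
		a = b;
		b = t;
	}
	return a;
}

int main(void)
{
	unsigned long array_io_opt = 7UL * 64 * 1024;	/* 458752 bytes   */
	unsigned long disk_io_opt = 32767UL * 512;	/* 16776704 bytes */
	unsigned long stacked;

	/* lcm(a, b) = a / gcd(a, b) * b */
	stacked = array_io_opt / gcd(array_io_opt, disk_io_opt) * disk_io_opt;

	/* prints 2147418112, i.e. just short of 2 GiB */
	printf("stacked io_opt = %lu bytes\n", stacked);
	return 0;
}

Each input value is individually sane, but the combined io_opt of almost 2 GiB,
and the read-ahead derived from it, is clearly not a useful "optimal I/O size"
for the array.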
> Ok, looks like there are two problems now:
>
> a) io_min, the size to prevent a performance penalty:
>
> 1) For raid5, to avoid read-modify-write, this value should be 448k,
> but now it's 64k;
> 2) For raid0/raid10, this value is set to 64k now; however, this value
> should not be set. If the value in the member disks is 4k, issuing 4k is
> just fine, there won't be any performance penalty;
> 3) For raid1, this value is not set and the member disks' value is used;
> this is correct.
>
> b) io_opt, the size to ???
>
> 4) For raid0/raid10/raid5, this value is set to the minimal IO size to
> get the best performance.
> 5) For raid1, this value is not set and the member disks' value is used.
>
> Problem a can be fixed easily, but for problem b, I'm not sure how to
> fix it; it depends on what we think io_opt is.
>
> If io_opt should be an *upper bound*, problem 4) should be fixed like case
> 5), and other places like blk_apply_bdi_limits() setting ra_pages from
> io_opt should be fixed as well.
>
> If io_opt should be the *minimal IO size to get the best performance*,
> problem 5) should be fixed like case 4), and I don't know if scsi or
> other drivers that set the initial io_opt should be changed. :(
>
> Thanks,
> Kuai
>
>>
>> As for read_ahead_kb, it should be bounded by io_opt (upper bound) but should
>> be initialized to a smaller value aligned to io_min (if io_opt is unreasonably
>> large).
>>
>> Given all of that and how misused io_opt seems to be, I am not sure how to fix
>> this though.
>>
>
Hunor Csordás