On 7/28/25 4:14 PM, Yu Kuai wrote:
>>> With git log, starting from commit 7e5f5fb09e6f ("block: Update topology
>>> documentation"), the documentation starts to contain a special explanation
>>> for raid arrays, and optimal_io_size says:
>>>
>>>     For RAID arrays it is usually the
>>>     stripe width or the internal track size. A properly aligned
>>>     multiple of optimal_io_size is the preferred request size for
>>>     workloads where sustained throughput is desired.
>>>
>>> And this explanation is exactly what raid5 does; it is important that
>>> the IO size is an aligned multiple of io_opt.
>>
>> Looking at the sysfs doc for the above fields, they are described as follows:
>>
>> * /sys/block/<disk>/queue/minimum_io_size
>>
>>     [RO] Storage devices may report a granularity or preferred
>>     minimum I/O size which is the smallest request the device can
>>     perform without incurring a performance penalty. For disk
>>     drives this is often the physical block size. For RAID arrays
>>     it is often the stripe chunk size. A properly aligned multiple
>>     of minimum_io_size is the preferred request size for workloads
>>     where a high number of I/O operations is desired.
>>
>> So this matches the SCSI limit OPTIMAL TRANSFER LENGTH GRANULARITY, and for
>> a RAID array this indeed should be the stride x number of data disks.
>
> Do you mean stripe here? io_min for a raid array is always just one
> chunksize.

My bad, yes, that is the definition in sysfs. So io_min is the stride size,
where stride size x number of data disks == stripe size. Note that the
chunk_sectors limit is the *stripe* size, not the per-drive stride. Beware of
the wording here to avoid confusion (this is all already super confusing!).

Well, at least, that is how I interpret the definition of minimum_io_size in
Documentation/ABI/stable/sysfs-block. But the wording "For RAID arrays it is
often the stripe chunk size." is super confusing. I am not entirely sure if
stride or stripe was meant here...

>> * /sys/block/<disk>/queue/optimal_io_size
>>
>>     Storage devices may report an optimal I/O size, which is
>>     the device's preferred unit for sustained I/O. This is rarely
>>     reported for disk drives. For RAID arrays it is usually the
>>     stripe width or the internal track size. A properly aligned
>>     multiple of optimal_io_size is the preferred request size for
>>     workloads where sustained throughput is desired. If no optimal
>>     I/O size is reported this file contains 0.
>>
>> Well, I find this definition not correct *at all*. It repeats the definition
>> of minimum_io_size (limits->io_min) and completely disregards any
>> optimal_io_size limit reported by the drives in the array. For a raid array,
>> this value should obviously be a multiple of minimum_io_size (the array
>> stripe size), but it can be much larger, since this should be an upper bound
>> for the IO size. read_ahead_kb being set using this value is thus not
>> correct, I think. read_ahead_kb should use max_sectors_kb, with alignment to
>> minimum_io_size.
>
> I think this is actually different from io_min, and io_opt for the different
> levels is not the same. For raid0, raid10 and raid456 (raid1 doesn't have a
> chunksize):
>
> - lim.io_min = mddev->chunk_sectors << 9;

See above. Given how confusing the definition of minimum_io_size is, I am not
sure that is correct. This code assumes that io_min is the stripe size and not
the stride size.

> - lim.io_opt = lim.io_min * (number of data copies);

I do not understand what you mean with "number of data copies"... There is no
data copy in a RAID 5/6 array.
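To put concrete numbers on these two formulas, here is a quick standalone
sketch (userspace, not the actual md code) of what io_min and io_opt work out
to for an 8-disk raid5 with a 64k chunk, assuming "number of data copies" is
meant to be the number of data disks:

#include <stdio.h>

int main(void)
{
	unsigned int chunk_sectors = 128;             /* 64k chunk = per-drive stride */
	unsigned int nr_disks = 8;
	unsigned int nr_data_disks = nr_disks - 1;    /* raid5: one disk worth of parity */

	unsigned int io_min = chunk_sectors << 9;     /* 65536 bytes */
	unsigned int io_opt = io_min * nr_data_disks; /* 458752 bytes = 448k */

	printf("io_min = %u bytes, io_opt = %u bytes\n", io_min, io_opt);
	return 0;
}

That is, for this configuration io_min ends up being the per-drive stride and
io_opt the full stripe.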
> And I think they do match the definition above, specifically:
>
> - a properly aligned multiple of io_min to *prevent a performance penalty*;

Yes.

> - a properly aligned multiple of io_opt to *get optimal performance*, the
>   number of data copies times the performance of a single disk;

That is how this field is defined for RAID, but that is far from what it means
for a single disk. It is unfortunate that it was defined like that. For a
single disk, io_opt is NOT about getting optimal performance. It is about an
upper bound for the IO size to NOT get a performance penalty (e.g. due to a
DMA mapping that is too large for what the IOMMU can handle). And for a RAID
array, it means that we should always have io_min == io_opt, but it seems that
the scsi code and the limit stacking code try to make this limit an upper
bound on the IO size, aligned to the stripe size.

> The original problem is that scsi disks report an unusual io_opt of 32767,
> and raid5 sets io_opt to 64k * 7 (8 disks with a 64k chunksize). The
> lcm_not_zero() from blk_stack_limits() ends up with a huge value:
>
> blk_stack_limits()
>     t->io_min = max(t->io_min, b->io_min);
>     t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);

I understand the "problem" that was stated. There is an overflow that results
in a large io_opt and a ridiculously large read_ahead_kb. io_opt being large
should, in my opinion, not be an issue in itself, since it should be an upper
bound on the IO size and not the stripe size (io_min indicates that).

>> read_ahead_kb should use max_sectors_kb, with alignment to minimum_io_size.
>
> The io_opt is used in a raid array as the minimal aligned size to get optimal
> IO performance, not as an upper bound. With respect to this, using this value
> for ra_pages makes sense. However, if scsi is using this value as an IO upper
> bound, then it is right that this doesn't make sense.

Here is your issue. People misunderstood optimal_io_size and used it instead
of the minimum_io_size/io_min limit for the granularity/alignment of IOs.
Using optimal_io_size as the "granularity" for optimal IOs that do not require
a read-modify-write of RAID stripes is simply wrong in my opinion.
io_min/minimum_io_size is the attribute indicating that.

As for read_ahead_kb, it should be bounded by io_opt (upper bound) but should
be initialized to a smaller value aligned to io_min (if io_opt is unreasonably
large).

Given all of that and how misused io_opt seems to be, I am not sure how to fix
this, though.

--
Damien Le Moal
Western Digital Research
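For reference, here is a quick standalone sketch (userspace, not the kernel
code) of the stacking arithmetic quoted above; lcm_nz() is a local stand-in
for the kernel's lcm_not_zero(), and I am assuming both io_opt values are
expressed in bytes:

#include <stdio.h>

static unsigned long gcd(unsigned long a, unsigned long b)
{
	while (b) {
		unsigned long t = a % b;
		a = b;
		b = t;
	}
	return a;
}

/* Local stand-in for the kernel's lcm_not_zero() */
static unsigned long lcm_nz(unsigned long a, unsigned long b)
{
	if (!a || !b)
		return a | b;
	return (a / gcd(a, b)) * b;
}

int main(void)
{
	unsigned long raid5_io_opt = 64UL * 1024 * 7; /* 458752: 64k chunk, 7 data disks */
	unsigned long disk_io_opt  = 32767;           /* as reported by the scsi disks */
	unsigned long stacked = lcm_nz(raid5_io_opt, disk_io_opt);

	/* Mirrors t->io_opt = lcm_not_zero(t->io_opt, b->io_opt) in blk_stack_limits() */
	printf("stacked io_opt = %lu bytes (~%lu MiB)\n", stacked, stacked >> 20);
	return 0;
}

gcd(458752, 32767) is 7, so the stacked io_opt ends up at 65536 * 32767 =
2147418112 bytes, i.e. just under 2 GiB, which is where the ridiculously
large read_ahead_kb comes from.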