On 7/28/2025 11:02 AM, Yu Kuai wrote:
> Hi,
>
> On 2025/07/28 15:44, Damien Le Moal wrote:
>> On 7/28/25 4:14 PM, Yu Kuai wrote:
>>>>> Going by git log, starting from commit 7e5f5fb09e6f ("block: Update topology
>>>>> documentation"), the documentation contains a special explanation for
>>>>> raid arrays, and the optimal_io_size entry says:
>>>>>
>>>>> For RAID arrays it is usually the
>>>>> stripe width or the internal track size. A properly aligned
>>>>> multiple of optimal_io_size is the preferred request size for
>>>>> workloads where sustained throughput is desired.
>>>>>
>>>>> And this explanation is exactly what raid5 does; it's important that
>>>>> the IO size is an aligned multiple of io_opt.
>>>>
>>>> Looking at the sysfs doc for the above fields, they are described as follows:
>>>>
>>>> * /sys/block/<disk>/queue/minimum_io_size
>>>>
>>>> [RO] Storage devices may report a granularity or preferred
>>>> minimum I/O size which is the smallest request the device can
>>>> perform without incurring a performance penalty. For disk
>>>> drives this is often the physical block size. For RAID arrays
>>>> it is often the stripe chunk size. A properly aligned multiple
>>>> of minimum_io_size is the preferred request size for workloads
>>>> where a high number of I/O operations is desired.
>>>>
>>>> So this matches the SCSI limit OPTIMAL TRANSFER LENGTH GRANULARITY and for a
>>>> RAID array, this indeed should be the stride x number of data disks.
>>>
>>> Do you mean stripe here? io_min for a raid array is always just one
>>> chunk size.
>>
>> My bad, yes, that is the definition in sysfs. So io_min is the stride size, where:
>>
>> stride size x number of data disks == stripe_size.
>>
> Yes.
>
>> Note that the chunk_sectors limit is the *stripe* size, not the per-drive stride.
>> Beware of the wording here to avoid confusion (this is all already super
>> confusing!).
>
> This is something we're not on the same page about :( For example, an 8-disk
> raid5 with the default chunk size. Then the above calculation is:
>
> 64k * 7 = 448k
>
> The chunk size I mean is 64k...
>>
>> Well, at least, that is how I interpret the io_min definition of
>> minimum_io_size in Documentation/ABI/stable/sysfs-block. But the wording "For
>> RAID arrays it is often the stripe chunk size." is super confusing. Not
>> entirely sure if stride or stripe was meant here...
>>
>
> Hope it's clear now.
>>
>>>> * /sys/block/<disk>/queue/optimal_io_size
>>>>
>>>> Storage devices may report an optimal I/O size, which is
>>>> the device's preferred unit for sustained I/O. This is rarely
>>>> reported for disk drives. For RAID arrays it is usually the
>>>> stripe width or the internal track size. A properly aligned
>>>> multiple of optimal_io_size is the preferred request size for
>>>> workloads where sustained throughput is desired. If no optimal
>>>> I/O size is reported this file contains 0.
>>>>
>>>> Well, I find this definition not correct *at all*. This is repeating the
>>>> definition of minimum_io_size (limits->io_min) and completely disregards the
>>>> eventual optimal_io_size limit of the drives in the array. For a raid array,
>>>> this value should obviously be a multiple of minimum_io_size (the array stripe
>>>> size), but it can be much larger, since this should be an upper bound for IO
>>>> size. read_ahead_kb being set using this value is thus not correct, I think.
>>>> read_ahead_kb should use max_sectors_kb, with alignment to minimum_io_size.
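(As I understand the example above, just to keep the terminology straight for
the rest of this mail: the chunk, i.e. the per-disk stride, is 64 KiB; a stripe
has 8 - 1 = 7 data chunks, so the stripe width is 7 * 64 KiB = 448 KiB; and md
currently exports the chunk as minimum_io_size and the stripe width as
optimal_io_size.)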
>>>
>>> I think this is actually different from io_min, and io_opt for different
>>> levels is not the same; for raid0, raid10, raid456 (raid1 doesn't have a
>>> chunk size):
>>> - lim.io_min = mddev->chunk_sectors << 9;
>
> By the above example, io_min = 64k and io_opt = 448k. And to make sure
> we're on the same page, io_min is the *stride* and io_opt is the
> *stripe*.
>>
>> See above. Given how confusing the definition of minimum_io_size is, not sure
>> that is correct. This code assumes that io_min is the stripe size and not the
>> stride size.
>>
>>> - lim.io_opt = lim.io_min * (number of data copies);
>>
>> I do not understand what you mean by "number of data copies"... There is no
>> data copy in a RAID 5/6 array.
>
> Yes, this is my bad, *data disks* is the better word.
>>
>>> And I think they do match the definition above, specifically:
>>> - a properly aligned multiple of io_min to *prevent a performance penalty*;
>>
>> Yes.
>>
>>> - a properly aligned multiple of io_opt to *get optimal performance*, the
>>> number of data copies times the performance of a single disk;
>>
>> That is how this field is defined for RAID, but that is far from what it means
>> for a single disk. It is unfortunate that it was defined like that.
>>
>> For a single disk, io_opt is NOT about getting optimal performance. It is about
>> an upper bound for the IO size to NOT get a performance penalty (e.g. due to a
>> DMA mapping that is too large for what the IOMMU can handle).
>
> The name itself is misleading. :( I didn't know this definition until
> now.
>
>>
>> And for a RAID array, it means that we should always have io_min == io_opt, but
>> it seems that the scsi code and limit stacking code try to make this limit an
>> upper bound on the IO size, aligned to the stripe size.
>>
>>> The original problem is that scsi disks report an unusual io_opt of 32767,
>>> and raid5 sets io_opt to 64k * 7 (8 disks with a 64k chunk size). The
>>> lcm_not_zero() from blk_stack_limits() ends up with a huge value:
>>>
>>> blk_stack_limits()
>>> t->io_min = max(t->io_min, b->io_min);
>>> t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);
>>
>> I understand the "problem" that was stated. There is an overflow that results in
>> a large io_opt and a ridiculously large read_ahead_kb.
>> io_opt being large should in my opinion not be an issue in itself, since it
>> should be an upper bound on IO size and not the stripe size (io_min indicates
>> that).
>>
>>>> read_ahead_kb should use max_sectors_kb, with alignment to minimum_io_size.
>>>
>>> io_opt is used in raid arrays as the minimal aligned size to get optimal
>>> IO performance, not as an upper bound. With respect to this, using this
>>> value for ra_pages makes sense. However, if scsi is using this value as
>>> an IO upper bound, you're right that this doesn't make sense.
>>
>> Here is your issue. People misunderstood optimal_io_size and used that instead
>> of using the minimal_io_size/io_min limit for the granularity/alignment of IOs.
>> Using optimal_io_size as the "granularity" for optimal IOs that do not require
>> read-modify-write of RAID stripes is simply wrong in my opinion.
>> io_min/minimal_io_size is the attribute indicating that.

I'm not familiar enough with all the code using io_opt to be certain, but I
think it's a little bit the other way around. The problem definitely seems to
be that we want to use one variable for multiple different purposes. io_opt in
struct queue_limits is the "optimal I/O size", but "optimal" can mean many
things in different contexts.
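As a toy illustration of that "one variable, multiple purposes" problem (made-up
helper names, nothing like the actual kernel code), the very same io_opt number
asks for opposite behaviour depending on which reading a consumer assumes:

#include <stdio.h>

/* md-style reading: carve I/O into aligned multiples of io_opt
 * (how much of a request forms whole raid5 stripes). */
static unsigned int round_down_to_opt(unsigned int len, unsigned int io_opt)
{
	return len / io_opt * io_opt;
}

/* sd-style reading: io_opt is an upper bound, never exceed it per request. */
static unsigned int clamp_to_opt(unsigned int len, unsigned int io_opt)
{
	return len < io_opt ? len : io_opt;
}

int main(void)
{
	unsigned int io_opt = 7 * 64 * 1024;	/* 448 KiB raid5 stripe width */
	unsigned int len = 1024 * 1024;		/* a 1 MiB request */

	printf("aligned-multiple reading: issue %u bytes\n",
	       round_down_to_opt(len, io_opt));
	printf("upper-bound reading:      issue at most %u bytes\n",
	       clamp_to_opt(len, io_opt));
	return 0;
}

Both helpers "use the optimal I/O size", yet they pull in opposite directions;
that is basically the conflict the rest of this mail is about.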
In general, if I want to do some I/O, I can say that the optimal way to do it
is to make I/O requests that satisfy some condition. The condition can be many
things:

- The request size should be at least X (combining two small requests into a
  big one may be faster, or extending a small request into a bigger one may
  avoid having to do another request later).
- The request size should be at most X (for example, because DMA is
  inefficient on this system with request sizes that are too large -- this is
  the _only_ thing that shost->opt_sectors is currently set for).
- The request size should be a multiple of X, _and_ the request should have an
  alignment of X (this is what a striped md array wants).

And, of course, there can be many other "optimality" conditions. The ones
listed above all have a parameter (X), which can, in the context of that
condition, be called the "optimal I/O size". These parameters, however, can be
very different for each condition; the right parameter for one condition can
be completely inappropriate for another.

The SCSI standard may have a definition for "optimal transfer length", but
io_opt in struct queue_limits seems to have been used for a different purpose
since its introduction in 2009:

- It was introduced in commit c72758f33784 ("block: Export I/O topology for
  block devices and partitions"). The commit message specifically mentions its
  use by md:

      The io_opt characteristic indicates the optimal I/O size reported by
      the device. This is usually the stripe width for arrays.

- It actually started being set by md in commit 8f6c2e4b325a ("md: Use new
  topology calls to indicate alignment and I/O sizes"). The commit date is
  more than a month later than that of the previous commit, but as far as I
  can see, they were originally posted in the same patch series:
  https://lore.kernel.org/all/20090424123054.GA1926@xxxxxxxxxxxxxxxx/T/#u
  In the context of that patch series, md was literally the first user of
  io_opt.

- Using the lowest common multiple of the component devices and the array to
  calculate the final io_opt of the array happened in commit 70dd5bf3b999
  ("block: Stack optimal I/O size"), still in 2009.

It wasn't until commit a23634644afc ("block: take io_opt and io_min into
account for max_sectors") in 2024 that sd_revalidate_disk started to set
io_opt from what, in the context of the SCSI standard, should be called the
optimal I/O size. However, in doing so, it broke md arrays. With my setup,
this was hidden until commit 9c0ba14828d6 ("blk-settings: round down io_opt to
physical_block_size"), but nonetheless, it happened here.
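To make the breakage concrete, here is a rough user-space sketch (not the
kernel code, and assuming the 32767 reported by the disks means 32767 logical
blocks of 512 bytes) of what the lcm_not_zero() stacking quoted earlier does
with the numbers from this thread:

#include <stdio.h>

static unsigned long gcd(unsigned long a, unsigned long b)
{
	while (b) {
		unsigned long t = a % b;
		a = b;
		b = t;
	}
	return a;
}

int main(void)
{
	unsigned long array_io_opt = 7UL * 64 * 1024;	/* 458752 bytes   */
	unsigned long disk_io_opt = 32767UL * 512;	/* 16776704 bytes */
	unsigned long stacked;

	/* lcm(a, b) = a / gcd(a, b) * b */
	stacked = array_io_opt / gcd(array_io_opt, disk_io_opt) * disk_io_opt;

	/* prints 2147418112, i.e. just short of 2 GiB */
	printf("stacked io_opt = %lu bytes\n", stacked);
	return 0;
}

Each input value is individually sane, but the combined io_opt of almost 2 GiB,
and the read-ahead derived from it, is clearly not a useful "optimal I/O size"
for the array.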
> Ok, looks like there are two problems now:
>
> a) io_min, the size to prevent a performance penalty:
>
> 1) For raid5, to avoid read-modify-write, this value should be 448k,
> but now it's 64k;
> 2) For raid0/raid10, this value is set to 64k now; however, this value
> should not be set. If the value in the member disks is 4k, issuing 4k is
> just fine, there won't be any performance penalty;
> 3) For raid1, this value is not set and the member disks' value is used;
> this is correct.
>
> b) io_opt, the size to ???
>
> 4) For raid0/raid10/raid5, this value is set to the minimal IO size to
> get the best performance.
> 5) For raid1, this value is not set and the member disks' value is used.
>
> Problem a can be fixed easily, but for problem b, I'm not sure how to
> fix it; it depends on what we think io_opt is.
>
> If io_opt should be an *upper bound*, problem 4) should be fixed like case
> 5), and other places like blk_apply_bdi_limits() setting ra_pages from
> io_opt should be fixed as well.
>
> If io_opt should be the *minimal IO size to get the best performance*,
> problem 5) should be fixed like case 4), and I don't know if scsi or
> other drivers that set the initial io_opt should be changed. :(
>
> Thanks,
> Kuai
>
>>
>> As for read_ahead_kb, it should be bounded by io_opt (upper bound) but should
>> be initialized to a smaller value aligned to io_min (if io_opt is unreasonably
>> large).
>>
>> Given all of that and how misused io_opt seems to be, I am not sure how to fix
>> this though.
>>
>
Hunor Csordás