On 7/27/25 7:50 PM, Csordás Hunor wrote:
> Adding the SCSI maintainers because I believe the culprit is in
> drivers/scsi/sd.c, and Damien Le Moal because he has a pending patch
> modifying the relevant part and he might be interested in the
> implications.
> 
> On 7/15/2025 5:56 PM, Coly Li wrote:
>> Let me describe the problem I encountered.
>> 1, There is an 8-disk raid5 with 64K chunk size on my machine, and I
>> observe that /sys/block/md0/queue/optimal_io_size is a very large
>> value, which isn't a reasonable size IMHO.
> 
> I have come across the same problem after moving all 8 disks of a RAID6
> md array from two separate SATA controllers to an mpt3sas device. In my
> case, the readahead on the array became almost 4 GB:
> 
> # grep ^ /sys/block/{sda,md_helium}/queue/{optimal_io_size,read_ahead_kb}
> /sys/block/sda/queue/optimal_io_size:16773120
> /sys/block/sda/queue/read_ahead_kb:32760

For a SATA drive connected to an mpt3sas HBA, I see the same. But note
that the optimal_io_size here is completely made up by the HBA/driver,
because ATA does not advertise/define an optimal I/O size.

For SATA drives connected to AHCI SATA ports, I see:

/sys/block/sda/queue/optimal_io_size:0
/sys/block/sda/queue/read_ahead_kb:8192

read_ahead_kb in this case is twice max_sectors_kb (which with my patch
is now 4MB).

> /sys/block/md_helium/queue/optimal_io_size:4293918720
> /sys/block/md_helium/queue/read_ahead_kb:4192256
> 
> Note: the readahead is supposed to be twice the optimal I/O size (after
> a unit conversion). On the md array it isn't, because of an overflow in
> blk_apply_bdi_limits. This overflow is avoidable but basically
> irrelevant; however, it nicely highlights the fact that io_opt should
> really never get this large.

Only if io_opt is non-zero. If io_opt is zero, then read_ahead_kb by
default is twice max_sectors_kb.

>> 2, It was from drivers/scsi/mpt3sas/mpt3sas_scsih.c,
>> 11939 static const struct scsi_host_template mpt3sas_driver_template = {
>> ...
>> 11960         .max_sectors                    = 32767,
>> ...
>> 11969 };
>> at line 11960, max_sectors of the mpt3sas driver is defined as 32767.

This is another completely made-up value, since SCSI allows command
transfer lengths up to 4 GB (a 32-bit value in bytes). Even ATA drives
allow up to 65536 logical sectors per command (so 65536 * 4K = 256 MB
per command for 4K logical sector drives). Not sure why it is set to
this completely arbitrary value.

>> Then in drivers/scsi/scsi_transport_sas.c, at line 241 inside sas_host_setup(),
>> shost->opt_sectors is assigned 32767 by the following code,
>> 240         if (dma_dev->dma_mask) {
>> 241                 shost->opt_sectors = min_t(unsigned int, shost->max_sectors,
>> 242                                 dma_opt_mapping_size(dma_dev) >> SECTOR_SHIFT);
>> 243         }
>>
>> Then in drivers/scsi/sd.c, inside sd_revalidate_disk(), from the following code,
>> 3785         /*
>> 3786          * Limit default to SCSI host optimal sector limit if set. There may be
>> 3787          * an impact on performance for when the size of a request exceeds this
>> 3788          * host limit.
>> 3789          */
>> 3790         lim.io_opt = sdp->host->opt_sectors << SECTOR_SHIFT;
>> 3791         if (sd_validate_opt_xfer_size(sdkp, dev_max)) {
>> 3792                 lim.io_opt = min_not_zero(lim.io_opt,
>> 3793                                 logical_to_bytes(sdp, sdkp->opt_xfer_blocks));
>> 3794         }
>>
>> lim.io_opt of all my SATA disks attached to the mpt3sas HBA is 32767
>> sectors, because of the above code block.
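
As a side note, the sysfs numbers quoted above can be reproduced with a
small userspace sketch of that arithmetic (just an illustration of the
computation using the constants from this thread, not the kernel code;
the kernel's own expressions live in sd_revalidate_disk() and
blk_apply_bdi_limits()):

#include <stdio.h>

#define SECTOR_SHIFT	9
#define PAGE_SIZE	4096u

int main(void)
{
	/* mpt3sas shost->max_sectors, copied into shost->opt_sectors */
	unsigned int opt_sectors = 32767;
	unsigned int io_opt = opt_sectors << SECTOR_SHIFT;	/* 16776704 */

	/* rounded down to the 4K physical block size since 9c0ba14828d6 */
	io_opt -= io_opt % 4096;				/* 16773120 */

	/* readahead is twice io_opt; sysfs reports it in KiB */
	printf("sda: optimal_io_size=%u read_ahead_kb=%u\n",
	       io_opt, io_opt * 2 / PAGE_SIZE * 4);

	/*
	 * Same computation with the md array value: here the "* 2"
	 * overflows the 32-bit type, which is the blk_apply_bdi_limits
	 * overflow mentioned above.
	 */
	unsigned int md_io_opt = 4293918720u;
	printf("md:  optimal_io_size=%u read_ahead_kb=%u\n",
	       md_io_opt, md_io_opt * 2 / PAGE_SIZE * 4);
	return 0;
}

This prints 16773120 / 32760 for the disk and 4293918720 / 4192256 for
the array, matching the grep output above.
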
>>
>> Then when my raid5 array sets its queue limits, because its io_opt is
>> 64KiB*7, and the raid component SATA hard drives have an io_opt of
>> 32767 sectors, by the calculation in block/blk-settings.c:blk_stack_limits()
>> at line 753,
>> 753         t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);
>> the calculated optimal_io_size of my raid5 array is more than 1GiB. It
>> is too large.

md setting its io_opt to 64K * number of drives in the array is
strange... It does not have to be that large, since io_opt is an upper
bound and not an "issue IOs of that size for optimal performance" hint.
io_opt is simply a limit saying: if you exceed that IO size, performance
may suffer. So a default of stride size x number of drives for the
io_opt may be OK, but that should be bound to some reasonable value.
Furthermore, this is likely suboptimal. I would think that setting the
md array io_opt initially to min(all drives io_opt) x number of drives
would be a better default.

>> I know the purpose of lcm_not_zero() is to get an optimized IO size for
>> both the raid device and the underlying component devices, but the
>> resulting io_opt is bigger than 1 GiB, which is too big.
>>
>> For me, I just feel uncomfortable with using max_sectors as opt_sectors
>> in sas_host_setup(), but I don't know a better way to improve it.
>> Currently I just modify the mpt3sas_driver_template's max_sectors from
>> 32767 to 64, and observed a 5~10% sequential write performance
>> improvement (direct IO) for my raid5 device with fio.

> In my case, the impact was more noticeable. The system seemed to work
> surprisingly fine under light loads, but an increased number of
> parallel I/O operations completely tanked its performance until I
> set the readaheads to their expected values and gave the system some
> time to recover.
> 
> I came to the same conclusion as Coly Li: io_opt ultimately gets
> populated from shost->max_sectors, which (in the case of mpt3sas and
> several other SCSI controllers) contains a value which is both:
> - unnecessarily large for this purpose and, more importantly,
> - not a nice number free of large odd divisors, as blk_stack_limits
>   clearly expects.

Sounds to me like this is an md driver issue: md should tweak the limits
automatically calculated by stacking the limits of the array members.

> Populating io_opt from shost->max_sectors happens via
> shost->opt_sectors. This variable was introduced in commits
> 608128d391fa ("scsi: sd: allow max_sectors be capped at DMA optimal
> size limit") and 4cbfca5f7750 ("scsi: scsi_transport_sas: cap shost
> opt_sectors according to DMA optimal limit"). Despite the (in hindsight
> perhaps unfortunate) name, it wasn't used to set io_opt. It was optimal
> in a different sense: it was used as a (user-overridable) upper limit
> to max_sectors, constraining the size of requests to play nicely with
> an IOMMU, which might get slow with large mappings.
> 
> Commit 608128d391fa even mentions io_opt:
> 
>     It could be considered to have request queues io_opt value initially
>     set at Scsi_Host.opt_sectors in __scsi_init_queue(), but that is not
>     really the purpose of io_opt.
> 
> The last part is correct. shost->opt_sectors is an _upper_ bound on the
> size of requests, while io_opt is used both as a sort of _lower_ bound
> (in the form of readahead), and as a sort of indivisible "block size"
> for I/O (by blk_stack_limits). These two existing purposes may or may
> not already be too much for a single variable; adding a third one
> clearly doesn't work well.
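
To put a number on the stacking Coly described, here is a standalone
sketch of the lcm-based io_opt stacking (again only an illustration,
not the kernel code; it assumes his 64 KiB chunk, 7 data disks, and the
32767-sector member io_opt before the rounding discussed below):

#include <stdio.h>

static unsigned long gcd(unsigned long a, unsigned long b)
{
	while (b) {
		unsigned long t = b;

		b = a % b;
		a = t;
	}
	return a;
}

/* same idea as lib/math/lcm.c:lcm_not_zero() */
static unsigned long lcm_not_zero(unsigned long a, unsigned long b)
{
	if (!a || !b)
		return a ? a : b;
	return a / gcd(a, b) * b;
}

int main(void)
{
	unsigned int t = 7 * 65536;	/* raid5 io_opt: 64K chunk * 7 data disks */
	unsigned int b = 32767 * 512;	/* member io_opt derived from opt_sectors */
	int i;

	/* blk_stack_limits() is called once per member; t is a 32-bit field */
	for (i = 0; i < 8; i++)
		t = lcm_not_zero(t, b);

	printf("stacked io_opt = %u bytes (~%u MiB)\n", t, t >> 20);
	return 0;
}

With these inputs the result stabilizes at ~2 GiB, consistent with the
"more than 1GiB" figure above; Hunor's example below shows what happens
when the intermediate lcm no longer fits in 32 bits.
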
> It was commit a23634644afc ("block: take io_opt and io_min into account
> for max_sectors") which started setting io_opt from shost->opt_sectors.
> It did so to stop abusing max_user_sectors to set max_sectors from
> shost->opt_sectors, but it ended up misusing another variable for this
> purpose -- perhaps due to inadvertently conflating the two "optimal"
> transfer sizes, which are optimal in two very different contexts.
> 
> Interestingly, while I've verified that the increased values for io_opt
> and readahead on the actual disks definitely come from this commit
> (a23634644afc), the io_opt and readahead of the md array are unaffected
> until commit 9c0ba14828d6 ("blk-settings: round down io_opt to
> physical_block_size") due to a weird coincidence. This commit rounds
> io_opt down to the physical block size in blk_validate_limits. Without
> this commit, io_opt for the disks is 16776704, which looks even worse
> at first glance (512 * 32767 instead of 4096 * 4095). However, this
> ends up overflowing in a funny way when combined with the fact that
> blk_stack_limits (and thus lcm_not_zero) is called once per component
> device:
> 
> u32 t = 3145728;   // 3 MiB, the optimal I/O size for the array
> u32 b = 16776704;  // the (incorrect) optimal I/O size of the disks

It is not incorrect. It is a made-up value. For a SATA drive, reporting
0 would be the correct thing to do.

> u32 x = lcm(t, b); // x == (u32)103076069376 == 4291821568
> u32 y = lcm(x, b); // y == (u32)140630117318656 == t
> 
> Repeat for an even number of component devices to get the right answer
> from the wrong inputs by an incorrect method.
> 
> I'm sure the issue can be reproduced before commit 9c0ba14828d6
> (although I haven't actually tried -- if I had to, I'd start with an
> array with an odd number of component devices), but at the same time,
> the issue may still be present and hidden on some systems even after
> that commit (for example, the rounding does nothing if the physical
> block size is 512). This might help a little bit to explain why the
> problem doesn't seem more widespread.
> 
>> So there should be something to fix. Can you take a look, or give me
>> some hint on how to fix it?
>>
>> Thanks in advance.
>>
>> Coly Li
> 
> I would have loved to finish with a patch here, but I'm not sure what
> the correct fix is. shost->opt_sectors was clearly added for a reason
> and it should reach max_sectors in struct queue_limits in some way. It
> probably isn't included in max_hw_sectors because it's meant to be
> overridable. Apparently just setting max_sectors causes problems, and
> so does setting max_sectors and max_user_sectors. I don't know how to
> fix this correctly without introducing a new variable to struct
> queue_limits, but maybe people more familiar with the code can think of
> a less intrusive way.
> 
> Hunor Csordás

-- 
Damien Le Moal
Western Digital Research