Adding the SCSI maintainers because I believe the culprit is in
drivers/scsi/sd.c, and Damien Le Moal because he has a pending patch
modifying the relevant part and he might be interested in the implications.

On 7/15/2025 5:56 PM, Coly Li wrote:
> Let me rescript the problem I encountered.
> 1, There is an 8 disks raid5 with 64K chunk size on my machine, I observe
> /sys/block/md0/queue/optimal_io_size is very large value, which isn’t
> reasonable size IMHO.

I have come across the same problem after moving all 8 disks of a RAID6 md
array from two separate SATA controllers to an mpt3sas device. In my case,
the readahead on the array became almost 4 GB:

# grep ^ /sys/block/{sda,md_helium}/queue/{optimal_io_size,read_ahead_kb}
/sys/block/sda/queue/optimal_io_size:16773120
/sys/block/sda/queue/read_ahead_kb:32760
/sys/block/md_helium/queue/optimal_io_size:4293918720
/sys/block/md_helium/queue/read_ahead_kb:4192256

Note: the readahead is supposed to be twice the optimal I/O size (after a
unit conversion). On the md array it isn't, because of an overflow in
blk_apply_bdi_limits (sketched after the quoted report below). This
overflow is avoidable but basically irrelevant; however, it nicely
highlights the fact that io_opt should really never get this large.

> 2, It was from drivers/scsi/mpt3sas/mpt3sas_scsih.c,
> 11939 static const struct scsi_host_template mpt3sas_driver_template = {
...
> 11960         .max_sectors                    = 32767,
...
> 11969 };
> at line 11960, max_sectors of mpt3sas driver is defined as 32767.
>
> Then in drivers/scsi/scsi_transport_sas.c, at line 241 inside sas_host_setup(),
> shots->opt_sectors is assigned by 32767 from the following code,
> 240         if (dma_dev->dma_mask) {
> 241                 shost->opt_sectors = min_t(unsigned int, shost->max_sectors,
> 242                                 dma_opt_mapping_size(dma_dev) >> SECTOR_SHIFT);
> 243         }
>
> Then in drivers/scsi/sd.c, inside sd_revalidate_disk() from the following coce,
> 3785         /*
> 3786          * Limit default to SCSI host optimal sector limit if set. There may be
> 3787          * an impact on performance for when the size of a request exceeds this
> 3788          * host limit.
> 3789          */
> 3790         lim.io_opt = sdp->host->opt_sectors << SECTOR_SHIFT;
> 3791         if (sd_validate_opt_xfer_size(sdkp, dev_max)) {
> 3792                 lim.io_opt = min_not_zero(lim.io_opt,
> 3793                         logical_to_bytes(sdp, sdkp->opt_xfer_blocks));
> 3794         }
>
> lim.io_opt of all my sata disks attached to mpt3sas HBA are all 32767 sectors,
> because the above code block.
>
> Then when my raid5 array sets its queue limits, because its io_opt is 64KiB*7,
> and the raid component sata hard drive has io_opt with 32767 sectors, by
> calculation in block/blk-setting.c:blk_stack_limits() at line 753,
> 753         t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);
> the calculated opt_io_size of my raid5 array is more than 1GiB. It is too large.
>
> I know the purpose of lcm_not_zero() is to get an optimized io size for both
> raid device and underlying component devices, but the resulted io_opt is bigger
> than 1 GiB that's too big.
>
> For me, I just feel uncomfortable that using max_sectors as opt_sectors in
> sas_host_stup(), but I don't know a better way to improve. Currently I just
> modify the mpt3sas_driver_template's max_sectors from 32767 to 64, and observed
> 5~10% sequetial write performance improvement (direct io) for my raid5 devices
> by fio.

In my case, the impact was more noticeable. The system seemed to work
surprisingly fine under light loads, but an increased number of parallel I/O
operations completely tanked its performance until I set the readaheads to
their expected values and gave the system some time to recover.
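
As a side note, the overflow mentioned above is plain 32-bit arithmetic and
easy to reproduce in userspace. The sketch below assumes the readahead is
derived from io_opt roughly as io_opt * 2 / PAGE_SIZE with 32-bit operands
(my paraphrase of blk_apply_bdi_limits, not the literal kernel code); the
helper names and the 4 KiB page size are my assumptions. The numbers match
the sysfs output above:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096u		/* assumption: 4 KiB pages */

/* ra_kb_u32() mimics the 32-bit arithmetic, ra_kb_u64() what was intended. */
static uint32_t ra_kb_u32(uint32_t io_opt)
{
	return io_opt * 2 / PAGE_SIZE * 4;	/* io_opt * 2 wraps above 2 GiB */
}

static uint64_t ra_kb_u64(uint32_t io_opt)
{
	return (uint64_t)io_opt * 2 / PAGE_SIZE * 4;
}

int main(void)
{
	/* sda: io_opt = 4096 * 4095, no overflow, 32760 kB either way */
	printf("sda:       %u vs %llu kB\n", ra_kb_u32(16773120u),
	       (unsigned long long)ra_kb_u64(16773120u));

	/* md_helium: io_opt = 4096 * 4095 * 256, the doubling wraps around,
	 * giving 4192256 kB instead of the intended 8386560 kB */
	printf("md_helium: %u vs %llu kB\n", ra_kb_u32(4293918720u),
	       (unsigned long long)ra_kb_u64(4293918720u));
	return 0;
}

The doubling wraps around once io_opt exceeds 2 GiB, which is why the
array's readahead came out at roughly half of the expected value.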
I came to the same conclusion as Coly Li: io_opt ultimately gets populated
from shost->max_sectors, which (in the case of mpt3sas and several other
SCSI controllers) contains a value that is both:

- unnecessarily large for this purpose and, more importantly,
- not a "nice" number free of large odd divisors, which is what
  blk_stack_limits clearly expects.

Populating io_opt from shost->max_sectors happens via shost->opt_sectors.
This variable was introduced in commits 608128d391fa ("scsi: sd: allow
max_sectors be capped at DMA optimal size limit") and 4cbfca5f7750 ("scsi:
scsi_transport_sas: cap shost opt_sectors according to DMA optimal limit").
Despite the (in hindsight perhaps unfortunate) name, it wasn't used to set
io_opt. It was "optimal" in a different sense: it served as a
(user-overridable) upper limit on max_sectors, constraining the size of
requests to play nicely with an IOMMU, which can get slow with large
mappings.

Commit 608128d391fa even mentions io_opt:

    It could be considered to have request queues io_opt value initially
    set at Scsi_Host.opt_sectors in __scsi_init_queue(), but that is not
    really the purpose of io_opt.

The last part is correct. shost->opt_sectors is an _upper_ bound on the size
of requests, while io_opt is used both as a sort of _lower_ bound (in the
form of readahead) and as a sort of indivisible "block size" for I/O (by
blk_stack_limits). These two existing purposes may or may not already be too
much for a single variable; adding a third one clearly doesn't work well.

It was commit a23634644afc ("block: take io_opt and io_min into account for
max_sectors") that started setting io_opt from shost->opt_sectors. It did so
to stop abusing max_user_sectors for capping max_sectors at
shost->opt_sectors, but it ended up misusing another variable for the same
purpose -- perhaps due to inadvertently conflating the two "optimal"
transfer sizes, which are optimal in two very different contexts.

Interestingly, while I have verified that the increased values of io_opt and
readahead on the actual disks definitely come from this commit
(a23634644afc), the io_opt and readahead of the md array are unaffected
until commit 9c0ba14828d6 ("blk-settings: round down io_opt to
physical_block_size"), due to a weird coincidence. That commit rounds io_opt
down to the physical block size in blk_validate_limits. Without it, io_opt
for the disks is 16776704, which looks even worse at first glance
(512 * 32767 instead of 4096 * 4095). However, it ends up overflowing in a
funny way when combined with the fact that blk_stack_limits (and thus
lcm_not_zero) is called once per component device:

    u32 t = 3145728;    // 3 MiB, the optimal I/O size for the array
    u32 b = 16776704;   // the (incorrect) optimal I/O size of the disks
    u32 x = lcm(t, b);  // x == (u32)103076069376 == 4291821568
    u32 y = lcm(x, b);  // y == (u32)140630117318656 == t

Repeat for an even number of component devices to get the right answer from
the wrong inputs by an incorrect method.
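
If it helps, the coincidence can be reproduced with a self-contained
userspace sketch. It assumes that lcm() evaluates (a / gcd(a, b)) * b in
64-bit unsigned long and that the result is only truncated when it is
stored back into the 32-bit t->io_opt field on each blk_stack_limits()
call; the gcd helper below is mine, not the kernel's:

#include <stdint.h>
#include <stdio.h>

static uint64_t gcd(uint64_t a, uint64_t b)
{
	while (b) {
		uint64_t r = a % b;

		a = b;
		b = r;
	}
	return a;
}

int main(void)
{
	const uint32_t b = 16776704;	/* 512 * 32767, the bogus per-disk io_opt */
	uint32_t t = 3145728;		/* the array's own optimal I/O size */

	/* blk_stack_limits() is called once per component device. */
	for (int disk = 1; disk <= 8; disk++) {
		uint64_t l = ((uint64_t)t / gcd(t, b)) * b;

		t = (uint32_t)l;	/* truncated on assignment to t->io_opt */
		printf("after disk %d: io_opt = %u\n", disk, t);
	}

	/* Odd steps print 4291821568, even steps fold back to 3145728, so an
	 * even number of component devices happens to end on the correct value. */
	return 0;
}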
I'm sure the issue can be reproduced before commit 9c0ba14828d6 (although I
haven't actually tried -- if I had to, I'd start with an array with an odd
number of component devices), but at the same time, the issue may still be
present and hidden on some systems even after that commit (for example, the
rounding does nothing if the physical block size is 512). This might help a
little to explain why the problem doesn't seem more widespread.

> So there should be something to fix. Can you take a look, or give me some hint
> to fix?
>
> Thanks in advance.
>
> Coly Li

I would have loved to finish with a patch here, but I'm not sure what the
correct fix is. shost->opt_sectors was clearly added for a reason, and it
should reach max_sectors in struct queue_limits in some way. It probably
isn't included in max_hw_sectors because it's meant to be overridable.
Apparently just setting max_sectors causes problems, and so does setting
both max_sectors and max_user_sectors. I don't know how to fix this
correctly without introducing a new variable to struct queue_limits, but
maybe people more familiar with the code can think of a less intrusive way.

Hunor Csordás