Adding the SCSI maintainers because I believe the culprit is in
drivers/scsi/sd.c, and Damien Le Moal because he has a pending patch
modifying the relevant part and he might be interested in the implications.

On 7/15/2025 5:56 PM, Coly Li wrote:
> Let me rescript the problem I encountered.
> 1, There is an 8 disks raid5 with 64K chunk size on my machine, I observe
> /sys/block/md0/queue/optimal_io_size is very large value, which isn’t
> reasonable size IMHO.

I have come across the same problem after moving all 8 disks of a RAID6 md
array from two separate SATA controllers to an mpt3sas device. In my case,
the readahead on the array became almost 4 GB:

# grep ^ /sys/block/{sda,md_helium}/queue/{optimal_io_size,read_ahead_kb}
/sys/block/sda/queue/optimal_io_size:16773120
/sys/block/sda/queue/read_ahead_kb:32760
/sys/block/md_helium/queue/optimal_io_size:4293918720
/sys/block/md_helium/queue/read_ahead_kb:4192256

Note: the readahead is supposed to be twice the optimal I/O size (after a
unit conversion). On the md array it isn't, because of an overflow in
blk_apply_bdi_limits (sketched after the quoted report below). This
overflow is avoidable but basically irrelevant; however, it nicely
highlights the fact that io_opt should really never get this large.

> 2, It was from drivers/scsi/mpt3sas/mpt3sas_scsih.c,
> 11939 static const struct scsi_host_template mpt3sas_driver_template = {
...
> 11960         .max_sectors                    = 32767,
...
> 11969 };
> at line 11960, max_sectors of mpt3sas driver is defined as 32767.
>
> Then in drivers/scsi/scsi_transport_sas.c, at line 241 inside sas_host_setup(),
> shots->opt_sectors is assigned by 32767 from the following code,
> 240         if (dma_dev->dma_mask) {
> 241                 shost->opt_sectors = min_t(unsigned int, shost->max_sectors,
> 242                                 dma_opt_mapping_size(dma_dev) >> SECTOR_SHIFT);
> 243         }
>
> Then in drivers/scsi/sd.c, inside sd_revalidate_disk() from the following coce,
> 3785         /*
> 3786          * Limit default to SCSI host optimal sector limit if set. There may be
> 3787          * an impact on performance for when the size of a request exceeds this
> 3788          * host limit.
> 3789          */
> 3790         lim.io_opt = sdp->host->opt_sectors << SECTOR_SHIFT;
> 3791         if (sd_validate_opt_xfer_size(sdkp, dev_max)) {
> 3792                 lim.io_opt = min_not_zero(lim.io_opt,
> 3793                         logical_to_bytes(sdp, sdkp->opt_xfer_blocks));
> 3794         }
>
> lim.io_opt of all my sata disks attached to mpt3sas HBA are all 32767 sectors,
> because the above code block.
>
> Then when my raid5 array sets its queue limits, because its io_opt is 64KiB*7,
> and the raid component sata hard drive has io_opt with 32767 sectors, by
> calculation in block/blk-setting.c:blk_stack_limits() at line 753,
> 753         t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);
> the calculated opt_io_size of my raid5 array is more than 1GiB. It is too large.
>
> I know the purpose of lcm_not_zero() is to get an optimized io size for both
> raid device and underlying component devices, but the resulted io_opt is bigger
> than 1 GiB that's too big.
>
> For me, I just feel uncomfortable that using max_sectors as opt_sectors in
> sas_host_stup(), but I don't know a better way to improve. Currently I just
> modify the mpt3sas_driver_template's max_sectors from 32767 to 64, and observed
> 5~10% sequetial write performance improvement (direct io) for my raid5 devices
> by fio.

In my case, the impact was more noticeable. The system seemed to work
surprisingly fine under light loads, but an increased number of parallel I/O
operations completely tanked its performance until I set the readaheads to
their expected values and gave the system some time to recover.
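
As a side note, the overflow mentioned above is plain 32-bit arithmetic and
easy to reproduce in userspace. The sketch below assumes the readahead is
derived from io_opt roughly as io_opt * 2 / PAGE_SIZE with 32-bit operands
(my paraphrase of blk_apply_bdi_limits, not the literal kernel code); the
helper names and the 4 KiB page size are my assumptions. The numbers match
the sysfs output above:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096u		/* assumption: 4 KiB pages */

/* ra_kb_u32() mimics the 32-bit arithmetic, ra_kb_u64() what was intended. */
static uint32_t ra_kb_u32(uint32_t io_opt)
{
	return io_opt * 2 / PAGE_SIZE * 4;	/* io_opt * 2 wraps above 2 GiB */
}

static uint64_t ra_kb_u64(uint32_t io_opt)
{
	return (uint64_t)io_opt * 2 / PAGE_SIZE * 4;
}

int main(void)
{
	/* sda: io_opt = 4096 * 4095, no overflow, 32760 kB either way */
	printf("sda:       %u vs %llu kB\n", ra_kb_u32(16773120u),
	       (unsigned long long)ra_kb_u64(16773120u));

	/* md_helium: io_opt = 4096 * 4095 * 256, the doubling wraps around,
	 * giving 4192256 kB instead of the intended 8386560 kB */
	printf("md_helium: %u vs %llu kB\n", ra_kb_u32(4293918720u),
	       (unsigned long long)ra_kb_u64(4293918720u));
	return 0;
}

The doubling wraps around once io_opt exceeds 2 GiB, which is why the
array's readahead came out at roughly half of the expected value.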
I came to the same conclusion as Coly Li: io_opt ultimately gets populated
from shost->max_sectors, which (in the case of mpt3sas and several other
SCSI controllers) contains a value that is both:

- unnecessarily large for this purpose and, more importantly,
- not a "nice" number free of large odd divisors, which is what
  blk_stack_limits clearly expects.

Populating io_opt from shost->max_sectors happens via shost->opt_sectors.
This variable was introduced in commits 608128d391fa ("scsi: sd: allow
max_sectors be capped at DMA optimal size limit") and 4cbfca5f7750 ("scsi:
scsi_transport_sas: cap shost opt_sectors according to DMA optimal limit").
Despite the (in hindsight perhaps unfortunate) name, it wasn't used to set
io_opt. It was "optimal" in a different sense: it served as a
(user-overridable) upper limit on max_sectors, constraining the size of
requests to play nicely with an IOMMU, which can get slow with large
mappings.

Commit 608128d391fa even mentions io_opt:

    It could be considered to have request queues io_opt value initially
    set at Scsi_Host.opt_sectors in __scsi_init_queue(), but that is not
    really the purpose of io_opt.

The last part is correct. shost->opt_sectors is an _upper_ bound on the size
of requests, while io_opt is used both as a sort of _lower_ bound (in the
form of readahead) and as a sort of indivisible "block size" for I/O (by
blk_stack_limits). These two existing purposes may or may not already be too
much for a single variable; adding a third one clearly doesn't work well.

It was commit a23634644afc ("block: take io_opt and io_min into account for
max_sectors") that started setting io_opt from shost->opt_sectors. It did so
to stop abusing max_user_sectors for capping max_sectors at
shost->opt_sectors, but it ended up misusing another variable for the same
purpose -- perhaps due to inadvertently conflating the two "optimal"
transfer sizes, which are optimal in two very different contexts.

Interestingly, while I have verified that the increased values of io_opt and
readahead on the actual disks definitely come from this commit
(a23634644afc), the io_opt and readahead of the md array are unaffected
until commit 9c0ba14828d6 ("blk-settings: round down io_opt to
physical_block_size"), due to a weird coincidence. That commit rounds io_opt
down to the physical block size in blk_validate_limits. Without it, io_opt
for the disks is 16776704, which looks even worse at first glance
(512 * 32767 instead of 4096 * 4095). However, it ends up overflowing in a
funny way when combined with the fact that blk_stack_limits (and thus
lcm_not_zero) is called once per component device:

    u32 t = 3145728;    // 3 MiB, the optimal I/O size for the array
    u32 b = 16776704;   // the (incorrect) optimal I/O size of the disks
    u32 x = lcm(t, b);  // x == (u32)103076069376 == 4291821568
    u32 y = lcm(x, b);  // y == (u32)140630117318656 == t

Repeat for an even number of component devices to get the right answer from
the wrong inputs by an incorrect method.
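
If it helps, the coincidence can be reproduced with a self-contained
userspace sketch. It assumes that lcm() evaluates (a / gcd(a, b)) * b in
64-bit unsigned long and that the result is only truncated when it is
stored back into the 32-bit t->io_opt field on each blk_stack_limits()
call; the gcd helper below is mine, not the kernel's:

#include <stdint.h>
#include <stdio.h>

static uint64_t gcd(uint64_t a, uint64_t b)
{
	while (b) {
		uint64_t r = a % b;

		a = b;
		b = r;
	}
	return a;
}

int main(void)
{
	const uint32_t b = 16776704;	/* 512 * 32767, the bogus per-disk io_opt */
	uint32_t t = 3145728;		/* the array's own optimal I/O size */

	/* blk_stack_limits() is called once per component device. */
	for (int disk = 1; disk <= 8; disk++) {
		uint64_t l = ((uint64_t)t / gcd(t, b)) * b;

		t = (uint32_t)l;	/* truncated on assignment to t->io_opt */
		printf("after disk %d: io_opt = %u\n", disk, t);
	}

	/* Odd steps print 4291821568, even steps fold back to 3145728, so an
	 * even number of component devices happens to end on the correct value. */
	return 0;
}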
I'm sure the issue can be reproduced before commit 9c0ba14828d6 (although I
haven't actually tried -- if I had to, I'd start with an array with an odd
number of component devices), but at the same time, the issue may still be
present and hidden on some systems even after that commit (for example, the
rounding does nothing if the physical block size is 512). This might help a
little to explain why the problem doesn't seem more widespread.

> So there should be something to fix. Can you take a look, or give me some hint
> to fix?
>
> Thanks in advance.
>
> Coly Li

I would have loved to finish with a patch here, but I'm not sure what the
correct fix is. shost->opt_sectors was clearly added for a reason, and it
should reach max_sectors in struct queue_limits in some way. It probably
isn't included in max_hw_sectors because it's meant to be overridable.
Apparently just setting max_sectors causes problems, and so does setting
both max_sectors and max_user_sectors. I don't know how to fix this
correctly without introducing a new variable to struct queue_limits, but
maybe people more familiar with the code can think of a less intrusive way.

Hunor Csordás