Improper io_opt setting for md raid5

Hi Christoph,

I don't know who is the proper person to ask about this issue, but I see your
patch in the questionable code path, and I know you are always helpful, so I
am asking you here.

Let me describe the problem I encountered.
1. There is an 8-disk raid5 with 64K chunk size on my machine (so its native
io_opt should be one full stripe, 64KiB * 7 = 448KiB). I observe that
/sys/block/md0/queue/optimal_io_size is a very large value, which isn't a
reasonable size IMHO.

2. The large value comes from drivers/scsi/mpt3sas/mpt3sas_scsih.c,
11939 static const struct scsi_host_template mpt3sas_driver_template = {
11940         .module                         = THIS_MODULE,
11941         .name                           = "Fusion MPT SAS Host",
11942         .proc_name                      = MPT3SAS_DRIVER_NAME,
11943         .queuecommand                   = scsih_qcmd,
11944         .target_alloc                   = scsih_target_alloc,
11945         .sdev_init                      = scsih_sdev_init,
11946         .sdev_configure                 = scsih_sdev_configure,
11947         .target_destroy                 = scsih_target_destroy,
11948         .sdev_destroy                   = scsih_sdev_destroy,
11949         .scan_finished                  = scsih_scan_finished,
11950         .scan_start                     = scsih_scan_start,
11951         .change_queue_depth             = scsih_change_queue_depth,
11952         .eh_abort_handler               = scsih_abort,
11953         .eh_device_reset_handler        = scsih_dev_reset,
11954         .eh_target_reset_handler        = scsih_target_reset,
11955         .eh_host_reset_handler          = scsih_host_reset,
11956         .bios_param                     = scsih_bios_param,
11957         .can_queue                      = 1,
11958         .this_id                        = -1,
11959         .sg_tablesize                   = MPT3SAS_SG_DEPTH,
11960         .max_sectors                    = 32767,
11961         .max_segment_size               = 0xffffffff,
11962         .cmd_per_lun                    = 128,
11963         .shost_groups                   = mpt3sas_host_groups,
11964         .sdev_groups                    = mpt3sas_dev_groups,
11965         .track_queue_depth              = 1,
11966         .cmd_size                       = sizeof(struct scsiio_tracker),
11967         .map_queues                     = scsih_map_queues,
11968         .mq_poll                        = mpt3sas_blk_mq_poll,
11969 };
At line 11960, max_sectors of the mpt3sas driver is defined as 32767 (sectors,
i.e. just under 16 MiB).

Then in drivers/scsi/scsi_transport_sas.c, at line 241 inside sas_host_setup(),
shost->opt_sectors is assigned 32767 by the following code,
240         if (dma_dev->dma_mask) {
241                 shost->opt_sectors = min_t(unsigned int, shost->max_sectors,
242                                 dma_opt_mapping_size(dma_dev) >> SECTOR_SHIFT);
243         }
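
(Side note: since opt_sectors ends up being exactly max_sectors here,
dma_opt_mapping_size() on my machine must return more than 32767 sectors'
worth of bytes. Below is a minimal userspace sketch of that min_t()
arithmetic, just to make it concrete; the DMA value is an assumption on my
part, not taken from the kernel:)

#include <stdio.h>
#include <stdint.h>

#define SECTOR_SHIFT 9

int main(void)
{
	/* .max_sectors from mpt3sas_driver_template above */
	unsigned int max_sectors = 32767;
	/*
	 * Assumed value: dma_opt_mapping_size() must have returned more
	 * than 32767 << 9 bytes here, otherwise opt_sectors could not
	 * have ended up as 32767.
	 */
	uint64_t dma_opt_bytes = UINT64_MAX;

	uint64_t dma_opt_sectors = dma_opt_bytes >> SECTOR_SHIFT;
	unsigned int opt_sectors = max_sectors < dma_opt_sectors ?
				   max_sectors : (unsigned int)dma_opt_sectors;

	/* prints: opt_sectors = 32767 (16776704 bytes) */
	printf("opt_sectors = %u (%llu bytes)\n", opt_sectors,
	       (unsigned long long)opt_sectors << SECTOR_SHIFT);
	return 0;
}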

Then in drivers/scsi/sd.c, inside sd_revalidate_disk(), we have the following code,
3785         /*
3786          * Limit default to SCSI host optimal sector limit if set. There may be
3787          * an impact on performance for when the size of a request exceeds this
3788          * host limit.
3789          */
3790         lim.io_opt = sdp->host->opt_sectors << SECTOR_SHIFT;
3791         if (sd_validate_opt_xfer_size(sdkp, dev_max)) {
3792                 lim.io_opt = min_not_zero(lim.io_opt,
3793                                 logical_to_bytes(sdp, sdkp->opt_xfer_blocks));
3794         }

Because of the above code block, lim.io_opt of all my SATA disks attached to
the mpt3sas HBA ends up as 32767 sectors (16776704 bytes).

Then when my raid5 array stacks its queue limits: the array's own io_opt is
64KiB * 7 (458752 bytes), and each raid component SATA hard drive has an
io_opt of 32767 sectors, so by the calculation in
block/blk-settings.c:blk_stack_limits() at line 753,
753         t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);
the calculated optimal_io_size of my raid5 array is more than 1GiB. It is too
large.

I know the purpose of lcm_not_zero() is to get an I/O size that is optimal for
both the raid device and the underlying component devices, but here the
resulting io_opt is bigger than 1 GiB, which is way too big.
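
To make the arithmetic concrete, here is a small userspace re-implementation
of that lcm stacking (my own sketch of the math, not the kernel's
lcm_not_zero() code), fed with my values; it also includes the
max_sectors = 64 workaround I describe below:

#include <stdio.h>
#include <stdint.h>

static uint64_t gcd(uint64_t a, uint64_t b)
{
	while (b) {
		uint64_t t = a % b;
		a = b;
		b = t;
	}
	return a;
}

/* lcm of two non-zero values, same math the kernel uses when stacking */
static uint64_t lcm(uint64_t a, uint64_t b)
{
	return a / gcd(a, b) * b;
}

int main(void)
{
	uint64_t raid_io_opt    = 64ULL * 1024 * 7; /* one 448 KiB stripe  */
	uint64_t disk_io_opt    = 32767ULL * 512;   /* 32767 sectors, sd.c */
	uint64_t patched_io_opt = 64ULL * 512;      /* max_sectors = 64    */

	printf("stacked io_opt           = %llu bytes\n",
	       (unsigned long long)lcm(raid_io_opt, disk_io_opt));
	printf("stacked io_opt (patched) = %llu bytes\n",
	       (unsigned long long)lcm(raid_io_opt, patched_io_opt));
	return 0;
}

It prints 2147418112 bytes (~2 GiB) for the current code, but 458752 bytes
(exactly one stripe) once the per-disk io_opt is a power of two like 32 KiB.
The root of the problem is that 32767 = 7 * 31 * 151, which shares almost no
factors with 64KiB * 7.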

For me, I just feel uncomfortable with using max_sectors as opt_sectors in
sas_host_setup(), but I don't know a better way to improve it. Currently I
just modify mpt3sas_driver_template's max_sectors from 32767 to 64 (which, as
the sketch above shows, collapses the stacked io_opt back to exactly one
stripe), and observed a 5~10% sequential write performance improvement
(direct I/O) for my raid5 devices with fio.

So there should be something to fix here. Can you take a look, or give me some
hints on how to fix it?

Thanks in advance.

Coly Li

-- 
Coly Li