Hi,
On 2025/07/28 10:41, Damien Le Moal wrote:
> On 7/28/25 9:55 AM, Yu Kuai wrote:
>> Hi,
>>
>> On 2025/07/28 8:39, Damien Le Moal wrote:
>>> md setting its io_opt to 64K * number of drives in the array is strange... It
>>> does not have to be that large, since io_opt is an upper bound and not an
>>> "issue IOs of this size for optimal performance" hint. io_opt is simply a
>>> limit saying: if you exceed this IO size, performance may suffer.
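
(Side note, just to keep the terminology concrete: io_opt is only a hint that
the block layer exports to userspace, e.g. through
/sys/block/<dev>/queue/optimal_io_size or the BLKIOOPT ioctl. A minimal sketch
of reading it, nothing raid specific:)

/* Minimal sketch: read the io_opt hint of a block device from
 * userspace via the BLKIOOPT ioctl. Error handling kept short. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
	unsigned int io_opt = 0;
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s /dev/mdX\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || ioctl(fd, BLKIOOPT, &io_opt) < 0) {
		perror(argv[1]);
		return 1;
	}
	printf("optimal_io_size: %u bytes\n", io_opt);
	close(fd);
	return 0;
}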
>> At least according to the documentation, for RAID arrays a multiple of io_opt
>> is the preferred IO size for optimal IO performance, and for raid5 this is
>> chunksize * data disks.
>>> So a default of stride size x number of drives for the io_opt may be OK, but
>>> that should be bounded to some reasonable value. Furthermore, this is likely
>>> suboptimal. I would think that setting the md array io_opt initially to
>>> min(all drives io_opt) x number of drives would be a better default.
>> For raid5, this is not OK; the value has to be chunksize * data disks,
>> regardless of the io_opt of the member disks, otherwise raid5 has to issue
>> additional IO to other disks to build the xor data.
>>
>> For example:
>>
>> - a write of one aligned chunksize to a single disk actually means reading
>>   the old data and the old xor data, then writing the new data and the new
>>   xor data;
>> - a write of aligned chunksize * data disks lets the new xor data be built
>>   directly, without reading the old xor data.
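
(To make the xor arithmetic in my example above concrete, here is a toy
userspace sketch, mine and not kernel code, with a single byte standing in for
a whole chunk:)

/* Toy model of raid5 parity for one stripe with 3 data "chunks"
 * (one byte each here) plus parity. */
#include <assert.h>
#include <stdio.h>

int main(void)
{
	unsigned char d0 = 0xaa, d1 = 0xbb, d2 = 0xcc;	/* old data */
	unsigned char p = d0 ^ d1 ^ d2;			/* old parity */
	unsigned char new_d0 = 0x11;

	/* Sub-stripe write of d0 only: must first READ old d0 and old
	 * parity, then new parity = old parity ^ old data ^ new data. */
	unsigned char rmw_p = p ^ d0 ^ new_d0;

	/* A full-stripe write has all the data in hand, so parity is
	 * computed directly with zero extra reads; both paths agree. */
	assert(rmw_p == (new_d0 ^ d1 ^ d2));

	printf("new parity: 0x%02x\n", rmw_p);
	return 0;
}

So an IO that is an aligned multiple of chunksize * data disks always takes
the second, read-free path.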
> I understand all of that. But you missed my point: io_opt simply indicates an
> upper bound on IO size. If exceeded, performance may be degraded. This has
> *nothing* to do with the IO granularity, which for a RAID array should ideally
> be equal to stride size x number of data disks.
>
> This is the confusion here. md setting io_opt to stride x number of disks in
> the array is simply not what io_opt is supposed to indicate.
OK, can I ask where this "upper bound for IO size" definition comes from?

Checking the git log: starting from commit 7e5f5fb09e6f ("block: Update
topology documentation"), the documentation contains a special explanation for
RAID arrays, and for optimal_io_size it says:
    For RAID arrays it is usually the
    stripe width or the internal track size. A properly aligned
    multiple of optimal_io_size is the preferred request size for
    workloads where sustained throughput is desired.
And this explanation is exactly what raid5 follows; it is important that the
IO size be an aligned multiple of io_opt.
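
For reference, this is roughly how the raid5 limits end up being set (a
simplified sketch under my own names, not the exact code in
drivers/md/raid5.c):

/* Simplified sketch (not the exact kernel code): io_min is one chunk,
 * io_opt is one full stripe of data, so any aligned multiple of
 * io_opt avoids the read-modify-write path entirely. */
struct limits_sketch {
	unsigned int io_min;	/* bytes */
	unsigned int io_opt;	/* bytes */
};

static void raid5_limits_sketch(struct limits_sketch *lim,
				unsigned int chunk_sectors,
				int raid_disks, int max_degraded)
{
	int data_disks = raid_disks - max_degraded;

	lim->io_min = chunk_sectors << 9;	/* one chunk in bytes */
	lim->io_opt = lim->io_min * data_disks;	/* one full data stripe */
}

E.g. a 64K chunk and 4 data disks give io_opt = 256K; a 256K-aligned write
covers whole stripes and never needs to read old parity.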
Thanks,
Kuai