Hi, Martin
On 2025/07/29 12:23, Martin K. Petersen wrote:
>> Ok, looks like there are two problems now:
>> a) io_min, the size needed to prevent a performance penalty;
>> 1) For raid5, to avoid read-modify-write, this value should be 448k,
>> but now it's 64k;
> You have two penalties for RAID5: writes smaller than the stripe chunk
> size and writes smaller than the full stripe width.
Yes, the internal IO size for raid5 is 4k; however, only a full stripe
write can avoid read-modify-write, and that is 448k here.
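
To spell out the arithmetic (the 448k and 64k figures above imply 7
data disks, i.e. an 8-disk raid5 with a 64k chunk):

	full stripe width = data_disks * chunk_size = 7 * 64k = 448k

Anything smaller, even a chunk-aligned 64k write, still has to read old
data or parity before the new parity can be computed.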
>> 2) For raid0/raid10, this value is set to 64k now; however, it should
>> not be set at all. If the member disks' value is 4k, issuing 4k IO is
>> just fine, there won't be any performance penalty;
> Correct.
>> 3) For raid1, this value is not set and the member disks' value is
>> used, which is correct.
> Correct.
>> b) io_opt, size to ???
>> 4) For raid0/raid10/raid5, this value is set to the minimal IO size
>> that gets the best performance.
> For RAID 0 you want to set io_opt to the stripe width. io_opt is for
> sequential, throughput-optimized I/O. Presumably the MD stripe chunk
> size has been chosen based on knowledge about the underlying disks and
> their performance. And thus maximum throughput will be achieved when
> doing full stripe writes across all drives.
Yes, raid0/raid10/raid5 all follow the same logic: aligned sequential
IO across a full stripe can reach the number of data disks times
single-disk throughput, as sketched below.
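
To make that concrete, here is a minimal sketch of how a raid0-style
driver stacks these two limits. It is paraphrased from memory in the
style of drivers/md/raid0.c rather than copied verbatim, so field names
may not match the current tree exactly:

	struct queue_limits lim;

	/* io_min: one chunk, in bytes (chunk_sectors is in
	 * 512-byte sectors) */
	lim.io_min = mddev->chunk_sectors << 9;

	/* io_opt: the full stripe width, so a throughput-oriented
	 * sequential writer keeps every member disk busy, e.g.
	 * 8 disks * 64k chunk = 512k for raid0 (no parity disk) */
	lim.io_opt = lim.io_min * mddev->raid_disks;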
> For software RAID I am not sure how much this really matters in a modern
> context. It certainly did 25 years ago when we benchmarked things for
> XFS. Full stripe writes were a big improvement with both software and
> hardware RAID. But how much this matters today, I am not sure.
For raid1, writes will be no faster than a single disk. However, for
reads the io_opt should be the sum of the member disks' io_opt, see
should_choose_next(): for sequential reads, raid1 switches to the next
rdev once io_opt worth of data has been read from the current rdev, as
in the sketch below.
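
A rough paraphrase of that logic (from memory, not a verbatim copy of
drivers/md/raid1.c; the real function has more conditions, e.g. it only
kicks in for non-rotational devices):

	static bool should_choose_next(struct r1conf *conf, int disk)
	{
		struct raid1_info *mirror = &conf->mirrors[disk];
		int opt_iosize = bdev_io_opt(mirror->rdev->bdev) >> 9;

		/* Once a sequential reader has consumed io_opt worth
		 * of sectors from this rdev, rotate to the next mirror
		 * so every copy contributes to read throughput. */
		return opt_iosize > 0 &&
		       mirror->next_seq_sect - mirror->seq_start >= opt_iosize;
	}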
>> 5) For raid1, this value is not set and the member disks' value is used.
> Correct.
>> If io_opt should be an *upper bound*, problem 4) should be fixed like
>> case 5), and other places, like blk_apply_bdi_limits() setting
>> ra_pages from io_opt, should be fixed as well.
> I understand Damien's "upper bound" interpretation but it does not take
> alignment and granularity into account. And both are imperative for
> io_opt.
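
To make the blk_apply_bdi_limits() concern concrete, the readahead
coupling is roughly this (paraphrased from block/blk-settings.c, not a
verbatim copy):

	static void blk_apply_bdi_limits(struct backing_dev_info *bdi,
			struct queue_limits *lim)
	{
		/* ra_pages is derived from io_opt, so an "upper bound"
		 * io_opt would also blow up the default readahead. */
		bdi->ra_pages = max(lim->io_opt * 2 / PAGE_SIZE,
				    VM_READAHEAD_PAGES);
	}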
>> If io_opt should be the *minimal IO size to get best performance*,
> What is "best performance"? IOPS or throughput?
> io_min is about IOPS. io_opt is about throughput.
I mean throughput here.
Thanks,
Kuai