On Wed, Jul 16, 2025 at 02:58:13PM +0800, Yu Kuai wrote:
> Hi,
> 
> On 2025/07/16 2:02, colyli@xxxxxxxxxx wrote:
> > From: Coly Li <colyli@xxxxxxxxxx>
> > 
> > Currently in md_submit_bio() the incoming request bio is split by
> > bio_split_to_limits(), which makes sure the bio won't exceed
> > max_hw_sectors of a specific raid level before it is sent into its
> > .make_request method.
> > 
> > For raid levels 4/5/6 such a split method can be problematic and hurt
> > large read/write performance. Because limits.max_hw_sectors is not
> > always aligned to limits.io_opt, the split bios won't cover full
> > stripes on all data disks, which introduces extra read-in I/O. Even
> > if the bio's bi_sector is aligned to limits.io_opt and the bio is
> > large enough, the resulting split bio is not size-friendly to the
> > corresponding raid456 level.
> > 
> > This patch introduces bio_split_by_io_opt() to solve the above issue:
> > 1, If the incoming bio is not limits.io_opt aligned, split off the
> >    non-aligned head part. Then the next bio will be aligned.
> > 2, If the incoming bio is limits.io_opt aligned and a split is
> >    necessary, split by a multiple of limits.io_opt that does not
> >    exceed limits.max_hw_sectors.
> > 
> > Then for a large bio, the aligned split parts will cover full stripes
> > on all data disks, with no extra read-in I/O when rmw_level is 0. For
> > rmw_level > 0 conditions, limits.io_opt aligned bios are welcome for
> > performance as well.
> > 
> > This RFC patch is only tested on an 8-disk raid5 array with 64KiB
> > chunk size. With this patch, sequential write performance on such an
> > array increases from 900MiB/s to 1.1GiB/s with fio bs=10M. With fio
> > bs=448K (exactly the limits.io_opt size) the peak sequential write
> > throughput reaches 1.51GiB/s.
> > 
> > (Resend to include Christoph and Keith in the CC list.)
> > 
> > Signed-off-by: Coly Li <colyli@xxxxxxxxxx>
> > Cc: Yu Kuai <yukuai3@xxxxxxxxxx>
> > Cc: Xiao Ni <xni@xxxxxxxxxx>
> > Cc: Hannes Reinecke <hare@xxxxxxx>
> > Cc: Martin Wilck <mwilck@xxxxxxxx>
> > Cc: Christoph Hellwig <hch@xxxxxx>
> > Cc: Keith Busch <kbusch@xxxxxxxxxx>
> > ---
> >   drivers/md/md.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++-
> >   1 file changed, 62 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index 0f03b21e66e4..363cff633af3 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -426,6 +426,67 @@ bool md_handle_request(struct mddev *mddev, struct bio *bio)
> >   }
> >   EXPORT_SYMBOL(md_handle_request);
> > +static struct bio *bio_split_by_io_opt(struct bio *bio)
> > +{
> > +        sector_t io_opt_sectors, sectors, n;
> > +        struct queue_limits lim;
> > +        struct mddev *mddev;
> > +        struct bio *split;
> > +        int level;
> > +
> > +        mddev = bio->bi_bdev->bd_disk->private_data;
> > +        level = mddev->level;
> > +        if (level == 1 || level == 10 || level == 0 || level == LEVEL_LINEAR)
> > +                return bio_split_to_limits(bio);
> 
> There is another patch that provides a helper raid_is_456():
> https://lore.kernel.org/all/20250707165202.11073-3-yukuai@xxxxxxxxxx/
> 
> You might want to use it here.

Copied, I will use raid_is_456() after it gets merged.

> > +
> > +        lim = mddev->gendisk->queue->limits;
> > +        io_opt_sectors = min3(bio_sectors(bio), lim.io_opt >> SECTOR_SHIFT,
> > +                              lim.max_hw_sectors);
> 
> You might want to use max_sectors here, to honor the user setting.
> 
> And max_hw_sectors is just for normal read and write; for other IO like
> discard, atomic write, write zero, the limit is different.
> 

Yeah, this is a reason why I want your comments :-)

Originally I wanted to change bio_split_to_limits(), but the raid5 logic
cannot be accessed there. And writing a clone of bio_split_to_limits()
seems too heavy and unnecessary. How about only handling read/write bios
here, and letting bio_split_to_limits() do the rest, like:

        level = mddev->level;
        if (level == 1 || level == 10 || level == 0 || level == LEVEL_LINEAR ||
            (bio_op(bio) != REQ_OP_READ && bio_op(bio) != REQ_OP_WRITE))
                return bio_split_to_limits(bio);
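Once raid_is_456() from your series is merged, I imagine the entry check
would collapse into something like the untested sketch below (I have not
checked the final signature of raid_is_456(), so assume here it takes the
mddev):

        /*
         * Untested sketch only: fall back to the generic split for every
         * personality other than raid4/5/6, and for anything that is not
         * a plain read or write (discard, write zeroes, atomic write, ...).
         */
        if (!raid_is_456(mddev) ||
            (bio_op(bio) != REQ_OP_READ && bio_op(bio) != REQ_OP_WRITE))
                return bio_split_to_limits(bio);

That way only normal read/write bios take the io_opt-aligned split path.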
> > +
> > +        /* No need to split */
> > +        if (bio_sectors(bio) == io_opt_sectors)
> > +                return bio;
> > +
> 
> If the bio happens to cross two io_opt boundaries, do you think it's
> better to split it here? For example:
> 

I assume the raid5 code will handle this; the pages of this bio will be
mapped to stripe pages belonging to different chunks, so there is no need
to split here. For such a condition the handling is the same as when
calling bio_split_to_limits().

> io_opt is 64k (chunk size) * 7 = 448k; issue an IO starting from 444k
> with len = 8k. raid5 will have to use 2 stripes to handle such an IO.
> 

Yes. I assume that although this bio is not split, raid5 will use
different cloned bios to map the different stripe pages in
__add_stripe_bio(). And because they belong to different chunks, the
extra read-in for XOR (when rmw_level == 0) cannot be avoided.

> > +        n = bio->bi_iter.bi_sector;
> > +        sectors = do_div(n, io_opt_sectors);
> > +        /* Aligned to io_opt size and no need to split for raid456 */
> > +        if (!sectors && (bio_sectors(bio) <= lim.max_hw_sectors))
> > +                return bio;
> 
> I'm confused here: do_div() doesn't mean aligned; should bio_offset() be
> taken into consideration? For example, issue an IO starting from 4k with
> len = 448 * 2 k. If I read the code correctly, the result is:
> 
> 4 + 896 -> 4 + 896 (not split if within max_sectors)
> 
> What we really expect is:
> 
> 4 + 896 -> 4 + 444, 448 + 448, 896 + 4

Yes, you are right. And I do this on purpose, because the bio is small
(less than max_hw_sectors). Even if the bio is split as in your example,
the full-stripe write in the middle doesn't help performance, because
the extra read-in will still happen for the head and tail parts when
rmw_level == 0. For a small bio (size < io_opt x 2 in this case), there
is almost no performance difference.

The performance loss from incorrect bio splitting shows up when the bio
is large enough that many of the split bios in the middle are not
full-stripe covered. For bio size < max_hw_sectors, just let raid5
handle it as it does today.

> > +
> > +        if (sectors) {
> > +                /**
> > +                 * Not aligned to io_opt, split
> > +                 * non-aligned head part.
> > +                 */
> > +                sectors = io_opt_sectors - sectors;
> > +        } else {
> > +                /**
> > +                 * Aligned to io_opt, split to the largest multiple
> > +                 * of io_opt within max_hw_sectors, to make full
> > +                 * stripe write/read for underlying raid456 levels.
> > +                 */
> > +                n = lim.max_hw_sectors;
> > +                do_div(n, io_opt_sectors);
> > +                sectors = n * io_opt_sectors;
> 
> rounddown()?

Yes, of course :-)
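Just to spell out what I have in mind for the next version, the size
calculation would become roughly the untested sketch below, also
switching from max_hw_sectors to the user-visible lim.max_sectors as you
suggested earlier:

        n = bio->bi_iter.bi_sector;
        sectors = do_div(n, io_opt_sectors);
        if (sectors) {
                /* Not io_opt aligned: split off the unaligned head */
                sectors = io_opt_sectors - sectors;
        } else {
                /* Aligned: largest io_opt multiple within max_sectors */
                sectors = rounddown(lim.max_sectors, io_opt_sectors);
        }

so the do_div()/multiplication pair above turns into a single
rounddown().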
> > +        }
> > +
> > +        /* Almost won't happen */
> > +        if (unlikely(sectors >= bio_sectors(bio))) {
> > +                pr_warn("%s raid level %d: sectors %llu >= bio_sectors %u, not split\n",
> > +                        __func__, level, sectors, bio_sectors(bio));
> > +                return bio;
> > +        }
> > +
> > +        split = bio_split(bio, sectors, GFP_NOIO,
> > +                          &bio->bi_bdev->bd_disk->bio_split);
> > +        if (!split)
> > +                return bio;
> > +        split->bi_opf |= REQ_NOMERGE;
> > +        bio_chain(split, bio);
> > +        submit_bio_noacct(bio);
> > +        return split;
> > +}
> > +
> >   static void md_submit_bio(struct bio *bio)
> >   {
> >           const int rw = bio_data_dir(bio);
> > @@ -441,7 +502,7 @@ static void md_submit_bio(struct bio *bio)
> >                   return;
> >           }
> > -        bio = bio_split_to_limits(bio);
> > +        bio = bio_split_by_io_opt(bio);
> >           if (!bio)
> >                   return;
> > 
> 

Thanks for the review.

Coly Li