Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order

Damien Le Moal <dlemoal@xxxxxxxxxx> · Fri, 23 May 2025 08:02:38 +0200

On 5/22/25 19:08, Bart Van Assche wrote:
> On 5/21/25 10:12 PM, Damien Le Moal wrote:
>> I am still very confused about how this is possible assuming a well behaved user
>> that actually submits write BIOs in sequence for a zone. That means with a lock
>> around submit_bio() calls. Assuming such user, a large write BIO that is split
>> would have its fragments all processed and added to the target zone plug in
>> order. Another context (or the same context) submitting the next write for that
>> zone would have the same happen, so BIO fragments should not be reordered...
>>
>> So to clarify: are we talking about splits of the BIO that the DM device
>> receives ? Or is it about splits of cloned BIOs that are used to process the
>> BIOs that the DM device received ? The clones are for the underlying device and
>> should not have the zone plugging flag set until the DM target driver submits
>> them, even if the original BIO is flagged with zone plugging. Looking at the bio
>> clone code, the bio flags do not seem to be copied from the source BIO to the
>> clone. So even if the source BIO (the BIO received by the DM device) is flagged
>> with zone write plugging, a clone should not have this flag set until it is
>> submitted.
>>
>> Could you clarify the sequence and BIO flags you see that leads to the issue ?
> 
> Hi Damien,
> 
> In the tests that I ran, F2FS submits bios to a dm driver and the dm
> driver submits these bios to the SCSI disk (sd) driver. F2FS submits

Which DM driver is it ? Does that DM driver have some special work queue
handling of BIO submissions ? Or does it simply remap the BIO and send it down
to the underlying device in the initial submit_bio() context ? If it is the
former case, then that DM driver must enable zone write plugging. If it is the
latter, it should not need zone write plugging and ordering will be handled
correctly throughout the submit_bio() context for the initial DM BIO, assuming
that the submitter does indeed serialize write BIO submissions to a zone. I have
not looked at f2fs code in ages. When I worked on it, there was a mutex to
serialize write issuing to avoid reordering issues...

> bios at the write pointer. If that wouldn't be the case, the following
> code in block/blk-zoned.c would reject these bios:
> 
> 	/*
> 	 * Check for non-sequential writes early as we know that BIOs
> 	 * with a start sector not unaligned to the zone write pointer
> 	 * will fail.
> 	 */
> 	if (bio_offset_from_zone_start(bio) != zwplug->wp_offset)
> 		return false;
> 
> If the bio is larger than 1 MiB, it gets split by the block layer after
> it passed through the dm driver and before it is submitted to the sd
> driver. The UFS driver sets max_sectors to 1 MiB. Although UFS host
> controllers support larger requests, this value has been chosen to
> minimize the impact of writes on read latency.

As mentioned above, the splitting and adding to the zone write plug should all
be serialized by the submitter, using whatever means is appropriate there. As
long as submit_bio() is still processing a large BIO and splitting it, and the
submitter is correctly serializing writes, I do not see how splitting can result
in reordering...
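
To make that reasoning concrete, here is a toy userspace model in C. It is not
kernel code and it does not try to reproduce the reported problem. Every name
in it (model_zone, model_write, MODEL_MAX_SECTORS, the model_* functions) is
made up for illustration, and the 2048-sector limit is only an assumed stand-in
for the 1 MiB max_sectors mentioned in this thread. The only part taken from
the kernel is the shape of the write pointer check quoted from
block/blk-zoned.c. Unlike the kernel, the model keeps going after a rejected
fragment so that rejections can be counted.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed split limit: 1 MiB expressed in 512-byte sectors. */
#define MODEL_MAX_SECTORS 2048u

/* Toy stand-in for a zone write plug: it only tracks the write pointer. */
struct model_zone {
	unsigned int wp_offset;   /* next expected offset within the zone */
	unsigned int nr_rejected; /* fragments that failed the check */
};

/* A write that was assigned a zone offset and is being split. */
struct model_write {
	unsigned int offset;    /* zone offset of the next fragment */
	unsigned int remaining; /* sectors not yet submitted */
};

/* Same shape as the write pointer check quoted from blk-zoned.c. */
static bool model_add_fragment(struct model_zone *z, unsigned int offset,
			       unsigned int nr_sectors)
{
	if (offset != z->wp_offset) {
		z->nr_rejected++;
		return false;
	}
	z->wp_offset += nr_sectors;
	return true;
}

/* Submit one fragment of at most MODEL_MAX_SECTORS; false when done. */
static bool model_submit_next(struct model_zone *z, struct model_write *w)
{
	unsigned int n;

	if (!w->remaining)
		return false;
	n = w->remaining < MODEL_MAX_SECTORS ? w->remaining : MODEL_MAX_SECTORS;
	model_add_fragment(z, w->offset, n);
	w->offset += n;
	w->remaining -= n;
	return true;
}

/* Serialized submitter: all fragments of write A, then all of write B. */
static unsigned int model_run_serialized(unsigned int len_a, unsigned int len_b)
{
	struct model_zone z = { 0, 0 };
	struct model_write a = { 0, len_a };
	struct model_write b = { len_a, len_b };

	while (model_submit_next(&z, &a))
		;
	while (model_submit_next(&z, &b))
		;
	return z.nr_rejected;
}

/* No serialization: fragments of A and B alternate at the zone. */
static unsigned int model_run_interleaved(unsigned int len_a, unsigned int len_b)
{
	struct model_zone z = { 0, 0 };
	struct model_write a = { 0, len_a };
	struct model_write b = { len_a, len_b };
	bool more_a = true, more_b = true;

	while (more_a || more_b) {
		more_a = model_submit_next(&z, &a);
		more_b = model_submit_next(&z, &b);
	}
	return z.nr_rejected;
}
```

In this model, model_run_serialized(5000, 3000) rejects nothing because every
fragment lands exactly at the write pointer, while model_run_interleaved() with
the same sizes sees fragments of the second write arrive ahead of the write
pointer. The interleaved case is only a contrast. It says nothing about what
actually happens below the DM driver in the tests discussed here.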

> Earlier emails in this thread show that the bio splitting below the dm
> driver can cause bio reordering. See also the call stack that is
> available here:
> 
> https://lore.kernel.org/linux-block/47b24ea0-ef8f-441f-b405-a062b986ce93@xxxxxxx/

I asked for clarification in the first place because I still do not understand
what is going on from reading the lightly explained backtrace you show in that
email. A more detailed timeline of what is happening, and in which context,
would very likely clarify exactly what is going on.

So far, the only thing I can think of is that maybe we need to split BIOs in DM
core before submitting them to the DM driver. But I am reluctant to send such a
patch because I cannot justify/explain its need based on your explanations.

-- 
Damien Le Moal
Western Digital Research