Re: bio segment constraints

On 4/7/25 03:10, Hannes Reinecke wrote:
On 4/6/25 21:40, Sean Anderson wrote:
Hi all,

I'm not really sure what guarantees the block layer makes regarding the
segments in a bio as part of a request submitted to a block driver. As
far as I can tell this is not documented anywhere. In particular,

- Is bv_len aligned to SECTOR_SIZE?

The block layer always uses a 512 byte sector size, so yes.

- To logical_sector_size?

Not necessarily. Bvecs are a consecutive list of byte ranges which
make up the data portion of a bio.
The logical sector size is a property of the request queue, which is
applied when a request is formed from one or several bios.
For the request, the overall length needs to be a multiple of the logical
sector size, but the individual bios do not.

Oh, so this is worse than I thought. So if you care about e.g. only submitting
I/O in units of logical_block_size, you have to combine segments across the
entire request.
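
To make sure we're talking about the same thing, this is roughly what I
mean by combining segments across the whole request (untested sketch; the
foo_* name is made up, the rest is just the normal request iteration
helpers):

#include <linux/blk-mq.h>

/*
 * Sketch: walk every segment of every bio merged into this request.
 * Individual segments need not be logical-block aligned; only the
 * request as a whole is.
 */
static bool foo_segments_lbs_aligned(struct request *rq)
{
	unsigned int lbs = queue_logical_block_size(rq->q);
	struct req_iterator iter;
	struct bio_vec bvec;
	bool aligned = true;

	rq_for_each_segment(bvec, rq, iter) {
		if (bvec.bv_len % lbs || bvec.bv_offset % lbs)
			aligned = false;
	}

	/* The total length is supposed to be a multiple of lbs. */
	WARN_ON(blk_rq_bytes(rq) % lbs);

	return aligned;
}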

- What if logical_sector_size > PAGE_SIZE?

See above.

- What about bv_offset?

Same story. The eventual request needs to ensure that the offset
and the length are aligned to the logical block size, but the individual
bios might not be.

- Is it possible to have a bio where the total length is a multiple of
   logical_sector_size, but the data is split across several segments
   where each segment is a multiple of SECTOR_SIZE?

Sure.

- Is it possible to have segments not even aligned to SECTOR_SIZE?

Nope.

- Can I somehow request to only get segments with bv_len aligned to
   logical_sector_size? Or do I need to do my own coalescing and bounce
   buffering for that?


The driver surely can. You should be able to set 'max_segment_size' to
the logical block size, and that should give you what you want.

But couldn't I get segments smaller than that? max_segment_size seems like
it would only restrict the maximum size, leaving the possibility open for
smaller segments.
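
For reference, this is roughly how I'm setting up the queue (sketch only,
assuming the queue_limits-based blk_mq_alloc_disk() on recent kernels; the
foo_* names are invented):

#include <linux/blk-mq.h>

static struct gendisk *foo_alloc_disk(struct blk_mq_tag_set *set, void *data)
{
	struct queue_limits lim = {
		.logical_block_size	= 4096,
		.physical_block_size	= 4096,
		/*
		 * Caps every segment at one logical block, but (as far
		 * as I can tell) nothing forbids a smaller segment.
		 */
		.max_segment_size	= 4096,
	};

	return blk_mq_alloc_disk(set, &lim, data);
}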

I've been reading some drivers (as well as stuff in block/) to try and
work things out, but it's hard to find all the places where constraints
are enforced. In particular, I've read several drivers that make some big
assumptions (which might be bugs?). For example, in
drivers/mtd/mtd_blkdevs.c, do_blktrans_request looks like:

In general, the block layer has two major data items, bios and requests.
'struct bio' is the central structure for any 'upper' layers to submit
data (via the 'submit_bio()' function), and 'struct request' is the
central structure for drivers to fetch data for submission to the
hardware (via the 'queue_rq()' blk_mq_ops callback).
And the task of the block layer is to convert 'struct bio' into
'struct request'.
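
Schematically the driver side looks like this (just an illustration, not
the mtd code quoted below; the foo_* names are made up):

#include <linux/blk-mq.h>

static blk_status_t foo_queue_rq(struct blk_mq_hw_ctx *hctx,
				 const struct blk_mq_queue_data *bd)
{
	struct request *rq = bd->rq;

	blk_mq_start_request(rq);

	/*
	 * By the time we get here the block layer has already merged one
	 * or more bios into rq; the driver only sees the request and its
	 * segments, never the individual bios.
	 */
	if (req_op(rq) != REQ_OP_READ && req_op(rq) != REQ_OP_WRITE) {
		blk_mq_end_request(rq, BLK_STS_NOTSUPP);
		return BLK_STS_OK;
	}

	/* ... transfer the data to/from the hardware here ... */

	blk_mq_end_request(rq, BLK_STS_OK);
	return BLK_STS_OK;
}

static const struct blk_mq_ops foo_mq_ops = {
	.queue_rq	= foo_queue_rq,
};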

[ .. ]

For context, tr->blkshift is either 512 or 4096, depending on the
backend. From what I can tell, this code assumes the following:

mtd is probably not a good example, as MTD has its own set of limitations
which might result in certain shortcuts being taken.

Well, I want to write a block driver on top of MTD, so it's a pretty good
example for my purposes :P

- There is only one bio in a request. This one is a bit of a soft
   assumption, since otherwise we would only flush the pages of the first
   bio and not the whole request.
- There is only one segment in a bio. This one could be reasonable if
   max_segments were set to 1, but it isn't as far as I can tell. So I
   guess we just run off the end of the bio if there's a second segment?
- The data is in lowmem OR bv_offset + bv_len <= PAGE_SIZE. kmap() only
   maps a single page, so if we go past one page we end up reading from
   whatever happens to be mapped in the adjacent kmap slot.

Well, that code _does_ look suspicious. It really should be converted
to using the iov iterators.

I had a look at this, but the API isn't documented so I wasn't sure what
I would get out of it. I'll have a closer look.

But then again, it _might_ be okay if there are underlying MTD
restrictions which would devolve into MTD only having a single bvec.

The underlying restriction is that the MTD API expects a buffer with
contiguous kernel virtual addresses. The driver will do bounce-buffering
if it wants to do DMA and virt_addr_valid() is false. The mtd_blkdevs
driver promises to submit buffers of size tr->blksize to the underlying
blktrans driver. This whole thing is not very efficient if the MTD driver
can do scatter-gather DMA, but that's not the API...
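
Concretely, I think the write path ends up looking something like this
(untested sketch, no real error handling; memcpy_from_bvec() and the
blktrans callbacks are real, the foo_* name is mine):

#include <linux/blk-mq.h>
#include <linux/highmem.h>
#include <linux/slab.h>
#include <linux/mtd/blktrans.h>

/*
 * Bounce a write request into one contiguous buffer, then feed it to
 * the blktrans driver one block at a time.
 */
static blk_status_t foo_write_request(struct mtd_blktrans_dev *dev,
				      struct request *rq)
{
	unsigned int shift = dev->tr->blkshift;
	unsigned long block = blk_rq_pos(rq) >> (shift - 9);
	unsigned int len = blk_rq_bytes(rq);
	struct req_iterator iter;
	struct bio_vec bvec;
	char *buf, *p;

	buf = kmalloc(len, GFP_NOIO);
	if (!buf)
		return BLK_STS_RESOURCE;

	/*
	 * rq_for_each_segment() yields single-page segments, and
	 * memcpy_from_bvec() kmaps each one, so highmem and odd
	 * bv_offset/bv_len values are handled for us.
	 */
	p = buf;
	rq_for_each_segment(bvec, rq, iter) {
		memcpy_from_bvec(p, &bvec);
		p += bvec.bv_len;
	}

	for (p = buf; p < buf + len; p += 1 << shift, block++)
		if (dev->tr->writesect(dev, block, p))
			break;

	kfree(buf);
	return p == buf + len ? BLK_STS_OK : BLK_STS_IOERR;
}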

Maybe I should just vmap the entire request?
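
I.e. something along these lines, which only works if every segment turns
out to be a whole, page-aligned page (which is exactly what I can't rely
on); vmap()/vunmap() are real, the foo_* name is made up:

#include <linux/blk-mq.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

/*
 * Sketch: build one contiguous virtual mapping of all pages in a request.
 * Returns NULL (i.e. fall back to bounce buffering) if any segment is not
 * a whole page. Caller does vunmap() and kfree(*pagesp) when done.
 */
static void *foo_vmap_request(struct request *rq, struct page ***pagesp)
{
	unsigned int npages = blk_rq_bytes(rq) >> PAGE_SHIFT;
	struct req_iterator iter;
	struct bio_vec bvec;
	struct page **pages;
	unsigned int i = 0;
	void *addr;

	pages = kcalloc(npages, sizeof(*pages), GFP_NOIO);
	if (!pages)
		return NULL;

	rq_for_each_segment(bvec, rq, iter) {
		if (bvec.bv_offset || bvec.bv_len != PAGE_SIZE) {
			kfree(pages);
			return NULL;
		}
		pages[i++] = bvec.bv_page;
	}

	addr = vmap(pages, npages, VM_MAP, PAGE_KERNEL);
	if (addr)
		*pagesp = pages;
	else
		kfree(pages);
	return addr;
}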

--Sean



