Re: [PATCH 5/5] blk-mq: Replace tags->lock with SRCU for tag iterators

Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> · Thu, 7 Aug 2025 11:44:38 +0800

Hi,

在 2025/08/07 10:12, Ming Lei 写道:
On Thu, Aug 07, 2025 at 09:23:24AM +0800, Yu Kuai wrote:
Hi,

在 2025/08/06 21:28, Ming Lei 写道:
On Wed, Aug 06, 2025 at 05:21:32PM +0800, Yu Kuai wrote:
Hi,

在 2025/08/01 19:44, Ming Lei 写道:
Replace the spinlock in blk_mq_find_and_get_req() with an SRCU read lock
around the tag iterators.

This is done by:

- Holding the SRCU read lock in blk_mq_queue_tag_busy_iter(),
blk_mq_tagset_busy_iter(), and blk_mq_hctx_has_requests().

- Removing the now-redundant tags->lock from blk_mq_find_and_get_req().

This change improves performance by replacing a spinlock with a more
scalable SRCU lock, and fixes lockup issue in scsi_host_busy() in case of
shost->host_blocked.

Meantime it becomes possible to use blk_mq_in_driver_rw() for io
accounting.

Signed-off-by: Ming Lei<ming.lei@xxxxxxxxxx>
---
    block/blk-mq-tag.c | 12 ++++++++----
    block/blk-mq.c     | 24 ++++--------------------
    2 files changed, 12 insertions(+), 24 deletions(-)

I think it's not good to use blk_mq_in_driver_rw() for io accounting, we
start io accounting from blk_account_io_start(), where such io is not in
driver yet.

In blk_account_io_start(), the current IO is _not_ taken into account in
update_io_ticks() yet.

However, this is exactly what we did from coding for a long time, for
example, consider just one IO:

t1: blk_account_io_start
t2: blk_mq_start_request
t3: blk_account_io_done

The update_io_ticks() is called from blk_account_io_start() and
blk_account_io_done(), the time (t3 - t1) will be accounted into
io_ticks.

That still may not be correct, please see "Documentation/block/stat.rst":

```
io_ticks        milliseconds  total time this block device has been active
```

What I meant is that it doesn't matter wrt. "io accounting from
blk_account_io_start()", because the current io is excluded from `inflight ios` in
update_io_ticks() from blk_account_io_start().

So, unless we move update_io_ticks() to blk_mq_start_request(),
blk_account_io_start() is always we start accouting from, no matter we
use percpu counter or blk_mq_in_driver_rw(), the inflight ios should
be 0 for the first io.

And if we consider more IO, there will be a mess:

t1: IO a: blk_account_io_start
t2: IO b: blk_account_io_start
t3: IO a: blk_mq_start_request
t4: IO b: blk_mq_start_request
t5: IO a: blk_account_io_done
t6: IO b: blk_account_io_done

In the old cases that IO is inflight untill blk_mq_start_request, the
io_ticks accounted is t6 - t2, which is werid. That's the reason I
changed the IO accouting, and consider IO inflight from
blk_account_io_start, and this will unify all the fields in iostat.

In reality implementation may include odd things, but the top thing is that
what/how 'io_ticks' should be defined in theory? same with util%.

Yes, for now I prefer the current defination, let iostat start
everything from blk_account_io_start.

However, I'm fine if we decide to move update_io_ticks to
blk_mq_start_request(). One concern is that does the overhead of req
walking per jiffies acceptable?

Also please look at 'man iostat':

      %util  Percentage  of  elapsed time during which I/O requests were issued to the device (bandwidth utilization for the device). Device
             saturation occurs when this value is close to 100% for devices serving requests serially.  But for devices serving requests  in
             parallel, such as RAID arrays and modern SSDs, this number does not reflect their performance limits.

which shows %util in device level, not from queuing IO to complete io from device.

That said the current approach for counting inflight IO via percpu counter
from blk_account_io_start() is not correct from %util viewpoint from
request based driver.

I'll prefer to update the documents, on the one hand, util is not

Can we update every distributed iostat's man page? And we can't refresh
every user's mind about %util's definition in short time.

Also what/how should we update the document to? which is one serious thing.

I never do this before, however, I believe there is a way :) I can dig
deeper after we are in aggrement.

Thanks,
Kuai

Thanks,
Ming

.