Re: [PATCH] ublk: fix dead loop when canceling io command

Caleb Sander Mateos <csander@xxxxxxxxxxxxxxx> · Thu, 15 May 2025 18:51:52 -0700

On Thu, May 15, 2025 at 9:26 AM Ming Lei <ming.lei@xxxxxxxxxx> wrote:
>
> Commit f40139fde527 ("ublk: fix race between io_uring_cmd_complete_in_task and ublk_cancel_cmd")
> adds request state check in ublk_cancel_cmd(), and if the request is
> started, skip canceling this uring_cmd.
>
> However, the current uring_cmd may be in ACTIVE state, without block
> request coming to the uring command. Meantime, the cached request in
> tag_set.tags[tag] is recycled and has been delivered to ublk server,
> then this uring_cmd can't be canceled.

To check my understanding, the scenario you're describing is that the
request has started but also has already been completed by the ublk
server? And this can happen because tag_set.tags[q_id]->rqs[tag] still
points to the old request even after it has completed? Reading through
blk-mq.c, that does seem possible since rqs[tag] is set in
blk_mq_start_request() but not cleared in __blk_mq_end_request().

>
> ublk requests are aborted in ublk char device release handler, which
> depends on canceling all ACTIVE uring_cmd. So cause dead loop.

Do you mean "deadlock"?

>
> Fix this issue by not taking stale request into account when canceling
> uring_cmd in ublk_cancel_cmd().
>
> Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@xxxxxxx>
> Closes: https://lore.kernel.org/linux-block/mruqwpf4tqenkbtgezv5oxwq7ngyq24jzeyqy4ixzvivatbbxv@4oh2wzz4e6qn/
> Fixes: f40139fde527 ("ublk: fix race between io_uring_cmd_complete_in_task and ublk_cancel_cmd")
> Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
> ---
>  drivers/block/ublk_drv.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index f9032076bc06..dc104c025cd5 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -1708,7 +1708,7 @@ static void ublk_cancel_cmd(struct ublk_queue *ubq, unsigned tag,
>          * that ublk_dispatch_req() is always called
>          */
>         req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);
> -       if (req && blk_mq_request_started(req))
> +       if (req && blk_mq_request_started(req) && req->tag == tag)

Is it possible that req now belongs to a different hctx (q_id)? From a
quick reading of blk-mq.c, it looks like the hctx's requests are
always allocated from the hctx's static_rqs, so I don't think that
should be a problem. Reading req->tag here is probably racy though, as
it's written by blk_mq_rq_ctx_init(), called from
*blk_mq_alloc_request*() on the submitting task. How about checking
blk_mq_rq_state(req) == MQ_RQ_IN_FLIGHT instead of
blk_mq_request_started(req) && req->tag == tag?

Best,
Caleb

>                 return;
>
>         spin_lock(&ubq->cancel_lock);
> --
> 2.47.1
>