On Thu, May 15, 2025 at 6:51 PM Caleb Sander Mateos <csander@xxxxxxxxxxxxxxx> wrote:
>
> On Thu, May 15, 2025 at 9:26 AM Ming Lei <ming.lei@xxxxxxxxxx> wrote:
> >
> > Commit f40139fde527 ("ublk: fix race between io_uring_cmd_complete_in_task and ublk_cancel_cmd")
> > adds request state check in ublk_cancel_cmd(), and if the request is
> > started, skip canceling this uring_cmd.
> >
> > However, the current uring_cmd may be in ACTIVE state, without block
> > request coming to the uring command. Meantime, the cached request in
> > tag_set.tags[tag] is recycled and has been delivered to ublk server,
> > then this uring_cmd can't be canceled.
>
> To check my understanding, the scenario you're describing is that the
> request has started but also has already been completed by the ublk
> server? And this can happen because tag_set.tags[q_id]->rqs[tag] still
> points to the old request even after it has completed? Reading through
> blk-mq.c, that does seem possible since rqs[tag] is set in
> blk_mq_start_request() but not cleared in __blk_mq_end_request().
>
> >
> > ublk requests are aborted in ublk char device release handler, which
> > depends on canceling all ACTIVE uring_cmd. So cause dead loop.
>
> Do you mean "deadlock"?
>
> >
> > Fix this issue by not taking stale request into account when canceling
> > uring_cmd in ublk_cancel_cmd().
> >
> > Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@xxxxxxx>
> > Closes: https://lore.kernel.org/linux-block/mruqwpf4tqenkbtgezv5oxwq7ngyq24jzeyqy4ixzvivatbbxv@4oh2wzz4e6qn/
> > Fixes: f40139fde527 ("ublk: fix race between io_uring_cmd_complete_in_task and ublk_cancel_cmd")
> > Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
> > ---
> >  drivers/block/ublk_drv.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > index f9032076bc06..dc104c025cd5 100644
> > --- a/drivers/block/ublk_drv.c
> > +++ b/drivers/block/ublk_drv.c
> > @@ -1708,7 +1708,7 @@ static void ublk_cancel_cmd(struct ublk_queue *ubq, unsigned tag,
> >  	 * that ublk_dispatch_req() is always called
> >  	 */
> >  	req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);
> > -	if (req && blk_mq_request_started(req))
> > +	if (req && blk_mq_request_started(req) && req->tag == tag)
>
> Is it possible that req now belongs to a different hctx (q_id)? From a
> quick reading of blk-mq.c, it looks like the hctx's requests are
> always allocated from the hctx's static_rqs, so I don't think that
> should be a problem. Reading req->tag here is probably racy though, as
> it's written by blk_mq_rq_ctx_init(), called from
> *blk_mq_alloc_request*() on the submitting task. How about checking
> blk_mq_rq_state(req) == MQ_RQ_IN_FLIGHT instead of
> blk_mq_request_started(req) && req->tag == tag?

Ah, never mind, I see what you mean. The struct request could have
already been reused for a new I/O to the ublk queue and associated
with a different tag. And that new request could be inflight even
though the old one has completed.

Really this condition intends to ask "is this tag active?" I guess you
could use sbitmap_test_bit(&tag_set.tags[q_id].bitmap_tags, tag)
(exposed through a blk-mq helper) to determine that more directly. But
the quick fix probably makes sense. It's technically racy, as the load
and store of req->state need to use acquire and release ordering to
synchronize the accesses to req->tag, but it's probably unlikely to
observe that reordering in practice.

Best,
Caleb
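
P.S. To illustrate what I meant by "exposed through a blk-mq helper",
something along these lines (untested sketch; blk_mq_tag_is_active() is a
hypothetical helper that doesn't exist today, it ignores reserved tags
since ublk doesn't use them, and it assumes struct blk_mq_tags /
sbitmap_test_() are reachable from the caller):

	/*
	 * Hypothetical blk-mq helper: is @tag currently allocated in @tags?
	 * bitmap_tags is a struct sbitmap_queue, so test the underlying sbitmap.
	 */
	static inline bool blk_mq_tag_is_active(struct blk_mq_tags *tags,
						unsigned int tag)
	{
		return sbitmap_test_bit(&tags->bitmap_tags.sb, tag);
	}

Then ublk_cancel_cmd() could ask about the tag directly instead of chasing
the (possibly recycled) struct request, roughly:

	if (blk_mq_tag_is_active(ub->tag_set.tags[ubq->q_id], tag))
		return;

Though as noted above, the tag can still be freed concurrently, so this
would be no less racy than the req-based check, and the quick fix in the
patch is probably the right call.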