Re: [PATCH 2/2] ublk: fix race between io_uring_cmd_complete_in_task and ublk_cancel_cmd

Caleb Sander Mateos <csander@xxxxxxxxxxxxxxx> · Wed, 23 Apr 2025 09:48:55 -0700

On Wed, Apr 23, 2025 at 8:39 AM Ming Lei <ming.lei@xxxxxxxxxx> wrote:
>
> On Wed, Apr 23, 2025 at 08:08:17AM -0700, Caleb Sander Mateos wrote:
> > On Wed, Apr 23, 2025 at 2:24 AM Ming Lei <ming.lei@xxxxxxxxxx> wrote:
> > >
> > > ublk_cancel_cmd() calls io_uring_cmd_done() to complete uring_cmd, but
> > > we may have scheduled task work via io_uring_cmd_complete_in_task() for
> > > dispatching request, then kernel crash can be triggered.
> > >
> > > Fix it by not trying to canceling the command if ublk block request is
> > > coming to this slot.
> > >
> > > Reported-by: Jared Holzman <jholzman@xxxxxxxxxx>
> > > Closes: https://lore.kernel.org/linux-block/d2179120-171b-47ba-b664-23242981ef19@xxxxxxxxxx/
> > > Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
> > > ---
> > >  drivers/block/ublk_drv.c | 37 +++++++++++++++++++++++++++++++------
> > >  1 file changed, 31 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > > index c4d4be4f6fbd..fbfb5b815c8d 100644
> > > --- a/drivers/block/ublk_drv.c
> > > +++ b/drivers/block/ublk_drv.c
> > > @@ -1334,6 +1334,12 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
> > >         if (res != BLK_STS_OK)
> > >                 return res;
> > >
> > > +       /*
> > > +        * Order writing to rq->state in blk_mq_start_request() and
> > > +        * reading ubq->canceling, see comment in ublk_cancel_command()
> > > +        * wrt. the pair barrier.
> > > +        */
> > > +       smp_mb();
> >
> > Adding an mfence to every ublk I/O would be really unfortunate. Memory
> > barriers are very expensive in a system with a lot of CPUs. Why can't
>
> I believe perf effect from the little smp_mb() may not be observed, actually
> there are several main contributions for ublk perf per my last profiling:

I have seen a single mfence instruction in the I/O path account for
multiple percent of the CPU profile in the past. The cost of a fence
scales superlinearly with the number of CPUs, so it can be a real
parallelism bottleneck. I'm not opposed to the memory barrier if it's
truly necessary for correctness, but I would love to consider any
alternatives.

I have been looking at ublk zero-copy CPU profiles recently, and there
the most expensive instructions there are the atomic
reference-counting in ublk_get_req_ref()/ublk_put_req_ref(). I have
some ideas to reduce that overhead.

Best,
Caleb