Re: [PATCH V2 1/2] ublk: avoid ublk_io_release() called after ublk char dev is closed

Caleb Sander Mateos <csander@xxxxxxxxxxxxxxx> · Mon, 25 Aug 2025 22:49:49 -0700

On Mon, Aug 25, 2025 at 5:49 AM Ming Lei <ming.lei@xxxxxxxxxx> wrote:
>
> When running test_stress_04.sh, the following warning is triggered:
>
> WARNING: CPU: 1 PID: 135 at drivers/block/ublk_drv.c:1933 ublk_ch_release+0x423/0x4b0 [ublk_drv]
>
> This happens when the daemon is abruptly killed:
>
> - some references may still be held, because registering IO buffer
> doesn't grab ublk char device reference

Ah, good observation. That's definitely a problem.

>
> OR
>
> - io->task_registered_buffers won't be cleared because io buffer is
> released from non-daemon context

I don't think the task_registered_buffers optimization is really
involved here; that's just a different way of tracking the reference
count. Regardless of what task the buffer is registered or
unregistered on, the buffer still counts as 1 reference on the
ublk_io. Summing up the reference counts and making sure they are both
reset to 0 seems like a good approach to me.

>
> For zero-copy and auto buffer register modes, I/O reference crosses
> syscalls, so IO reference may not be dropped naturally when ublk server is
> killed abruptly. However, when releasing io_uring context, it is guaranteed
> that the reference is dropped finally, see io_sqe_buffers_unregister() from
> io_ring_ctx_free().
>
> Fix this by adding ublk_drain_io_references() that:
> - Waits for active I/O references dropped in async way by scheduling
>   work function, for avoiding ublk dev and io_uring file's release
>   dependency
> - Reinitializes io->ref and io->task_registered_buffers to clean state
>
> This ensures the reference count state is clean when ublk_queue_reinit()
> is called, preventing the warning and potential use-after-free.

One scenario I worry about is if the ublk server has already issued
UBLK_IO_COMMIT_AND_FETCH_REQ for an I/O but is killed while it still
has registered buffer(s). It's possible the ublk server hasn't
finished performing I/O to/from the registered buffers and so the I/O
isn't really complete yet. But when io_uring automatically releases
the registered buffers, the reference count will hit 0 and the ublk
I/O will be completed successfully. There seems to be some data
corruption risk in this scenario. But maybe it doesn't make sense for
a ublk server to issue UBLK_IO_COMMIT_AND_FETCH_REQ with a result code
before knowing whether the zero-copy I/Os succeeded?

>
> Fixes: 1f6540e2aabb ("ublk: zc register/unregister bvec")
> Fixes: 1ceeedb59749 ("ublk: optimize UBLK_IO_UNREGISTER_IO_BUF on daemon task")
> Fixes: 8a8fe42d765b ("ublk: optimize UBLK_IO_REGISTER_IO_BUF on daemon task")
> Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
> ---
>  drivers/block/ublk_drv.c | 94 +++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 92 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index 99abd67b708b..f608c7efdc05 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -239,6 +239,7 @@ struct ublk_device {
>         struct mutex cancel_mutex;
>         bool canceling;
>         pid_t   ublksrv_tgid;
> +       struct delayed_work     exit_work;
>  };
>
>  /* header of ublk_params */
> @@ -1595,12 +1596,84 @@ static void ublk_set_canceling(struct ublk_device *ub, bool canceling)
>                 ublk_get_queue(ub, i)->canceling = canceling;
>  }
>
> -static int ublk_ch_release(struct inode *inode, struct file *filp)
> +static void ublk_reset_io_ref(struct ublk_device *ub)
>  {
> -       struct ublk_device *ub = filp->private_data;
> +       int i, j;
> +
> +       if (!(ub->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY |
> +                                       UBLK_F_AUTO_BUF_REG)))
> +               return;
> +
> +       for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> +               struct ublk_queue *ubq = ublk_get_queue(ub, i);
> +
> +               for (j = 0; j < ubq->q_depth; j++) {
> +                       struct ublk_io *io = &ubq->ios[j];
> +                       /*
> +                        * Reinitialize reference counting fields after
> +                        * draining. This ensures clean state for queue
> +                        * reinitialization.
> +                        */
> +                       refcount_set(&io->ref, 0);
> +                       io->task_registered_buffers = 0;
> +               }
> +       }
> +}
> +
> +static bool ublk_has_io_with_active_ref(struct ublk_device *ub)
> +{
> +       int i, j;
> +
> +       if (!(ub->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY |
> +                                       UBLK_F_AUTO_BUF_REG)))
> +               return false;
> +
> +       for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> +               struct ublk_queue *ubq = ublk_get_queue(ub, i);
> +
> +               for (j = 0; j < ubq->q_depth; j++) {
> +                       struct ublk_io *io = &ubq->ios[j];
> +                       unsigned int refs = refcount_read(&io->ref) +
> +                               io->task_registered_buffers;
> +
> +                       /*
> +                        * UBLK_REFCOUNT_INIT or zero means no active
> +                        * reference
> +                        */
> +                       if (refs != UBLK_REFCOUNT_INIT && refs != 0)
> +                               return true;

It's technically possible to hit refs == UBLK_REFCOUNT_INIT by having
UBLK_REFCOUNT_INIT active references. It would be safer to check
UBLK_IO_FLAG_OWNED_BY_SRV: if the flag is set, the reference count
needs to reach UBLK_REFCOUNT_INIT; if the flag is unset, the reference
count needs to reach 0.

> +               }
> +       }
> +       return false;
> +}
> +
> +static void ublk_ch_release_work_fn(struct work_struct *work)
> +{
> +       struct ublk_device *ub =
> +               container_of(work, struct ublk_device, exit_work.work);
>         struct gendisk *disk;
>         int i;
>
> +       /*
> +        * For zero-copy and auto buffer register modes, I/O references
> +        * might not be dropped naturally when the daemon is killed, but
> +        * io_uring guarantees that registered bvec kernel buffers are
> +        * unregistered finally when freeing io_uring context, then the
> +        * active references are dropped.
> +        *
> +        * Wait until active references are dropped for avoiding use-after-free
> +        *
> +        * registered buffer may be unregistered in io_ring's release hander,
> +        * so have to wait by scheduling work function for avoiding the two
> +        * file release dependency.
> +        */
> +       if (ublk_has_io_with_active_ref(ub)) {
> +               schedule_delayed_work(&ub->exit_work, 1);
> +               return;
> +       }
> +
> +       ublk_reset_io_ref(ub);

Why the 2 separate loops over nr_hw_queues and q_depth? Could they be
combined into a single nested loop that waits for each ublk_io's
references to be released and then resets its reference counts to 0?
Looks like the ub->dev_info.flags checks could also be consolidated.

> +
>         /*
>          * disk isn't attached yet, either device isn't live, or it has
>          * been removed already, so we needn't to do anything
> @@ -1673,6 +1746,23 @@ static int ublk_ch_release(struct inode *inode, struct file *filp)
>         ublk_reset_ch_dev(ub);
>  out:
>         clear_bit(UB_STATE_OPEN, &ub->state);
> +
> +       /* put the reference grabbed in ublk_ch_release() */
> +       ublk_put_device(ub);
> +}
> +
> +static int ublk_ch_release(struct inode *inode, struct file *filp)
> +{
> +       struct ublk_device *ub = filp->private_data;
> +
> +       /*
> +        * Grab ublk device reference, so it won't be gone until we are
> +        * really released from work function.
> +        */
> +       ub = ublk_get_device(ub);

Can taking a reference fail here? If so, the NULL return value would
need to be handled. If not, the "ub =" could be dropped.

Best,
Caleb

> +
> +       INIT_DELAYED_WORK(&ub->exit_work, ublk_ch_release_work_fn);
> +       schedule_delayed_work(&ub->exit_work, 0);
>         return 0;
>  }
>
> --
> 2.47.0
>