On Fri, Aug 29, 2025 at 08:37:28AM -0700, Caleb Sander Mateos wrote:
> On Tue, Aug 26, 2025 at 1:33 AM Ming Lei <ming.lei@xxxxxxxxxx> wrote:
> >
> > On Mon, Aug 25, 2025 at 10:49:49PM -0700, Caleb Sander Mateos wrote:
> > > On Mon, Aug 25, 2025 at 5:49 AM Ming Lei <ming.lei@xxxxxxxxxx> wrote:
> > > >
> > > > When running test_stress_04.sh, the following warning is triggered:
> > > >
> > > > WARNING: CPU: 1 PID: 135 at drivers/block/ublk_drv.c:1933 ublk_ch_release+0x423/0x4b0 [ublk_drv]
> > > >
> > > > This happens when the daemon is abruptly killed:
> > > >
> > > > - some references may still be held, because registering IO buffer
> > > >   doesn't grab ublk char device reference
> > >
> > > Ah, good observation. That's definitely a problem.
> > >
> > > > OR
> > > >
> > > > - io->task_registered_buffers won't be cleared because io buffer is
> > > >   released from non-daemon context
> > >
> > > I don't think the task_registered_buffers optimization is really
> > > involved here; that's just a different way of tracking the reference
> > > count. Regardless of what task the buffer is registered or
> > > unregistered on, the buffer still counts as 1 reference on the
> > > ublk_io. Summing up the reference counts and making sure they are both
> > > reset to 0 seems like a good approach to me.
> >
> > The warning in ublk_queue_reinit() is triggered because
> >
> > - the reference is not dropped
> >
> > OR
> >
> > - the io buf unregister is done from another task context (kernel wq), so
> >   both io->ref and io->task_registered_buffers are not zero, which is
> >   started from the task_registered_buffers optimization
>
> Right, but I would consider that to be an outstanding reference. I
> think we're on the same page.
>
> > > > For zero-copy and auto buffer register modes, I/O reference crosses
> > > > syscalls, so IO reference may not be dropped naturally when ublk server is
> > > > killed abruptly.
> > > > However, when releasing io_uring context, it is guaranteed
> > > > that the reference is dropped finally, see io_sqe_buffers_unregister() from
> > > > io_ring_ctx_free().
> > > >
> > > > Fix this by adding ublk_drain_io_references() that:
> > > > - Waits for active I/O references dropped in async way by scheduling
> > > >   work function, for avoiding ublk dev and io_uring file's release
> > > >   dependency
> > > > - Reinitializes io->ref and io->task_registered_buffers to clean state
> > > >
> > > > This ensures the reference count state is clean when ublk_queue_reinit()
> > > > is called, preventing the warning and potential use-after-free.
> > >
> > > One scenario I worry about is if the ublk server has already issued
> > > UBLK_IO_COMMIT_AND_FETCH_REQ for an I/O but is killed while it still
> > > has registered buffer(s). It's possible the ublk server hasn't
> > > finished performing I/O to/from the registered buffers and so the I/O
> > > isn't really complete yet. But when io_uring automatically releases
> > > the registered buffers, the reference count will hit 0 and the ublk
> > > I/O will be completed successfully. There seems to be some data
> > > corruption risk in this scenario.
> >
> > The final io buffer unregister is from freeing the io_uring context in
> > io_uring_release(); any user of this uring context has been done,
> > so it isn't possible that the ublk server is performing io with the
> > unregistered buffer.
>
> Well, the ublk server could have been killed before it could submit an
> I/O using the registered buffer. But it's an existing concern,
> certainly not introduced by your patch. I think you can make a
> reasonable argument that a ublk server shouldn't submit a
> UBLK_IO_COMMIT_AND_FETCH_REQ until it knows the I/O completed
> successfully or with an error, otherwise it won't be able to change
> the result code if the I/O using the registered buffer unexpectedly
> fails.

Yes, that has always been a risk, because the ublk server can do
whatever it likes.
However, since your task_registered_buffers optimization, a ublk
request is no longer completed from freeing the io_uring context; it is
always aborted from ublk_abort_queue(), so it is failed when the ublk
server is killed. The situation can be improved by failing the request
explicitly, setting io->res to -EIO, so that -EIO takes precedence when
the ublk server is killed.

> > > But maybe it doesn't make sense for
> > > a ublk server to issue UBLK_IO_COMMIT_AND_FETCH_REQ with a result code
> > > before knowing whether the zero-copy I/Os succeeded?
> > >
> > > >
> > > > Fixes: 1f6540e2aabb ("ublk: zc register/unregister bvec")
> > > > Fixes: 1ceeedb59749 ("ublk: optimize UBLK_IO_UNREGISTER_IO_BUF on daemon task")
> > > > Fixes: 8a8fe42d765b ("ublk: optimize UBLK_IO_REGISTER_IO_BUF on daemon task")
> > > > Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
> > > > ---
> > > >  drivers/block/ublk_drv.c | 94 +++++++++++++++++++++++++++++++++++++++-
> > > >  1 file changed, 92 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > > > index 99abd67b708b..f608c7efdc05 100644
> > > > --- a/drivers/block/ublk_drv.c
> > > > +++ b/drivers/block/ublk_drv.c
> > > > @@ -239,6 +239,7 @@ struct ublk_device {
> > > >  	struct mutex cancel_mutex;
> > > >  	bool canceling;
> > > >  	pid_t ublksrv_tgid;
> > > > +	struct delayed_work exit_work;
> > > >  };
> > > >
> > > >  /* header of ublk_params */
> > > > @@ -1595,12 +1596,84 @@ static void ublk_set_canceling(struct ublk_device *ub, bool canceling)
> > > >  		ublk_get_queue(ub, i)->canceling = canceling;
> > > >  }
> > > >
> > > > -static int ublk_ch_release(struct inode *inode, struct file *filp)
> > > > +static void ublk_reset_io_ref(struct ublk_device *ub)
> > > >  {
> > > > -	struct ublk_device *ub = filp->private_data;
> > > > +	int i, j;
> > > > +
> > > > +	if (!(ub->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY |
> > > > +					UBLK_F_AUTO_BUF_REG)))
> > > > +		return;
> > > > +
> > > > +	for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> > > > +		struct ublk_queue *ubq = ublk_get_queue(ub, i);
> > > > +
> > > > +		for (j = 0; j < ubq->q_depth; j++) {
> > > > +			struct ublk_io *io = &ubq->ios[j];
> > > > +			/*
> > > > +			 * Reinitialize reference counting fields after
> > > > +			 * draining. This ensures clean state for queue
> > > > +			 * reinitialization.
> > > > +			 */
> > > > +			refcount_set(&io->ref, 0);
> > > > +			io->task_registered_buffers = 0;
> > > > +		}
> > > > +	}
> > > > +}
> > > > +
> > > > +static bool ublk_has_io_with_active_ref(struct ublk_device *ub)
> > > > +{
> > > > +	int i, j;
> > > > +
> > > > +	if (!(ub->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY |
> > > > +					UBLK_F_AUTO_BUF_REG)))
> > > > +		return false;
> > > > +
> > > > +	for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> > > > +		struct ublk_queue *ubq = ublk_get_queue(ub, i);
> > > > +
> > > > +		for (j = 0; j < ubq->q_depth; j++) {
> > > > +			struct ublk_io *io = &ubq->ios[j];
> > > > +			unsigned int refs = refcount_read(&io->ref) +
> > > > +				io->task_registered_buffers;
> > > > +
> > > > +			/*
> > > > +			 * UBLK_REFCOUNT_INIT or zero means no active
> > > > +			 * reference
> > > > +			 */
> > > > +			if (refs != UBLK_REFCOUNT_INIT && refs != 0)
> > > > +				return true;
> > >
> > > It's technically possible to hit refs == UBLK_REFCOUNT_INIT by having
> > > UBLK_REFCOUNT_INIT active references. It would be safer to check
> > > UBLK_IO_FLAG_OWNED_BY_SRV: if the flag is set, the reference count
> > > needs to reach UBLK_REFCOUNT_INIT; if the flag is unset, the reference
> > > count needs to reach 0.
> >
> > It is actually an invariant that the sum of the two is zero or
> > UBLK_REFCOUNT_INIT at any time if the io buffer isn't registered, so
> > it is enough and simpler.
>
> I think it's theoretically possible to make the reference count
> arbitrarily high with ublk registered buffers, which can cause it to
> actually have UBLK_REFCOUNT_INIT references (not because all
> references are released) or the reference count to overflow back to 0.
> One way to do this would be to register the same I/O's buffer in many
> slots of many io_urings at the same time.

The reference count is initialized to UBLK_REFCOUNT_INIT, so the only
way to trigger this is to make the counter overflow first.

> Another possibility would be
> to repeatedly register the buffer, use it in an I/O that never
> completes (e.g. recv from an empty socket), and then unregister the
> buffer. But this reference count overflow issue is an existing issue,
> not introduced by the patch.

Yes, an overflow can be detected and warned about by the refcount
checking, and it is only allowed for a privileged user, so this way
looks just fine.

Thanks,
Ming