On Fri, Aug 29, 2025 at 08:37:28AM -0700, Caleb Sander Mateos wrote:
> On Tue, Aug 26, 2025 at 1:33 AM Ming Lei <ming.lei@xxxxxxxxxx> wrote:
> >
> > On Mon, Aug 25, 2025 at 10:49:49PM -0700, Caleb Sander Mateos wrote:
> > > On Mon, Aug 25, 2025 at 5:49 AM Ming Lei <ming.lei@xxxxxxxxxx> wrote:
> > > >
> > > > When running test_stress_04.sh, the following warning is triggered:
> > > >
> > > > WARNING: CPU: 1 PID: 135 at drivers/block/ublk_drv.c:1933 ublk_ch_release+0x423/0x4b0 [ublk_drv]
> > > >
> > > > This happens when the daemon is abruptly killed:
> > > >
> > > > - some references may still be held, because registering IO buffer
> > > >   doesn't grab ublk char device reference
> > >
> > > Ah, good observation. That's definitely a problem.
> > >
> > > > OR
> > > >
> > > > - io->task_registered_buffers won't be cleared because io buffer is
> > > >   released from non-daemon context
> > >
> > > I don't think the task_registered_buffers optimization is really
> > > involved here; that's just a different way of tracking the reference
> > > count. Regardless of what task the buffer is registered or
> > > unregistered on, the buffer still counts as 1 reference on the
> > > ublk_io. Summing up the reference counts and making sure they are both
> > > reset to 0 seems like a good approach to me.
> >
> > The warning in ublk_queue_reinit() is triggered because
> >
> > - the reference is not dropped
> >
> > OR
> >
> > - the io buf unregister is done from another task context (kernel wq), so
> >   both io->ref and io->task_registered_buffers are not zero, which is
> >   started from the task_registered_buffers optimization
>
> Right, but I would consider that to be an outstanding reference. I
> think we're on the same page.
>
> > > > For zero-copy and auto buffer register modes, I/O reference crosses
> > > > syscalls, so IO reference may not be dropped naturally when ublk server is
> > > > killed abruptly.
> > > > However, when releasing io_uring context, it is guaranteed
> > > > that the reference is dropped finally, see io_sqe_buffers_unregister() from
> > > > io_ring_ctx_free().
> > > >
> > > > Fix this by adding ublk_drain_io_references() that:
> > > > - Waits for active I/O references dropped in async way by scheduling
> > > >   work function, for avoiding ublk dev and io_uring file's release
> > > >   dependency
> > > > - Reinitializes io->ref and io->task_registered_buffers to clean state
> > > >
> > > > This ensures the reference count state is clean when ublk_queue_reinit()
> > > > is called, preventing the warning and potential use-after-free.
> > >
> > > One scenario I worry about is if the ublk server has already issued
> > > UBLK_IO_COMMIT_AND_FETCH_REQ for an I/O but is killed while it still
> > > has registered buffer(s). It's possible the ublk server hasn't
> > > finished performing I/O to/from the registered buffers and so the I/O
> > > isn't really complete yet. But when io_uring automatically releases
> > > the registered buffers, the reference count will hit 0 and the ublk
> > > I/O will be completed successfully. There seems to be some data
> > > corruption risk in this scenario.
> >
> > The final io buffer unregister is from freeing the io_uring context in
> > io_uring_release(); any user of this uring context has been done,
> > so it isn't possible that the ublk server is performing io with the
> > unregistered buffer.
>
> Well, the ublk server could have been killed before it could submit an
> I/O using the registered buffer. But it's an existing concern,
> certainly not introduced by your patch. I think you can make a
> reasonable argument that a ublk server shouldn't submit a
> UBLK_IO_COMMIT_AND_FETCH_REQ until it knows the I/O completed
> successfully or with an error, otherwise it won't be able to change
> the result code if the I/O using the registered buffer unexpectedly
> fails.

Yes, that has always been a risk, because the ublk server can do
whatever it likes.
However, since your task_registered_buffers optimization, a ublk
request is no longer completed from freeing the io_uring context; it is
always aborted from ublk_abort_queue(), so it is failed when the ublk
server is killed. The situation can be improved by failing the request
explicitly, setting io->res to -EIO, so that -EIO takes precedence when
the ublk server is killed.

> > > But maybe it doesn't make sense for
> > > a ublk server to issue UBLK_IO_COMMIT_AND_FETCH_REQ with a result code
> > > before knowing whether the zero-copy I/Os succeeded?
> > >
> > > >
> > > > Fixes: 1f6540e2aabb ("ublk: zc register/unregister bvec")
> > > > Fixes: 1ceeedb59749 ("ublk: optimize UBLK_IO_UNREGISTER_IO_BUF on daemon task")
> > > > Fixes: 8a8fe42d765b ("ublk: optimize UBLK_IO_REGISTER_IO_BUF on daemon task")
> > > > Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
> > > > ---
> > > >  drivers/block/ublk_drv.c | 94 +++++++++++++++++++++++++++++++++++++++-
> > > >  1 file changed, 92 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > > > index 99abd67b708b..f608c7efdc05 100644
> > > > --- a/drivers/block/ublk_drv.c
> > > > +++ b/drivers/block/ublk_drv.c
> > > > @@ -239,6 +239,7 @@ struct ublk_device {
> > > >  	struct mutex cancel_mutex;
> > > >  	bool canceling;
> > > >  	pid_t ublksrv_tgid;
> > > > +	struct delayed_work exit_work;
> > > >  };
> > > >
> > > >  /* header of ublk_params */
> > > > @@ -1595,12 +1596,84 @@ static void ublk_set_canceling(struct ublk_device *ub, bool canceling)
> > > >  		ublk_get_queue(ub, i)->canceling = canceling;
> > > >  }
> > > >
> > > > -static int ublk_ch_release(struct inode *inode, struct file *filp)
> > > > +static void ublk_reset_io_ref(struct ublk_device *ub)
> > > >  {
> > > > -	struct ublk_device *ub = filp->private_data;
> > > > +	int i, j;
> > > > +
> > > > +	if (!(ub->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY |
> > > > +					UBLK_F_AUTO_BUF_REG)))
> > > > +		return;
> > > > +
> > > > +	for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> > > > +		struct ublk_queue *ubq = ublk_get_queue(ub, i);
> > > > +
> > > > +		for (j = 0; j < ubq->q_depth; j++) {
> > > > +			struct ublk_io *io = &ubq->ios[j];
> > > > +			/*
> > > > +			 * Reinitialize reference counting fields after
> > > > +			 * draining. This ensures clean state for queue
> > > > +			 * reinitialization.
> > > > +			 */
> > > > +			refcount_set(&io->ref, 0);
> > > > +			io->task_registered_buffers = 0;
> > > > +		}
> > > > +	}
> > > > +}
> > > > +
> > > > +static bool ublk_has_io_with_active_ref(struct ublk_device *ub)
> > > > +{
> > > > +	int i, j;
> > > > +
> > > > +	if (!(ub->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY |
> > > > +					UBLK_F_AUTO_BUF_REG)))
> > > > +		return false;
> > > > +
> > > > +	for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> > > > +		struct ublk_queue *ubq = ublk_get_queue(ub, i);
> > > > +
> > > > +		for (j = 0; j < ubq->q_depth; j++) {
> > > > +			struct ublk_io *io = &ubq->ios[j];
> > > > +			unsigned int refs = refcount_read(&io->ref) +
> > > > +				io->task_registered_buffers;
> > > > +
> > > > +			/*
> > > > +			 * UBLK_REFCOUNT_INIT or zero means no active
> > > > +			 * reference
> > > > +			 */
> > > > +			if (refs != UBLK_REFCOUNT_INIT && refs != 0)
> > > > +				return true;
> > >
> > > It's technically possible to hit refs == UBLK_REFCOUNT_INIT by having
> > > UBLK_REFCOUNT_INIT active references. It would be safer to check
> > > UBLK_IO_FLAG_OWNED_BY_SRV: if the flag is set, the reference count
> > > needs to reach UBLK_REFCOUNT_INIT; if the flag is unset, the reference
> > > count needs to reach 0.
> >
> > It is actually an invariant that the sum of the two is zero or
> > UBLK_REFCOUNT_INIT at any time if the io buffer isn't registered, so
> > it is enough and simpler.
>
> I think it's theoretically possible to make the reference count
> arbitrarily high with ublk registered buffers, which can cause it to
> actually have UBLK_REFCOUNT_INIT references (not because all
> references are released) or the reference count to overflow back to 0.
> One way to do this would be to register the same I/O's buffer in many
> slots of many io_urings at the same time.

The reference count is initialized to UBLK_REFCOUNT_INIT, so the only
way to trigger this is to make the counter overflow first.

> Another possibility would be
> to repeatedly register the buffer, use it in an I/O that never
> completes (e.g. recv from an empty socket), and then unregister the
> buffer. But this reference count overflow issue is an existing issue,
> not introduced by the patch.

Yes, an overflow can be detected and warned about by the refcount
checking, and it is only allowed for a privileged user, so this way
looks just fine.

Thanks,
Ming