On Mon, Aug 25, 2025 at 10:49:49PM -0700, Caleb Sander Mateos wrote: > On Mon, Aug 25, 2025 at 5:49 AM Ming Lei <ming.lei@xxxxxxxxxx> wrote: > > > > When running test_stress_04.sh, the following warning is triggered: > > > > WARNING: CPU: 1 PID: 135 at drivers/block/ublk_drv.c:1933 ublk_ch_release+0x423/0x4b0 [ublk_drv] > > > > This happens when the daemon is abruptly killed: > > > > - some references may still be held, because registering IO buffer > > doesn't grab ublk char device reference > > Ah, good observation. That's definitely a problem. > > > > > OR > > > > - io->task_registered_buffers won't be cleared because io buffer is > > released from non-daemon context > > I don't think the task_registered_buffers optimization is really > involved here; that's just a different way of tracking the reference > count. Regardless of what task the buffer is registered or > unregistered on, the buffer still counts as 1 reference on the > ublk_io. Summing up the reference counts and making sure they are both > reset to 0 seems like a good approach to me. The warning in ublk_queue_reinit() is triggered because - the reference is not dropped OR - the io buf unregister is done from another task context(kernel wq), so both io->ref and io->task_registered_buffers are not zero, which is started from task_registered_buffers optimization > > > > > For zero-copy and auto buffer register modes, I/O reference crosses > > syscalls, so IO reference may not be dropped naturally when ublk server is > > killed abruptly. However, when releasing io_uring context, it is guaranteed > > that the reference is dropped finally, see io_sqe_buffers_unregister() from > > io_ring_ctx_free(). > > > > Fix this by adding ublk_drain_io_references() that: > > - Waits for active I/O references dropped in async way by scheduling > > work function, for avoiding ublk dev and io_uring file's release > > dependency > > - Reinitializes io->ref and io->task_registered_buffers to clean state > > > > This ensures the reference count state is clean when ublk_queue_reinit() > > is called, preventing the warning and potential use-after-free. > > One scenario I worry about is if the ublk server has already issued > UBLK_IO_COMMIT_AND_FETCH_REQ for an I/O but is killed while it still > has registered buffer(s). It's possible the ublk server hasn't > finished performing I/O to/from the registered buffers and so the I/O > isn't really complete yet. But when io_uring automatically releases > the registered buffers, the reference count will hit 0 and the ublk > I/O will be completed successfully. There seems to be some data > corruption risk in this scenario. The final io buffer unregister is from freeing io_uring conext in io_uring_release(), any user of this uring context has been done, so it isn't possible that ublk server is performing io with the un-registered buffer. > But maybe it doesn't make sense for > a ublk server to issue UBLK_IO_COMMIT_AND_FETCH_REQ with a result code > before knowing whether the zero-copy I/Os succeeded? > > > > > Fixes: 1f6540e2aabb ("ublk: zc register/unregister bvec") > > Fixes: 1ceeedb59749 ("ublk: optimize UBLK_IO_UNREGISTER_IO_BUF on daemon task") > > Fixes: 8a8fe42d765b ("ublk: optimize UBLK_IO_REGISTER_IO_BUF on daemon task") > > Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx> > > --- > > drivers/block/ublk_drv.c | 94 +++++++++++++++++++++++++++++++++++++++- > > 1 file changed, 92 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c > > index 99abd67b708b..f608c7efdc05 100644 > > --- a/drivers/block/ublk_drv.c > > +++ b/drivers/block/ublk_drv.c > > @@ -239,6 +239,7 @@ struct ublk_device { > > struct mutex cancel_mutex; > > bool canceling; > > pid_t ublksrv_tgid; > > + struct delayed_work exit_work; > > }; > > > > /* header of ublk_params */ > > @@ -1595,12 +1596,84 @@ static void ublk_set_canceling(struct ublk_device *ub, bool canceling) > > ublk_get_queue(ub, i)->canceling = canceling; > > } > > > > -static int ublk_ch_release(struct inode *inode, struct file *filp) > > +static void ublk_reset_io_ref(struct ublk_device *ub) > > { > > - struct ublk_device *ub = filp->private_data; > > + int i, j; > > + > > + if (!(ub->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY | > > + UBLK_F_AUTO_BUF_REG))) > > + return; > > + > > + for (i = 0; i < ub->dev_info.nr_hw_queues; i++) { > > + struct ublk_queue *ubq = ublk_get_queue(ub, i); > > + > > + for (j = 0; j < ubq->q_depth; j++) { > > + struct ublk_io *io = &ubq->ios[j]; > > + /* > > + * Reinitialize reference counting fields after > > + * draining. This ensures clean state for queue > > + * reinitialization. > > + */ > > + refcount_set(&io->ref, 0); > > + io->task_registered_buffers = 0; > > + } > > + } > > +} > > + > > +static bool ublk_has_io_with_active_ref(struct ublk_device *ub) > > +{ > > + int i, j; > > + > > + if (!(ub->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY | > > + UBLK_F_AUTO_BUF_REG))) > > + return false; > > + > > + for (i = 0; i < ub->dev_info.nr_hw_queues; i++) { > > + struct ublk_queue *ubq = ublk_get_queue(ub, i); > > + > > + for (j = 0; j < ubq->q_depth; j++) { > > + struct ublk_io *io = &ubq->ios[j]; > > + unsigned int refs = refcount_read(&io->ref) + > > + io->task_registered_buffers; > > + > > + /* > > + * UBLK_REFCOUNT_INIT or zero means no active > > + * reference > > + */ > > + if (refs != UBLK_REFCOUNT_INIT && refs != 0) > > + return true; > > It's technically possible to hit refs == UBLK_REFCOUNT_INIT by having > UBLK_REFCOUNT_INIT active references. It would be safer to check > UBLK_IO_FLAG_OWNED_BY_SRV: if the flag is set, the reference count > needs to reach UBLK_REFCOUNT_INIT; if the flag is unset, the reference > count needs to reach 0. It is actually one invariant that the two's sum is zero or UBLK_REFCOUNT_INIT any time if the io buffer isn't registered, so it is enough and simpler. > > > + } > > + } > > + return false; > > +} > > + > > +static void ublk_ch_release_work_fn(struct work_struct *work) > > +{ > > + struct ublk_device *ub = > > + container_of(work, struct ublk_device, exit_work.work); > > struct gendisk *disk; > > int i; > > > > + /* > > + * For zero-copy and auto buffer register modes, I/O references > > + * might not be dropped naturally when the daemon is killed, but > > + * io_uring guarantees that registered bvec kernel buffers are > > + * unregistered finally when freeing io_uring context, then the > > + * active references are dropped. > > + * > > + * Wait until active references are dropped for avoiding use-after-free > > + * > > + * registered buffer may be unregistered in io_ring's release hander, > > + * so have to wait by scheduling work function for avoiding the two > > + * file release dependency. > > + */ > > + if (ublk_has_io_with_active_ref(ub)) { > > + schedule_delayed_work(&ub->exit_work, 1); > > + return; > > + } > > + > > + ublk_reset_io_ref(ub); > > Why the 2 separate loops over nr_hw_queues and q_depth? Could they be > combined into a single nested loop that waits for each ublk_io's > references to be released and then resets its reference counts to 0? > Looks like the ub->dev_info.flags checks could also be consolidated. This way looks more readable, otherwise ublk_has_io_with_active_ref() has to check and reset. Not a problem, I can move it to one single helper. > > > + > > /* > > * disk isn't attached yet, either device isn't live, or it has > > * been removed already, so we needn't to do anything > > @@ -1673,6 +1746,23 @@ static int ublk_ch_release(struct inode *inode, struct file *filp) > > ublk_reset_ch_dev(ub); > > out: > > clear_bit(UB_STATE_OPEN, &ub->state); > > + > > + /* put the reference grabbed in ublk_ch_release() */ > > + ublk_put_device(ub); > > +} > > + > > +static int ublk_ch_release(struct inode *inode, struct file *filp) > > +{ > > + struct ublk_device *ub = filp->private_data; > > + > > + /* > > + * Grab ublk device reference, so it won't be gone until we are > > + * really released from work function. > > + */ > > + ub = ublk_get_device(ub); > > Can taking a reference fail here? If so, the NULL return value would > need to be handled. If not, the "ub =" could be dropped. Of course it has to get a reference, otherwise it isn't safe to use `ub` is the release handler, I will drop "ub =". Thanks, Ming