Re: [PATCH] ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG)

Ming Lei <ming.lei@xxxxxxxxxx> · Fri, 13 Jun 2025 09:18:06 +0800

On Thu, Jun 12, 2025 at 07:38:01AM -0700, Caleb Sander Mateos wrote:
> On Wed, Jun 11, 2025 at 8:16 PM Ming Lei <ming.lei@xxxxxxxxxx> wrote:
> >
> > On Wed, Jun 11, 2025 at 08:54:53AM -0700, Caleb Sander Mateos wrote:
> > > On Mon, Jun 9, 2025 at 7:07 PM Ming Lei <ming.lei@xxxxxxxxxx> wrote:
> > > >
> > > > On Mon, Jun 09, 2025 at 03:29:34PM -0700, Caleb Sander Mateos wrote:
> > > > > On Mon, Jun 9, 2025 at 5:14 AM Ming Lei <ming.lei@xxxxxxxxxx> wrote:
> > > > > >
> > > > > > Document recently merged feature auto buffer registration(UBLK_F_AUTO_BUF_REG).
> > > > > >
> > > > > > Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
> > > > >
> > > > > Thanks, this is a nice explanation. Just a few suggestions.
> > > > >
> > > > > > ---
> > > > > >  Documentation/block/ublk.rst | 67 ++++++++++++++++++++++++++++++++++++
> > > > > >  1 file changed, 67 insertions(+)
> > > > > >
> > > > > > diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
> > > > > > index c368e1081b41..16ffca54eed4 100644
> > > > > > --- a/Documentation/block/ublk.rst
> > > > > > +++ b/Documentation/block/ublk.rst
> > > > > > @@ -352,6 +352,73 @@ For reaching best IO performance, ublk server should align its segment
> > > > > >  parameter of `struct ublk_param_segment` with backend for avoiding
> > > > > >  unnecessary IO split, which usually hurts io_uring performance.
> > > > > >
> > > > > > +Auto Buffer Registration
> > > > > > +------------------------
> > > > > > +
> > > > > > +The ``UBLK_F_AUTO_BUF_REG`` feature automatically handles buffer registration
> > > > > > +and unregistration for I/O requests, which simplifies the buffer management
> > > > > > +process and reduces overhead in the ublk server implementation.
> > > > > > +
> > > > > > +This is another feature flag for using zero copy, and it is compatible with
> > > > > > +``UBLK_F_SUPPORT_ZERO_COPY``.
> > > > > > +
> > > > > > +Feature Overview
> > > > > > +~~~~~~~~~~~~~~~~
> > > > > > +
> > > > > > +This feature automatically registers request buffers to the io_uring context
> > > > > > +before delivering I/O commands to the ublk server and unregisters them when
> > > > > > +completing I/O commands. This eliminates the need for manual buffer
> > > > > > +registration/unregistration via ``UBLK_IO_REGISTER_IO_BUF`` and
> > > > > > +``UBLK_IO_UNREGISTER_IO_BUF`` commands, then IO handling in ublk server
> > > > > > +can avoid dependency on the two uring_cmd operations.
> > > > > > +
> > > > > > +This way not only simplifies ublk server implementation, but also makes
> > > > > > +concurrent IO handling becomes possible.
> > > > >
> > > > > I'm not sure what "concurrent IO handling" refers to. Any ublk server
> > > > > can handle incoming I/O requests concurrently, regardless of what
> > > > > features it has enabled. Do you mean it avoids the need for linked
> > > > > io_uring requests to properly order buffer registration and
> > > > > unregistration with the I/O operations using the registered buffer?
> > > >
> > > > Yes, if io_uring OPs depends on buffer registering & unregistering, these
> > > > OPs can't be issued concurrently any more, that is one io_uring constraint.
> > > >
> > > > I will add the above words.
> > > >
> > > > >
> > > > > > +
> > > > > > +Usage Requirements
> > > > > > +~~~~~~~~~~~~~~~~~~
> > > > > > +
> > > > > > +1. The ublk server must create a sparse buffer table on the same ``io_ring_ctx``
> > > > > > +   used for ``UBLK_IO_FETCH_REQ`` and ``UBLK_IO_COMMIT_AND_FETCH_REQ``.
> > > > > > +
> > > > > > +2. If uring_cmd is issued on a different ``io_ring_ctx``, manual buffer
> > > > > > +   unregistration is required.
> > > > >
> > > > > nit: don't think this needs to be a separate point, could be combined with (1).
> > > >
> > > > OK.
> > > >
> > > > >
> > > > > > +
> > > > > > +3. Buffer registration data must be passed via uring_cmd's ``sqe->addr`` with the
> > > > > > +   following structure::
> > > > >
> > > > > nit: extra ":"
> > > >
> > > > In reStructuredText (reST), the double colon :: serves as a literal block marker to
> > > > indicate preformatted text.
> > > >
> > > > >
> > > > > > +
> > > > > > +    struct ublk_auto_buf_reg {
> > > > > > +        __u16 index;      /* Buffer index for registration */
> > > > > > +        __u8 flags;       /* Registration flags */
> > > > > > +        __u8 reserved0;   /* Reserved for future use */
> > > > > > +        __u32 reserved1;  /* Reserved for future use */
> > > > > > +    };
> > > > >
> > > > > Suggest using ublk_auto_buf_reg_to_sqe_addr()? Otherwise, it seems
> > > > > ambiguous how this struct is "passed" in sqe->addr.
> > > >
> > > > OK
> > > >
> > > > >
> > > > > > +
> > > > > > +4. All reserved fields in ``ublk_auto_buf_reg`` must be zeroed.
> > > > > > +
> > > > > > +5. Optional flags can be passed via ``ublk_auto_buf_reg.flags``.
> > > > > > +
> > > > > > +Fallback Behavior
> > > > > > +~~~~~~~~~~~~~~~~~
> > > > > > +
> > > > > > +When ``UBLK_AUTO_BUF_REG_FALLBACK`` is enabled:
> > > > > > +
> > > > > > +1. If auto buffer registration fails:
> > > > >
> > > > > I would switch these. Both (1) and (2) refer to when auto buffer
> > > > > registration fails. So I would expect something like:
> > > > >
> > > > > If auto buffer registration fails:
> > > > >
> > > > > 1. When ``UBLK_AUTO_BUF_REG_FALLBACK`` is enabled:
> > > > > ...
> > > > > 2. If fallback is not enabled:
> > > > > ...
> > > > >
> > > > > > +   - The uring_cmd is completed
> > > > >
> > > > > Maybe add "without registering the request buffer"?
> > > > >
> > > > > > +   - ``UBLK_IO_F_NEED_REG_BUF`` is set in ``ublksrv_io_desc.op_flags``
> > > > > > +   - The ublk server must manually register the buffer
> > > > >
> > > > > Only if it wants a registered buffer for the ublk request. Technically
> > > > > the ublk server could decide to fall back on user-copy, for example.
> > > >
> > > > Good catch!
> > > >
> > > > >
> > > > > > +
> > > > > > +2. If fallback is not enabled:
> > > > > > +   - The ublk I/O request fails silently
> > > > >
> > > > > "silently" is a bit ambiguous. It's certainly not silent to the
> > > > > application submitting the ublk I/O. Maybe say that the ublk I/O
> > > > > request fails and no uring_cmd is completed to the ublk server?
> > > >
> > > > Yes, but the document focus on ublk side, and the client is generic
> > > > for every driver, so I guess it may be fine.
> > > >
> > > > >
> > > > > > +
> > > > > > +Limitations
> > > > > > +~~~~~~~~~~~
> > > > > > +
> > > > > > +- Requires same ``io_ring_ctx`` for all operations
> > > > >
> > > > > Another limitation that prevents us from adopting the auto buffer
> > > > > registration feature is the need to reserve a unique buffer table
> > > > > index for every ublk tag on the io_ring_ctx. Since the io_ring_ctx
> > > > > buffer table has a max size of 16K (could potentially be increased to
> > > > > 64K), this limit is easily reached when there are a large number of
> > > > > ublk devices or the ublk queue depth is large. I think we could remove
> > > > > this limitation in the future by adding support for allocating buffer
> > > > > indices on demand, analogous to IORING_FILE_INDEX_ALLOC.
> > > >
> > > > OK.
> > > >
> > > > But I guess it isn't big deal in reality since the task context should
> > > > be saturated easily with so big setting.
> > >
> > > I don't know about your "reality" but it's certainly a big deal for us :)
> > > To reduce contention on the blk-mq queues for the application
> > > submitting I/O to the ublk devices, we want a large number of queues
> > > for each ublk device. But we also want a large queue depth for each
> > > individual queue to avoid the async request allocation path in case
> > > any one application thread issues a lot of concurrent I/O to a single
> > > ublk device. And we have 128 ublk devices, which again all want large
> > > queue depths in case the application sends a lot of I/O to a single
> > > ublk device. The result is that concurrently each ublk server thread
> > > fetches 512K ublk I/Os, which is significantly above the io_ring_ctx
> > > buffer table limit.
> >
> > Yes, you can setup 512K I/Os in single task/io_uring context, but how many
> > can be actively handled during unit time? The number could be much less than
> > 512k or 16K, because it is a single pthread/io_uring/cpu core, which may be
> > saturated easily, so most of these IOs may wait somewhere for cpu or whatever
> > resource.
> 
> Yes, that's exactly my point. Our ublk server only allocates enough
> resources to handle 4K concurrent I/Os per thread. But since we don't
> know which ublk devices or queues might receive the I/Os, we have to
> fetch a queue depth of 4K on *every* ublk device queue. Perhaps the
> batched approach you're working on will help here. But for now, the
> total number of fetched ublk I/Os is an obstacle to adopting auto
> buffer registration.

oops, I forget the point that buffer index has to be provided beforehand,
that is really one limit for your case with too many IOs in single uring
context.

The batched approach may not help too because the model is to issue command
beforehand for fetching new io command.

> And waiting to allocate the buffer index until an
> incoming I/O actually needs to register a buffer seems like a
> straightforward way to avoid this obstacle.

One way is to rely on bpf program to allocate & provide buffer index via
struct_ops, which can be called exactly before registering & unregistering
io buffer. The concept should be simple, but the whole implementation may
take some effort(most are boiler plate).

thanks,
Ming