On Tue, Mar 25, 2025 at 7:30 PM Ming Lei <ming.lei@xxxxxxxxxx> wrote: > > On Tue, Mar 25, 2025 at 08:26:18AM -0700, Caleb Sander Mateos wrote: > > On Mon, Mar 24, 2025 at 6:49 AM Ming Lei <ming.lei@xxxxxxxxxx> wrote: > > > > > > Add words to explain how zero copy feature works, and why it has to be > > > trusted for handling IO read command. > > > > > > Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx> > > > --- > > > Documentation/block/ublk.rst | 28 ++++++++++++++++++++-------- > > > 1 file changed, 20 insertions(+), 8 deletions(-) > > > > > > diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst > > > index 1e0e7358e14a..33efff25b54d 100644 > > > --- a/Documentation/block/ublk.rst > > > +++ b/Documentation/block/ublk.rst > > > @@ -309,18 +309,30 @@ with specified IO tag in the command data: > > > ``UBLK_IO_COMMIT_AND_FETCH_REQ`` to the server, ublkdrv needs to copy > > > the server buffer (pages) read to the IO request pages. > > > > > > -Future development > > > -================== > > > - > > > Zero copy > > > --------- > > > > > > -Zero copy is a generic requirement for nbd, fuse or similar drivers. A > > > -problem [#xiaoguang]_ Xiaoguang mentioned is that pages mapped to userspace > > > -can't be remapped any more in kernel with existing mm interfaces. This can > > > -occurs when destining direct IO to ``/dev/ublkb*``. Also, he reported that > > > -big requests (IO size >= 256 KB) may benefit a lot from zero copy. > > > +ublk zero copy relies on io_uring's fixed kernel buffer, which provides > > > +two APIs: `io_buffer_register_bvec()` and `io_buffer_unregister_bvec`. > > > > nit: missing parentheses after io_buffer_unregister_bvec > > OK > > > > > > + > > > +ublk adds IO command of `UBLK_IO_REGISTER_IO_BUF` to call > > > +`io_buffer_register_bvec()` for ublk server to register client request > > > +buffer into io_uring buffer table, then ublk server can submit io_uring > > > +IOs with the registered buffer index. IO command of `UBLK_IO_UNREGISTER_IO_BUF` > > > +calls `io_buffer_unregister_bvec` to unregister the buffer. > > > > Parentheses missing here too. > > It might be worth adding a note that the registered buffer and any I/O > > that uses it will hold a reference on the ublk request. For a ublk > > server implementer, I think it's useful to know that the buffer needs > > to be unregistered in order to complete the ublk request, and that the > > zero-copy I/O won't corrupt any data even if it completes after the > > buffer is unregistered. > > Good point, how about the following words? > > ``` > The ublk client IO request buffer is guaranteed to be live between calling > `io_buffer_register_bvec()` and `io_buffer_unregister_bvec()`. > > And any io_uring operation which supports this kind of kernel buffer will > grab one reference of the buffer until the operation is completed. > ``` That's definitely better. I think it could be a bit more explicit that the buffer needs to be unregistered in order for the ublk request to complete. > > > > > > + > > > +ublk server implementing zero copy has to be CAP_SYS_ADMIN and be trusted, > > > +because it is ublk server's responsibility to make sure IO buffer filled > > > +with data, and ublk server has to handle short read correctly by returning > > > +correct bytes filled to io buffer. Otherwise, uninitialized kernel buffer > > > +will be exposed to client application. > > > > This isn't specific to zero-copy, right? ublk devices configured with > > UBLK_F_USER_COPY also need to initialize the full request buffer. I > > Right, but it is important for zero copy. Sure. Maybe change "zero copy" to "zero copy or user copy"? > > > would also drop the "handle short read" part; ublk servers don't > > necessarily issue read operations in the backend, so "short read" may > > not be applicable. > > Here 'read' in 'short read' means ublk front READ command, not actual read > done in ublk server. Maybe we can make it more accurate: > > ``` > ..., and ublk server has to return correct result to ublk driver when handling > READ command, and the result has to match with how many bytes filled to the IO > buffer. Otherwise, uninitialized kernel IO buffer will be exposed to client > application. > ``` Thanks for clarifying. That looks great! Best, Caleb