On Sun, Jan 19, 2025 at 10:21:45PM +0200, mcaju95@xxxxxxxxx wrote:
> Greetings,
>
> I've been thinking about a POSIX-like API that would allow
> read/write/send/recv to be zero-copy instead of being buffered. As such,
> storage devices and network sockets could have data transferred directly
> to and from a user-space application's buffers.

Hi Mihai,

You're proposing a particular API. Do you have a kernel-side
implementation of something along these lines? Do you have a
particular user-space use case of your own in mind, or have you
spoken to any potential users?

> My focus was initially on network stacks and I drew inspiration from DPDK.
> I'm also aware of some work underway on extending io_uring to support zero
> copy.

I've not really been following the io_uring work. Can you summarize the
status of their zero-copy support and the advantages that this new
API would bring?

thanks,
-serge

> A draft API would work as follows:
> * The application fills out a series of iovecs with buffers in its own
> memory that can store data from protocols such as TCP or UDP. These iovecs
> will serve as hints that tell the network stack that it can definitely
> map a part of a frame's contents into the described buffers. For example,
> an iovec may contain { .iov_base = 0x4000, .iov_len = 0xa000 }. In this
> case, the data payload may end up anywhere between 0x4000 and 0xe000, and
> after the syscall the iovec's fields will be overwritten to something like
> { .iov_base = 0x4036, .iov_len = 1460 }.
> * In order to receive packets, the application calls readv or a readv-like
> syscall, and its array of iovecs will be modified to point to data payloads.
> Given that their pages will be mapped directly to user space, some header
> fields or tail room may have to be zeroed out before being mapped, in order
> to prevent information leaks.
> Any array of iovecs passed to such a readv syscall should be checked for
> sanity: the buffers must be able to hold data payloads in corner cases,
> must not overlap each other, and must hold addresses that pages can
> actually be mapped to.
> * The return value would be the number of data payloads that have been
> populated. Only that many leading elements of the provided array would
> end up containing data payloads.
> * The syscall's prototype would be nearly identical to that of readv, except
> that iov would not be a const struct iovec *, but just a struct iovec *, and
> the return type would be modified. Like so:
> int zc_readv(int fd, struct iovec *iov, int iovcnt);
>
> * In the case of writes, a struct iovec may not suffice, as the provided
> buffers should not only provide the location and size of the data to be
> sent, but also the guarantee that the buffers have sufficient head and
> tail room. A hackish syscall would look like so:
> int zc_writev(int fd, const struct iovec (*iov)[2], int iovcnt);
> * While the first iovec should describe the entire memory area available to
> a packet, including enough head and tail room for headers and CRCs or other
> fields specific to the NIC, the second should describe a sub-buffer that
> holds the data to be written.
> * Again, sanity checks should be performed on the entire array, for things
> like having enough room for other fields, not overlapping, proper alignment,
> ability to DMA to a device, etc.
> * After calling zc_writev, the pages associated with the provided iovecs are
> immediately swapped for zero pages to avoid data leaks.
> * For writes, arbitrary physical pages may not work for every NIC, as some
> are bound by 32-bit addressing constraints on the PCIe bus, etc. As such,
> the application would have to manage a memory pool associated with each
> file descriptor (possibly a NIC) that would contain memory that is
> physically mapped to areas that can be DMA'ed to the proper devices.
> For example, one may mmap the file descriptor to obtain a pool of a certain
> size for this purpose.
>
> This concept can be extended to storage devices. Unfortunately, I am
> unfamiliar with NVMe and SCSI, so I can only guess that they work in a
> similar manner to NIC rings, in that data can be written to and read from
> arbitrary physical RAM (as allowed by the IOMMU). Syscalls similar to zc_read
> and zc_write can be used on file descriptors pointing to storage devices to
> fetch or write sectors that contain data belonging to files. Some data
> should be zeroed out in this case as well, as sectors more often than not
> will contain data that does not belong to the intended files.
>
> For example, one can mix such syscalls to read directly from storage into
> NIC buffers, providing in-place encryption on the way (via TLS), and send
> them to a client in a similar way to what Netflix does with in-kernel TLS
> and sendfile.
>
> All the best,
> Mihai
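
To make the receive-side semantics in the quoted draft a bit more concrete,
here is a rough user-space sketch in C. It is not a kernel implementation and
zc_readv itself does not exist; the helper names (zc_iov_overlap,
zc_iov_array_sane, zc_mock_narrow) and the min_len parameter are hypothetical
and only illustrate two things the draft describes: the sanity checks on the
hint array (large enough, no wrap-around, no overlap) and the way a hint would
be rewritten to point at the payload. The 0x36 offset in the draft's example
is 54 bytes, which would match Ethernet, IPv4 and TCP headers without options
(14 + 20 + 20), though the draft does not say so.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Hypothetical helper: true if two hint regions share at least one byte. */
static bool zc_iov_overlap(const struct iovec *a, const struct iovec *b)
{
    uintptr_t a0 = (uintptr_t)a->iov_base, a1 = a0 + a->iov_len;
    uintptr_t b0 = (uintptr_t)b->iov_base, b1 = b0 + b->iov_len;

    return a0 < b1 && b0 < a1;
}

/*
 * Hypothetical sanity check over the whole hint array: every hint must be
 * non-NULL, at least min_len bytes long (assumed worst-case frame size),
 * must not wrap around the address space, and must not overlap any other
 * hint. Quadratic in iovcnt, which is acceptable for a sketch.
 */
static bool zc_iov_array_sane(const struct iovec *iov, int iovcnt,
                              size_t min_len)
{
    if (iov == NULL || iovcnt <= 0)
        return false;

    for (int i = 0; i < iovcnt; i++) {
        uintptr_t base = (uintptr_t)iov[i].iov_base;

        if (iov[i].iov_base == NULL || iov[i].iov_len < min_len)
            return false;
        if (base + iov[i].iov_len < base)   /* address wrap-around */
            return false;
        for (int j = 0; j < i; j++)
            if (zc_iov_overlap(&iov[i], &iov[j]))
                return false;
    }
    return true;
}

/*
 * Mock of the rewrite the draft describes: narrow a hint so that it points
 * at the payload, hdr_len bytes into the region, with length payload_len.
 * Returns false and leaves the hint untouched if the payload does not fit.
 */
static bool zc_mock_narrow(struct iovec *hint, size_t hdr_len,
                           size_t payload_len)
{
    assert(hint != NULL);

    if (hdr_len > hint->iov_len || payload_len > hint->iov_len - hdr_len)
        return false;

    hint->iov_base = (void *)((uintptr_t)hint->iov_base + hdr_len);
    hint->iov_len = payload_len;
    return true;
}
```

With the draft's example hint { .iov_base = 0x4000, .iov_len = 0xa000 },
calling zc_mock_narrow(&hint, 0x36, 1460) leaves the hint as
{ .iov_base = 0x4036, .iov_len = 1460 }. A real implementation would also
have to check page alignment and DMA reachability, which a user-space sketch
like this cannot express.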