Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:

> > The question is what should happen here to a memory span for which the
> > network layer or pipe driver is not allowed to take reference, but rather
> > must call a destructor?  Particularly if, say, it's just a small part of a
> > larger span.
>
> What is a "span" in this context?

In the first case, I was thinking along the lines of a bio_vec that says
{physaddr,len} defining a "span" of memory.  Basically just a contiguous
range of physical addresses, if you prefer.

However, someone can, for example, vmsplice a span of memory into a pipe -
say they add a whole page, all nicely aligned, but then they splice it out a
byte at a time into 4096 other pipes.  Each of those other pipes now has a
small part of a larger span and needs to share the cleanup information.

Now, imagine that a network filesystem writes a message into a TCP socket,
where that message corresponds to an RPC call request and includes a number
of kernel buffers on which the network layer isn't permitted to look at the
refcounts, but must instead call a destructor.  The request message may
transit through the loopback driver and get placed on the Rx queue of
another TCP socket - from whence it may be spliced off into a pipe.

Alternatively, if virtual I/O is involved, this message may get passed down
to a layer outside of the system (though I don't think this is, in
principle, any different from DMA being done by a NIC).

And then there's relayfs and fuse, which seem to do weird stuff.

For the splicing of a loop-backed kernel message out of a TCP socket, it
might make sense just to copy the message at that point.  The problem is
that the kernel doesn't know what's going to happen next to it.

> In general splice unlike direct I/O relies on page reference counts inside
> the splice machinery.  But that is configurable through the
> pipe_buf_operations.  So if you want something to be handled by splice that
> does not use simple page refcounts you need special pipe_buf_operations for
> it.  And you'd better have a really good use case for this to be worthwhile.

Yes.  vmsplice is the equivalent of direct I/O and should really do the same
pinning thing that, say, write() to an O_DIRECT file does.

David
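
For illustration only, here is a minimal userspace sketch of the "share the
cleanup information" idea from the vmsplice example above.  It is not kernel
code and not a proposed API: every name in it (span_cleanup, span_frag,
span_create, span_split, span_put, count_destructor) is hypothetical, and the
plain counter stands in for what would have to be an atomic refcount in the
kernel.  Each fragment of a larger span points at one shared cleanup record,
and the destructor runs exactly once, when the last fragment is released.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical: cleanup state shared by every fragment of one span. */
struct span_cleanup {
	unsigned int users;              /* fragments still outstanding */
	void (*destructor)(void *priv);  /* called once, when users hits 0 */
	void *priv;
};

/* Hypothetical: a {physaddr,len} fragment plus a pointer to shared cleanup. */
struct span_frag {
	unsigned long physaddr;
	size_t len;
	struct span_cleanup *cleanup;
};

/* Create the original span; it is the first user of the cleanup record. */
struct span_frag span_create(unsigned long physaddr, size_t len,
			     void (*destructor)(void *priv), void *priv)
{
	struct span_cleanup *c = malloc(sizeof(*c));
	struct span_frag frag;

	assert(c != NULL);
	c->users = 1;
	c->destructor = destructor;
	c->priv = priv;

	frag.physaddr = physaddr;
	frag.len = len;
	frag.cleanup = c;
	return frag;
}

/* Carve out a sub-range; the new fragment shares the parent's cleanup. */
struct span_frag span_split(const struct span_frag *parent,
			    size_t offset, size_t len)
{
	struct span_frag frag;

	assert(offset <= parent->len && len <= parent->len - offset);
	parent->cleanup->users++;

	frag.physaddr = parent->physaddr + offset;
	frag.len = len;
	frag.cleanup = parent->cleanup;
	return frag;
}

/* Release one fragment; the last release calls the destructor. */
void span_put(struct span_frag *frag)
{
	struct span_cleanup *c = frag->cleanup;

	frag->cleanup = NULL;
	if (--c->users == 0) {
		c->destructor(c->priv);
		free(c);
	}
}

/* Example destructor: counts how many times it has been invoked. */
void count_destructor(void *priv)
{
	++*(int *)priv;
}
```

In this sketch, splitting a 4096-byte span into one-byte fragments would
leave 4097 users of a single cleanup record, and releasing the original span
early would not trigger the destructor while any fragment is still held.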