Re: AF_UNIX/zerocopy/pipe/vmsplice/splice vs FOLL_PIN

Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:

> > The question is what should happen here to a memory span for which the
> > network layer or pipe driver is not allowed to take reference, but rather
> > must call a destructor?  Particularly if, say, it's just a small part of a
> > larger span.
> 
> What is a "span" in this context?

In the first case, I was thinking along the lines of a bio_vec that says
{physaddr,len} defining a "span" of memory.  Basically just a contiguous range
of physical addresses, if you prefer.
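
To make that concrete, here is a rough userspace sketch of such a span and of
carving a smaller piece out of it.  This is only an illustration: struct span
and span_sub() are made-up names, not the kernel's struct bio_vec or any
existing helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration only; this is NOT the kernel's struct bio_vec. */
struct span {
	uint64_t physaddr;	/* start of the contiguous physical range */
	size_t   len;		/* length of the range in bytes */
};

/*
 * Carve the sub-span [off, off + n) out of a parent span.
 * Returns 0 on success, or -1 if the request falls outside the parent.
 */
static int span_sub(const struct span *parent, size_t off, size_t n,
		    struct span *out)
{
	if (off > parent->len || n > parent->len - off)
		return -1;
	out->physaddr = parent->physaddr + off;
	out->len = n;
	return 0;
}
```

Note that a sub-span built this way carries no record of where it came from,
which is exactly the gap: nothing in {physaddr,len} says how to clean up.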

However, someone can, for example, vmsplice a span of memory into a pipe - say
they add a whole page, all nicely aligned, but then they splice it out a byte
at a time into 4096 other pipes.  Each of those other pipes now has a small
part of a larger span and needs to share the cleanup information.
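
From userspace, that scenario looks roughly like the sketch below.  It assumes
Linux with glibc; fan_out_one_page() and NR_DEST are names invented here, and
NR_DEST is 8 rather than 4096 only so the example stays under the usual
per-process file descriptor limit.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

#define NR_DEST 8	/* illustrative; the scenario above uses 4096 */

/*
 * vmsplice one page-aligned page into a source pipe, then splice it out
 * one byte at a time into NR_DEST other pipes.  Each byte is read back
 * into out[] so the caller can check it.  Returns 0 on success, -1 on
 * any failure.
 */
static int fan_out_one_page(char out[NR_DEST])
{
	long psz = sysconf(_SC_PAGESIZE);
	int src[2], dst[NR_DEST][2];
	struct iovec iov;
	void *page = NULL;
	int i, made = 0, ret = -1;

	if (psz <= 0 || pipe(src) < 0)
		return -1;
	if (posix_memalign(&page, (size_t)psz, (size_t)psz) != 0) {
		page = NULL;
		goto out_src;
	}
	for (i = 0; i < psz; i++)
		((char *)page)[i] = (char)('a' + i % 26);

	/* Hand the whole user page to the pipe. */
	iov.iov_base = page;
	iov.iov_len = (size_t)psz;
	if (vmsplice(src[1], &iov, 1, 0) != psz)
		goto out_src;

	for (i = 0; i < NR_DEST; i++) {
		if (pipe(dst[i]) < 0)
			goto out_dst;
		made = i + 1;
		/* Move exactly one byte of the page into this pipe. */
		if (splice(src[0], NULL, dst[i][1], NULL, 1, 0) != 1)
			goto out_dst;
	}

	/* Every destination pipe now holds a one-byte piece of the page. */
	for (i = 0; i < NR_DEST; i++)
		if (read(dst[i][0], &out[i], 1) != 1)
			goto out_dst;
	ret = 0;

out_dst:
	for (i = 0; i < made; i++) {
		close(dst[i][0]);
		close(dst[i][1]);
	}
out_src:
	close(src[0]);
	close(src[1]);
	free(page);
	return ret;
}
```

Calling fan_out_one_page(out) with a char out[NR_DEST] array should leave the
first NR_DEST bytes of the page in out[], one per destination pipe.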

Now, imagine that a network filesystem writes a message into a TCP socket,
where that message corresponds to an RPC call request and includes a number of
kernel buffers whose refcounts the network layer isn't permitted to touch and
for which a destructor must be called instead.  The request message may
transit through the loopback driver and get placed on the Rx queue of another
TCP socket, from whence it may be spliced off into a pipe.
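
One way to picture "share the cleanup information" is a small record that all
the pieces of a span point at, with the destructor run when the last piece
goes away.  The following is a userspace model under that assumption, not a
proposal for the kernel data structures: struct cleanup, struct fragment and
the fragment_*() helpers are hypothetical names, and the counter is not atomic,
so it is single-threaded only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct cleanup {
	unsigned int users;		/* fragments still using the span */
	void (*destructor)(void *priv);	/* run once, when users hits zero */
	void *priv;
};

struct fragment {
	struct cleanup *owner;	/* shared cleanup record */
	size_t off, len;	/* the part of the span this fragment covers */
};

/* Allocate a cleanup record with no users yet; returns NULL on failure. */
static struct cleanup *cleanup_new(void (*dtor)(void *), void *priv)
{
	struct cleanup *c = malloc(sizeof(*c));

	if (!c)
		return NULL;
	c->users = 0;
	c->destructor = dtor;
	c->priv = priv;
	return c;
}

/* Attach a fragment covering [off, off + len) to the shared record. */
static void fragment_get(struct fragment *f, struct cleanup *c,
			 size_t off, size_t len)
{
	c->users++;
	f->owner = c;
	f->off = off;
	f->len = len;
}

/* Drop a fragment; the last one to go calls the destructor. */
static void fragment_put(struct fragment *f)
{
	struct cleanup *c = f->owner;

	f->owner = NULL;
	if (--c->users == 0) {
		c->destructor(c->priv);
		free(c);
	}
}

/* Example destructor: counts how many times it has been called. */
static void count_call(void *priv)
{
	++*(int *)priv;
}
```

In this model the destructor fires exactly once no matter how finely the span
has been split, which is the property the 4096-pipe case needs.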

Alternatively, if virtual I/O is involved, this message may get passed down to
a layer outside of the system (though I don't think this is, in principle, any
different from DMA being done by a NIC).

And then there's relayfs and fuse, which seem to do weird stuff.

For the splicing of a looped-back kernel message out of a TCP socket, it might
make sense just to copy the message at that point.  The problem is that the
kernel doesn't know what's going to happen to it next.

> In general splice unlike direct I/O relies on page reference counts inside
> the splice machinery.  But that is configurable through the
> pipe_buf_operations.  So if you want something to be handled by splice that
> does not use simple page refcounts you need special pipe_buf_operations for
> it.  And you'd better have a really good use case for this to be worthwhile.

Yes.  vmsplice is the equivalent of direct I/O and should really do the same
pinning thing that, say, write() to an O_DIRECT file does.

David




