On Tue, Jun 10, 2025 at 03:36:49AM +0300, Sergey Bashirov wrote:
> On Wed, Jun 04, 2025 at 07:54:31AM -0700, Christoph Hellwig wrote:
> > On Wed, Jun 04, 2025 at 04:07:08PM +0300, Sergey Bashirov wrote:
> > > When a pNFS client in block layout mode sends a layoutcommit RPC to
> > > the MDS, a variable-length array of modified extents is supplied
> > > within the request. This patch allows the NFS server to accept such
> > > extent arrays if they do not fit within a single memory page.
> >
> > How did you manage to trigger such a workload?
>
> Together with Konstantin, we spent a lot of time enabling the pNFS block
> volume setup. We have an SDS that can attach virtual block devices via
> vhost-user-blk to virtual machines, and we researched ways to create a
> parallel or distributed file system on top of this SDS. From this point
> of view, the pNFS block volume layout architecture looks quite suitable.
> So we created several VMs, configured pNFS, and started testing. In
> fact, during our extensive testing we encountered a variety of issues,
> including deadlocks, livelocks, and corrupted files, which we eventually
> fixed. Now we have a working setup, and we would like to clean up the
> code and contribute it.

Can you share your reproducer scripts for the client and server?

Btw, also a little warning: the current pNFS code means any client can
corrupt the XFS metadata. If you want to actually use the code in
production, you'll probably want to figure out a way to either use the RT
device for exposed data (should be easy, but the RT allocator sucks..),
or find a way to otherwise restrict clients from overwriting metadata.

> > Also, your patch doesn't apply to current mainline.
>
> We will use the git repository link for NFSD from the MAINTAINERS file
> and the "nfsd-next" branch to work on top of it. Please let me know if
> this is the wrong repository or branch in our case.

I guess that's generally fine, but around a -rc1 release it's a bit
cumbersome.
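For reference, the core of the paging issue can be modeled in userspace:
an XDR item that straddles a page boundary cannot be read through a
single contiguous pointer, so the decoder has to gather bytes from the
page fragments first. A minimal sketch under stated assumptions — this
is not the nfsd code; the tiny PAGE_SZ and the helper names are
illustrative only:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical tiny page size so a 4-byte word can span two pages. */
#define PAGE_SZ 16

/*
 * Copy "len" bytes starting at logical offset "off" out of an array of
 * page-sized fragments into a contiguous scratch buffer. This mirrors
 * what a streaming decoder must do when an item crosses a page boundary
 * instead of assuming the whole request buffer is contiguous.
 */
static void read_across_pages(uint8_t pages[][PAGE_SZ], size_t off,
                              size_t len, uint8_t *out)
{
    while (len > 0) {
        size_t pg = off / PAGE_SZ;
        size_t in_pg = off % PAGE_SZ;
        size_t chunk = PAGE_SZ - in_pg;

        if (chunk > len)
            chunk = len;
        memcpy(out, &pages[pg][in_pg], chunk);
        out += chunk;
        off += chunk;
        len -= chunk;
    }
}

/* Decode a big-endian (XDR) 32-bit word that may cross a page boundary. */
static uint32_t decode_be32(uint8_t pages[][PAGE_SZ], size_t off)
{
    uint8_t b[4];

    read_across_pages(pages, off, 4, b);
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8) | b[3];
}
```

In the kernel the same gathering is what the xdr_stream helpers provide
over an xdr_buf's page vector, which is why a page-crossing extent array
needs streaming decode rather than a single inline pointer.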
> As for the sub-buffer, the xdr_buf structure is initialized in the core
> nfsd code to point only to the "opaque" field of the "layoutupdate4"
> structure. Since this field is specific to each layout driver, its
> xdr_stream is created on demand inside the field handler. For example,
> the "opaque" field is not used in the file layouts. Do we really need
> to expose the xdr_stream outside the field handler? Probably not. I
> also checked how this is implemented in the NFS client code and found
> that the xdr_stream is created in a similar way inside the layout
> driver. Below I have outlined some thoughts on why it is implemented
> this way. Please correct me if I missed anything.

Well, the fields are opaque, but everyone has to either decode or ignore
them. So having common setup sounds useful.

> 1. Allocate a large enough memory buffer and copy the "opaque" field
>    into it. But I think an extra copy of a large field is not what we
>    prefer.

Agreed.

> 2. When an RPC is received, nfsd_dispatch() first decodes the entire
>    compound request and only then processes each operation. Yes, we can
>    create a new callback in the layout driver interface to decode the
>    "opaque" field during the decoding phase and use the actual xdr
>    stream of the request. What I don't like here is that the layout
>    driver is forced to parse a large data buffer before the general
>    checks are done (sequence ID, file handle, state ID, range, grace
>    period, etc.). This opens up opportunities to abuse the server by
>    sending invalid layout commits with the maximum possible number of
>    extents (an RPC can be up to 1MB).

OTOH the same happens for parsing any other NFS compound that isn't
split into layouts, doesn't it? And we have total size limits on the
transfer.
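On the abuse concern: whichever phase does the decode, the claimed
extent count can be sanity-checked against the bytes actually remaining
in the stream before anything is allocated or parsed. A hedged sketch —
the helper name and the wire size below are illustrative assumptions,
not the nfsd API (the per-extent size shown matches the RFC 5663 block
layout commit extent: 16-byte deviceid, three 8-byte offsets/lengths,
and a 4-byte state):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative on-the-wire size of one block-layout commit extent. */
#define EXT_WIRE_SIZE (16 + 8 + 8 + 8 + 4)

/*
 * Reject a claimed extent count that cannot possibly fit in the bytes
 * still available in the request, so a bogus count never drives a large
 * allocation or a long parse. The 64-bit multiply avoids overflow for
 * counts near UINT32_MAX.
 */
static bool extent_count_sane(uint32_t count, size_t bytes_remaining)
{
    if (count == 0)
        return false;
    return (uint64_t)count * EXT_WIRE_SIZE <= (uint64_t)bytes_remaining;
}
```

With a check like this, a malicious 1MB compound still costs at most one
bounded pass over data the server has already accepted, which is in line
with the observation that every other compound op is parsed up front too.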