Re: [PATCH] nfsd: Implement large extent array support in pNFS

Sergey Bashirov <sergeybashirov@xxxxxxxxx> · Tue, 10 Jun 2025 03:36:49 +0300

On Wed, Jun 04, 2025 at 10:10:15AM -0400, Chuck Lever wrote:
Can you repost your patch with the current reviewers and maintainers
copied as appropriate?

Sorry for the newbie mistake, this is the first patch I'm sending
upstream. Yes, I will send patch v2 with the right reviewers and
maintainers.

On Wed, Jun 04, 2025 at 07:54:31AM -0700, Christoph Hellwig wrote:
On Wed, Jun 04, 2025 at 04:07:08PM +0300, Sergey Bashirov wrote:
> When pNFS client in block layout mode sends layoutcommit RPC to MDS,
> a variable length array of modified extents is supplied within request.
> This patch allows NFS server to accept such extent arrays if they do not
> fit within single memory page.

How did you manage to trigger such a workload?

Together with Konstantin we spent a lot of time enabling the pNFS block
volume setup. We have SDS that can attach virtual block devices via
vhost-user-blk to virtual machines. And we researched the way to create
parallel or distributed file system on top of this SDS. From this point
of view, pNFS block volume layout architecture looks quite suitable. So,
we created several VMs, configured pNFS and started testing. In fact,
during our extensive testing, we encountered a variety of issues including
deadlocks, livelocks, and corrupted files, which we eventually fixed.
Now we have a working setup and we would like to clean up the code and
contribute it.

Also you patch doesn't apply to current mainline.

We will use the git repository link for NFSD from the MAINTAINERS file and
"nfsd-next" branch to work on top of it. Please let me know if this is the
wrong repository or branch in our case.

> +++ b/fs/nfsd/blocklayoutxdr.c
> @@ -103,11 +103,13 @@ nfsd4_block_encode_getdeviceinfo(struct xdr_stream *xdr,
>  }
>
>  int
> -nfsd4_block_decode_layoutupdate(__be32 *p, u32 len, struct iomap **iomapp,
> -		u32 block_size)
> +nfsd4_block_decode_layoutupdate(struct xdr_buf *buf, u32 len,
> +				struct iomap **iomapp, u32 block_size)
>  {

So if you pass the xdr_buf that already has the length, can't we drop
the len argument?

Thanks for the suggestions! I will rework it in patch v2 to get rid of
the len argument and the lc_up_len field.

> +	struct xdr_stream xdr;

Also I'm not the expert on the xdr_stream API, but instead of setting
up a sub-buffer here shouldn't we reuse that from the actual XDR
decoding stage, and if that can't work (Chuck is probably a better
source for know how on that then me), at least initialize it in the
main nfsd layoutcommit handler rather than in the layout drivers.
me we'd probably want that initialized in the core nfsd code and not
the layout driver.

As for the sub-buffer, the xdr_buf structure is initialized in the core
nfsd code to point only to the "opaque" field of the "layoutupdate4"
structure. Since this field is specific to each layout driver, its
xdr_stream is created on demand inside the field handler. For example,
the "opaque" field is not used in the file layouts. Do we really need to
expose the xdr_stream outside the field handler? Probably not. I also
checked how this is implemented in the nfs client code and found that
xdr_stream is created in a similar way inside the layout driver. Below
I have outlined some thoughts on why implemented it this way. Please
correct me if I missed anything.

Let's say we have a layout commit with 1000 extents (in practice we
observed much more). The size of each block extent structure is 44 bytes,
and we have another 4 bytes for the total number of extents. So, the
"opaque" field of the "layoutupdate4" structure has a size of 44004 bytes.
In this case, we cannot simply borrow a pointer to the xdr internal page
or the scratch buffer using xdr_inline_decode(). What options do we have?

1. Allocate a large enough memory buffer and copy the "opaque" field into
   it. But I think an extra copy of a large field is not what we prefer.

2. When RPC is received, nfsd_dispatch() first decodes the entire compound
   request and only then processes each operation. Yes, we can create a new
   callback in the layout driver interface to decode the "opaque" field
   during the decoding phase and use the actual xdr stream of the request.
   What I don't like here is that the layout driver is forced to parse a
   large data buffer before general checks are done (sequence ID, file
   handler, state ID, range, grace period, etc.). This opens up
   opportunities to abuse the server by sending invalid layout commits with
   the maximum possible number of extents (RPC can be up to 1MB).

3. It is also possible to store only the position of the "opaque" field in
   the actual xdr stream of the request during the decoding phase and defer
   large data buffer parsing until the processing phase. This is what the
   original code does, and this patch takes it in the same direction.

--
Sergey Bashirov