On Wed, Jun 04, 2025 at 10:10:15AM -0400, Chuck Lever wrote:
Can you repost your patch with the current reviewers and maintainers copied as appropriate?
Sorry for the newbie mistake, this is the first patch I'm sending upstream. Yes, I will send patch v2 with the right reviewers and maintainers. On Wed, Jun 04, 2025 at 07:54:31AM -0700, Christoph Hellwig wrote:
On Wed, Jun 04, 2025 at 04:07:08PM +0300, Sergey Bashirov wrote: > When pNFS client in block layout mode sends layoutcommit RPC to MDS, > a variable length array of modified extents is supplied within request. > This patch allows NFS server to accept such extent arrays if they do not > fit within single memory page. How did you manage to trigger such a workload?
Together with Konstantin we spent a lot of time enabling the pNFS block volume setup. We have SDS that can attach virtual block devices via vhost-user-blk to virtual machines. And we researched the way to create parallel or distributed file system on top of this SDS. From this point of view, pNFS block volume layout architecture looks quite suitable. So, we created several VMs, configured pNFS and started testing. In fact, during our extensive testing, we encountered a variety of issues including deadlocks, livelocks, and corrupted files, which we eventually fixed. Now we have a working setup and we would like to clean up the code and contribute it.
Also you patch doesn't apply to current mainline.
We will use the git repository link for NFSD from the MAINTAINERS file and "nfsd-next" branch to work on top of it. Please let me know if this is the wrong repository or branch in our case.
> +++ b/fs/nfsd/blocklayoutxdr.c > @@ -103,11 +103,13 @@ nfsd4_block_encode_getdeviceinfo(struct xdr_stream *xdr, > } > > int > -nfsd4_block_decode_layoutupdate(__be32 *p, u32 len, struct iomap **iomapp, > - u32 block_size) > +nfsd4_block_decode_layoutupdate(struct xdr_buf *buf, u32 len, > + struct iomap **iomapp, u32 block_size) > { So if you pass the xdr_buf that already has the length, can't we drop the len argument?
Thanks for the suggestions! I will rework it in patch v2 to get rid of the len argument and the lc_up_len field.
> + struct xdr_stream xdr; Also I'm not the expert on the xdr_stream API, but instead of setting up a sub-buffer here shouldn't we reuse that from the actual XDR decoding stage, and if that can't work (Chuck is probably a better source for know how on that then me), at least initialize it in the main nfsd layoutcommit handler rather than in the layout drivers. me we'd probably want that initialized in the core nfsd code and not the layout driver.
As for the sub-buffer, the xdr_buf structure is initialized in the core nfsd code to point only to the "opaque" field of the "layoutupdate4" structure. Since this field is specific to each layout driver, its xdr_stream is created on demand inside the field handler. For example, the "opaque" field is not used in the file layouts. Do we really need to expose the xdr_stream outside the field handler? Probably not. I also checked how this is implemented in the nfs client code and found that xdr_stream is created in a similar way inside the layout driver. Below I have outlined some thoughts on why implemented it this way. Please correct me if I missed anything. Let's say we have a layout commit with 1000 extents (in practice we observed much more). The size of each block extent structure is 44 bytes, and we have another 4 bytes for the total number of extents. So, the "opaque" field of the "layoutupdate4" structure has a size of 44004 bytes. In this case, we cannot simply borrow a pointer to the xdr internal page or the scratch buffer using xdr_inline_decode(). What options do we have? 1. Allocate a large enough memory buffer and copy the "opaque" field into it. But I think an extra copy of a large field is not what we prefer. 2. When RPC is received, nfsd_dispatch() first decodes the entire compound request and only then processes each operation. Yes, we can create a new callback in the layout driver interface to decode the "opaque" field during the decoding phase and use the actual xdr stream of the request. What I don't like here is that the layout driver is forced to parse a large data buffer before general checks are done (sequence ID, file handler, state ID, range, grace period, etc.). This opens up opportunities to abuse the server by sending invalid layout commits with the maximum possible number of extents (RPC can be up to 1MB). 3. It is also possible to store only the position of the "opaque" field in the actual xdr stream of the request during the decoding phase and defer large data buffer parsing until the processing phase. This is what the original code does, and this patch takes it in the same direction. -- Sergey Bashirov