On Mon, Jun 09, 2025 at 10:39:06PM -0700, Christoph Hellwig wrote:
On Tue, Jun 10, 2025 at 03:36:49AM +0300, Sergey Bashirov wrote:
> Together with Konstantin we spent a lot of time enabling the pNFS block
> volume setup. We have an SDS that can attach virtual block devices via
> vhost-user-blk to virtual machines, and we researched how to build a
> parallel or distributed file system on top of it. From this point of
> view, the pNFS block volume layout architecture looks quite suitable.
> So we created several VMs, configured pNFS and started testing. In fact,
> during our extensive testing we encountered a variety of issues,
> including deadlocks, livelocks, and corrupted files, which we eventually
> fixed. Now we have a working setup and we would like to clean up the
> code and contribute it.

Can you share your reproducer scripts for client and server?
I will try. First of all, you need two VMs connected to the same network.
The hardest part is to somehow attach a shared block device to both VMs
with read-write access; as I mentioned, we use a proprietary SDS for that.
Note that if you pass the device to the VM as virtio-blk, the client
selects the block layout driver, and if you pass it as virtio-scsi, the
client selects the SCSI layout driver.

The script I use to start the VMs (both use the same parameters, only the
names, boot images and network addresses differ):

#!/bin/sh
BOOT_DISK="-drive format=qcow2,file=pnfs_server.img"
SSH_NET="-device virtio-net-pci,netdev=net0,mac=52:54:00:12:34:57 -netdev user,id=net0,hostfwd=tcp::20001-:22"
NFS_NET="-device e1000,netdev=net1,mac=52:54:00:12:34:67 -netdev socket,id=net1,listen=:1234"
MP="/dev/hugepages/libvirt/qemu"
VHOST_DISK="-object memory-backend-file,id=mem,size=8G,mem-path=$MP,share=on -numa node,memdev=mem -chardev socket,id=char1,path=/var/lib/storage/pnfs_server.sock -device vhost-user-blk-pci,id=blk1,chardev=char1,num-queues=16"

qemu-system-x86_64 -daemonize -display none -name pnfs_server \
    -cpu host -enable-kvm -smp 8 -m 8G \
    $BOOT_DISK $SSH_NET $VHOST_DISK $NFS_NET

The server's /etc/nfs.conf:

[nfsd]
grace-time=90
lease-time=120
vers2=n
vers3=y
vers4=n
vers4.0=n
vers4.1=n
vers4.2=y

The server's /etc/exports:

/mnt/export *(pnfs,rw,sync,insecure,no_root_squash,no_subtree_check)

Please note that the block volume layout does not support partition
tables, volume groups, RAID, etc., so you need to create XFS directly on
the raw block device. In my case the shared virtio-blk disk is /dev/vda,
and the file system can be prepared with the following steps:

sudo mkfs -t xfs /dev/vda
sudo mkdir -p /mnt/export
sudo mount -t xfs /dev/vda /mnt/export

After these steps you can start or restart the server:

sudo systemctl restart nfs-kernel-server

On the client side, you need the same /dev/vda device to be available,
but not mounted. Additionally, you need the blkmapd service running.
Perform the following steps to mount the share:

sudo systemctl start nfs-blkmap
sudo mkdir -p /mnt/pnfs
sudo mount -t nfs4 -v \
    -o minorversion=2,sync,hard,noatime,rsize=1048576,wsize=1048576,timeo=600,retrans=2 \
    192.168.1.1:/mnt/export /mnt/pnfs

This should create about 2.5k extents:

fio --name=test --filename=/mnt/pnfs/test.raw --size=10M \
    --rw=randwrite --ioengine=libaio --direct=1 --bs=4k \
    --iodepth=128 --fallocate=none

You can check it on the server side:

xfs_bmap -elpvv /mnt/export/test.raw

Troubleshooting: if any error occurs, the kernel falls back to NFSv3.
Use nfsstat or mountstats to view the RPC counters (example commands at
the end of this message). Normally the READ and WRITE counters should
stay at zero, while LAYOUTGET, LAYOUTCOMMIT and LAYOUTRETURN should
increase as you work with files. If the network connection and the
shared volume are working fine, then first check the status of blkmapd;
most probably it is the culprit.

Note that the client code also has problems with the block extent array.
Currently the client tries to pack all the block extents it needs to
commit into a single RPC, and if there are too many of them, you will
see an "RPC: fragment too large" error on the server side. That is why
we set rsize and wsize to 1M for now. Another problem is that when the
extent array does not fit into a single memory page, the client code
discards the first page of encoded extents while reallocating a larger
buffer to continue layoutcommit encoding. So even with this patch you
may still notice that some files are not written correctly.
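For the counters mentioned above, something like this should be enough
to eyeball them on the client (a rough sketch; option spellings and the
exact output format depend on your nfs-utils version):

# Per-operation NFSv4 client counters; READ/WRITE should stay near zero
# while the LAYOUT* operations grow as files are written.
nfsstat -c -4

# Per-mount statistics for the pNFS mount point.
mountstats /mnt/pnfs | grep -A1 -E 'READ:|WRITE:|LAYOUT'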
But at least the server shouldn't send the badxdr error on a well-formed layout commit.
Btw, also as a little warning: the current pNFS code means any client
can corrupt the XFS metadata. If you want to actually use the code in
production, you'll probably want to either use the RT device for the
exposed data (should be easy, but the RT allocator sucks..) or find some
other way to restrict clients from overwriting metadata.
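Just to illustrate the on-disk split (an untested sketch; the device
names are placeholders, and the current block layout export code would
still need work to actually hand out RT extents): metadata stays on a
device only the server can reach, while file data goes to the shared
device configured as the XFS realtime volume:

# /dev/local-disk:  visible to the MDS only (XFS metadata and log live here)
# /dev/shared-disk: the shared device that pNFS clients write to
mkfs.xfs -d rtinherit=1 -r rtdev=/dev/shared-disk /dev/local-disk
mount -t xfs -o rtdev=/dev/shared-disk /dev/local-disk /mnt/export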
Thanks for the advice! Yes, we have had issues with XFS corruption,
especially when multiple clients were writing to the same file in
parallel. We spent some time debugging layout recalls and client fencing
to figure out what happened.
> As for the sub-buffer, the xdr_buf structure is initialized in the core
> nfsd code to point only to the "opaque" field of the "layoutupdate4"
> structure. Since this field is specific to each layout driver, its
> xdr_stream is created on demand inside the field handler. For example,
> the "opaque" field is not used in the file layouts. Do we really need
> to expose the xdr_stream outside the field handler? Probably not. I
> also checked how this is implemented in the nfs client code and found
> that the xdr_stream is created in a similar way inside the layout
> driver. Below I have outlined some thoughts on why it is implemented
> this way. Please correct me if I missed anything.

Well, the fields are opaque, but everyone has to either decode or ignore
them. So having common setup code sounds useful.

> 2. When an RPC is received, nfsd_dispatch() first decodes the entire
> compound request and only then processes each operation. Yes, we can
> create a new callback in the layout driver interface to decode the
> "opaque" field during the decoding phase and use the actual xdr stream
> of the request. What I don't like here is that the layout driver is
> forced to parse a large data buffer before the general checks are done
> (sequence ID, file handle, state ID, range, grace period, etc.). This
> opens up opportunities to abuse the server by sending invalid layout
> commits with the maximum possible number of extents (an RPC can be up
> to 1MB).

OTOH the same happens when parsing any other NFS compound that isn't
split into layouts, doesn't it? And we have total size limits on the
transfer.
I agree, one large request and 1000 small requests look the same on the
wire.

Setting up an xdr_stream at a higher level requires adding it to the
nfsd4_layoutcommit structure: either as a substructure, which would
significantly increase the overall size of the layout commit argument,
or as a pointer, which would require some memory allocation and
deallocation logic. Also, in the core nfsd code we don't know the
sufficient scratch buffer size for a particular layout driver, so most
likely we would have to allocate a page for it. This all seems a bit
overengineered compared to two local variables on the stack. I will
think about it a little more.

--
Sergey Bashirov