On Mon, Jun 09, 2025 at 10:39:06PM -0700, Christoph Hellwig wrote:
On Tue, Jun 10, 2025 at 03:36:49AM +0300, Sergey Bashirov wrote:
> Together with Konstantin we spent a lot of time enabling the pNFS block
> volume setup. We have an SDS that can attach virtual block devices via
> vhost-user-blk to virtual machines, and we researched how to build a
> parallel or distributed file system on top of it. From this point of
> view, the pNFS block volume layout architecture looks quite suitable.
> So we created several VMs, configured pNFS and started testing. In fact,
> during our extensive testing we encountered a variety of issues,
> including deadlocks, livelocks, and corrupted files, which we eventually
> fixed. Now we have a working setup and we would like to clean up the
> code and contribute it.

Can you share your reproducer scripts for client and server?
I will try. First of all, you need two VMs connected to the same network.
The hardest part is to somehow attach a shared block device to both VMs
with read-write access; as I mentioned, we use a proprietary SDS for that.
Note that if you pass the device to the VM as virtio-blk, the client
selects the block layout driver, and if you pass it as virtio-scsi, the
client selects the SCSI layout driver.

The script I use to start the VMs (both use the same parameters, only the
names, boot images and network addresses differ):

#!/bin/sh
BOOT_DISK="-drive format=qcow2,file=pnfs_server.img"
SSH_NET="-device virtio-net-pci,netdev=net0,mac=52:54:00:12:34:57 -netdev user,id=net0,hostfwd=tcp::20001-:22"
NFS_NET="-device e1000,netdev=net1,mac=52:54:00:12:34:67 -netdev socket,id=net1,listen=:1234"
MP="/dev/hugepages/libvirt/qemu"
VHOST_DISK="-object memory-backend-file,id=mem,size=8G,mem-path=$MP,share=on -numa node,memdev=mem -chardev socket,id=char1,path=/var/lib/storage/pnfs_server.sock -device vhost-user-blk-pci,id=blk1,chardev=char1,num-queues=16"

qemu-system-x86_64 -daemonize -display none -name pnfs_server \
    -cpu host -enable-kvm -smp 8 -m 8G \
    $BOOT_DISK $SSH_NET $VHOST_DISK $NFS_NET

The server's /etc/nfs.conf:

[nfsd]
grace-time=90
lease-time=120
vers2=n
vers3=y
vers4=n
vers4.0=n
vers4.1=n
vers4.2=y

The server's /etc/exports:

/mnt/export *(pnfs,rw,sync,insecure,no_root_squash,no_subtree_check)

Please note that the block volume layout does not support partition
tables, volume groups, RAID, etc., so you need to create XFS directly on
the raw block device. In my case the shared virtio-blk disk is /dev/vda,
and the file system can be prepared with the following steps:

sudo mkfs -t xfs /dev/vda
sudo mkdir -p /mnt/export
sudo mount -t xfs /dev/vda /mnt/export

After these steps you can start or restart the server:

sudo systemctl restart nfs-kernel-server

On the client side, you need the same /dev/vda device to be available,
but not mounted. Additionally, you need the blkmapd service running.
Perform the following steps to mount the share:

sudo systemctl start nfs-blkmap
sudo mkdir -p /mnt/pnfs
sudo mount -t nfs4 -v \
    -o minorversion=2,sync,hard,noatime,rsize=1048576,wsize=1048576,timeo=600,retrans=2 \
    192.168.1.1:/mnt/export /mnt/pnfs

This should create about 2.5k extents:

fio --name=test --filename=/mnt/pnfs/test.raw --size=10M \
    --rw=randwrite --ioengine=libaio --direct=1 --bs=4k \
    --iodepth=128 --fallocate=none

You can check it on the server side:

xfs_bmap -elpvv /mnt/export/test.raw

Troubleshooting: if any error occurs, the kernel falls back to NFSv3.
Use nfsstat or mountstats to view the RPC counters (example commands at
the end of this message). Normally the READ and WRITE counters should
stay at zero, while LAYOUTGET, LAYOUTCOMMIT and LAYOUTRETURN should
increase as you work with files. If the network connection and the
shared volume are working fine, then first check the status of blkmapd;
most probably it is the culprit.

Note that the client code also has problems with the block extent array.
Currently the client tries to pack all the block extents it needs to
commit into a single RPC, and if there are too many of them, you will
see an "RPC: fragment too large" error on the server side. That is why
we set rsize and wsize to 1M for now. Another problem is that when the
extent array does not fit into a single memory page, the client code
discards the first page of encoded extents while reallocating a larger
buffer to continue layoutcommit encoding. So even with this patch you
may still notice that some files are not written correctly.
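For the counters mentioned above, something like this should be enough
to eyeball them on the client (a rough sketch; option spellings and the
exact output format depend on your nfs-utils version):

# Per-operation NFSv4 client counters; READ/WRITE should stay near zero
# while the LAYOUT* operations grow as files are written.
nfsstat -c -4

# Per-mount statistics for the pNFS mount point.
mountstats /mnt/pnfs | grep -A1 -E 'READ:|WRITE:|LAYOUT'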
But at least the server shouldn't send the badxdr error on a well-formed layout commit.
Btw, also as a little warning: the current pNFS code means any client
can corrupt the XFS metadata. If you want to actually use the code in
production, you'll probably want to either use the RT device for the
exposed data (should be easy, but the RT allocator sucks..) or find some
other way to restrict clients from overwriting metadata.
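Just to illustrate the on-disk split (an untested sketch; the device
names are placeholders, and the current block layout export code would
still need work to actually hand out RT extents): metadata stays on a
device only the server can reach, while file data goes to the shared
device configured as the XFS realtime volume:

# /dev/local-disk:  visible to the MDS only (XFS metadata and log live here)
# /dev/shared-disk: the shared device that pNFS clients write to
mkfs.xfs -d rtinherit=1 -r rtdev=/dev/shared-disk /dev/local-disk
mount -t xfs -o rtdev=/dev/shared-disk /dev/local-disk /mnt/export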
Thanks for the advice! Yes, we have had issues with XFS corruption,
especially when multiple clients were writing to the same file in
parallel. We spent some time debugging layout recalls and client fencing
to figure out what happened.
> As for the sub-buffer, the xdr_buf structure is initialized in the core
> nfsd code to point only to the "opaque" field of the "layoutupdate4"
> structure. Since this field is specific to each layout driver, its
> xdr_stream is created on demand inside the field handler. For example,
> the "opaque" field is not used in the file layouts. Do we really need
> to expose the xdr_stream outside the field handler? Probably not. I
> also checked how this is implemented in the nfs client code and found
> that the xdr_stream is created in a similar way inside the layout
> driver. Below I have outlined some thoughts on why it is implemented
> this way. Please correct me if I missed anything.

Well, the fields are opaque, but everyone has to either decode or ignore
them. So having common setup code sounds useful.

> 2. When an RPC is received, nfsd_dispatch() first decodes the entire
> compound request and only then processes each operation. Yes, we can
> create a new callback in the layout driver interface to decode the
> "opaque" field during the decoding phase and use the actual xdr stream
> of the request. What I don't like here is that the layout driver is
> forced to parse a large data buffer before the general checks are done
> (sequence ID, file handle, state ID, range, grace period, etc.). This
> opens up opportunities to abuse the server by sending invalid layout
> commits with the maximum possible number of extents (an RPC can be up
> to 1MB).

OTOH the same happens when parsing any other NFS compound that isn't
split into layouts, doesn't it? And we have total size limits on the
transfer.
I agree, one large request and 1000 small requests look the same on the
wire.

Setting up an xdr_stream at a higher level requires adding it to the
nfsd4_layoutcommit structure: either as a substructure, which would
significantly increase the overall size of the layout commit argument,
or as a pointer, which would require some memory allocation and
deallocation logic. Also, in the core nfsd code we don't know the
sufficient scratch buffer size for a particular layout driver, so most
likely we would have to allocate a page for it. This all seems a bit
overengineered compared to two local variables on the stack. I will
think about it a little more.

--
Sergey Bashirov