Re: [PATCH] nfsd: Implement large extent array support in pNFS

Christoph Hellwig <hch@xxxxxxxxxxxxx> · Tue, 10 Jun 2025 23:55:09 -0700

On Tue, Jun 10, 2025 at 06:24:03PM +0300, Sergey Bashirov wrote:
> On Mon, Jun 09, 2025 at 10:39:06PM -0700, Christoph Hellwig wrote:
> > On Tue, Jun 10, 2025 at 03:36:49AM +0300, Sergey Bashirov wrote:
> > > Together with Konstantin we spent a lot of time enabling the pNFS block
> > > volume setup. We have SDS that can attach virtual block devices via
> > > vhost-user-blk to virtual machines. And we researched the way to create
> > > parallel or distributed file system on top of this SDS. From this point
> > > of view, pNFS block volume layout architecture looks quite suitable. So,
> > > we created several VMs, configured pNFS and started testing. In fact,
> > > during our extensive testing, we encountered a variety of issues including
> > > deadlocks, livelocks, and corrupted files, which we eventually fixed.
> > > Now we have a working setup and we would like to clean up the code and
> > > contribute it.
> > 
> > Can you share your reproducer scripts for client and server?
> 
> I will try. First of all, you need two VMs connected to the same network.
> The hardest part is somehow to connect a shared block device to both VMs
> with RW access.

I know the basic setup :)

> On the client side, you need to have the same /dev/vda device available,
> but not mounted. Additionally, you need the blkmapd service running.

blkmapd is only needed for the block layout, which should generally be
avoided as it can't be used reliably because working fencing is
almost impossible.

> This should create 2.5k extents:
> fio --name=test --filename=/mnt/pnfs/test.raw --size=10M \
>     --rw=randwrite --ioengine=libaio --direct=1 --bs=4k  \
>     --iodepth=128 --fallocate=none

Thanks!  We should find a way to wire up the test coverage
somewhere, e.g. xfstests.

> Troubleshooting. If any error occurs, then kernel falls back to NFSv3.

That should really be NFSv4.

> the client code also has problems with the block extent array. Currently
> the client tries to pack all the block extents it needs to commit into
> one RPC. And if there are too many of them, you will see
> "RPC: fragment too large" error on the server side. That's why
> we set rsize and wsize to 1M for now.

We'll really need to fix the client to split when going over the maximum
compound size.
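
To make the splitting concrete, here is a rough userspace sketch of the
arithmetic only, not the actual client code. It assumes one encoded
pnfs_block_extent4 takes 44 bytes (16-byte device ID, three 64-bit fields,
32-bit state, per RFC 5663) plus a 4-byte XDR array count per request. The
helper names and the max_payload parameter are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed encoded size of one pnfs_block_extent4 (RFC 5663):
 * 16-byte device ID + three 64-bit fields + 32-bit state. */
#define EXTENT_XDR_SIZE 44u

/* Hypothetical helper: number of extents that fit in one request whose
 * extent-array payload may not exceed max_payload bytes. The extra
 * 4 bytes account for the XDR array count. */
static size_t extents_per_rpc(size_t max_payload)
{
	assert(max_payload >= 4 + EXTENT_XDR_SIZE);
	return (max_payload - 4) / EXTENT_XDR_SIZE;
}

/* Hypothetical helper: number of requests needed to commit nr_extents,
 * rounding up so a partial last chunk still gets its own request. */
static size_t rpcs_needed(size_t nr_extents, size_t max_payload)
{
	size_t per_rpc = extents_per_rpc(max_payload);

	return (nr_extents + per_rpc - 1) / per_rpc;
}
```

With these assumptions, the roughly 2560 extents from the fio run above
(10M / 4k) and a 4096-byte payload cap would need rpcs_needed(2560, 4096)
requests of at most 93 extents each. The real limit would have to come from
the negotiated session, which this sketch does not model.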

> Another problem is that when the
> extent array does not fit into a single memory page, the client code
> discards the first page of encoded extents while reallocating a larger
> buffer to continue layout commit encoding. So even with this patch you
> may still notice that some files are not written correctly. But at least
> the server shouldn't send the badxdr error on a well-formed layout commit.

Eww, we'll need to fix that as well.  Would be good to have a reproducer
for that case as well.
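
For reference, the property the fix needs is simply that growing the
buffer keeps what was already encoded. A minimal userspace sketch of that,
with a made-up struct enc_buf and enc_buf_grow() that are not the client's
actual data structures:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical encode buffer, for illustration only. */
struct enc_buf {
	unsigned char *data;
	size_t used;	/* bytes already encoded */
	size_t cap;	/* bytes allocated */
};

/* Grow buf to at least new_cap bytes while preserving the bytes that
 * were already encoded. Returns 0 on success, -1 on allocation failure
 * (in which case buf is left untouched). */
static int enc_buf_grow(struct enc_buf *buf, size_t new_cap)
{
	unsigned char *p;

	assert(buf->used <= buf->cap);
	if (new_cap <= buf->cap)
		return 0;

	p = malloc(new_cap);
	if (!p)
		return -1;

	/* Carry over everything encoded so far before dropping the old
	 * allocation. */
	if (buf->used)
		memcpy(p, buf->data, buf->used);
	free(buf->data);

	buf->data = p;
	buf->cap = new_cap;
	return 0;
}
```

A reproducer would then only need enough extents to force at least one
grow and a check that the first extents survive it.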

> > Btw, also as a little warning:  the current pNFS code means any client
> > can corrupt the XFS metadata.  If you want to actually use the code
> > in production you'll probably want to figure out a way to either use
> > the RT device for exposed data (should be easy, but the RT allocator
> > sucks..), or find a way to otherwise restrict clients from overwriting
> > metadata.
> 
> Thanks for the advice! Yes, we have had issues with XFS corruption
> especially when multiple clients were writing to the same file in
> parallel. Spent some time debugging layout recalls and client fencing
> to figure out what happened.

Normal operation should not cause that, what did you see there?

I mean a malicious client targeting metadata outside its layout.