On Tue, Jun 10, 2025 at 11:55:09PM -0700, Christoph Hellwig wrote:
> On Tue, Jun 10, 2025 at 06:24:03PM +0300, Sergey Bashirov wrote:
> > the client code also has problems with the block extent array. Currently
> > the client tries to pack all the block extents it needs to commit into
> > one RPC. And if there are too many of them, you will see an
> > "RPC: fragment too large" error on the server side. That's why
> > we set rsize and wsize to 1M for now.
>
> We'll really need to fix the client to split when going over the
> maximum compound size.
Yes, working on patches to send for review.
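Roughly, the idea is to cap the number of extents encoded per
LAYOUTCOMMIT and send the remainder in further RPCs. Below is an
untested userspace sketch of the arithmetic only; the sizes and the
numbers are placeholders, not the real XDR values or client code:

    #include <stdio.h>
    #include <stddef.h>

    /* Placeholder sizes, not the real XDR numbers. */
    #define MAX_COMPOUND_SZ   (1024 * 1024) /* server RPC fragment limit */
    #define COMPOUND_OVERHEAD 512           /* compound header + other ops */
    #define EXTENT_XDR_SZ     44            /* one encoded block extent */

    /* Send 'count' extents in batches that each fit into one compound. */
    static void commit_all(size_t count)
    {
            size_t batch = (MAX_COMPOUND_SZ - COMPOUND_OVERHEAD) /
                           EXTENT_XDR_SZ;

            while (count) {
                    size_t n = count < batch ? count : batch;

                    printf("LAYOUTCOMMIT with %zu extents\n", n);
                    count -= n;
            }
    }

    int main(void)
    {
            commit_all(100000); /* pretend we have 100000 dirty extents */
            return 0;
    }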
> > Another problem is that when the extent array does not fit into a
> > single memory page, the client code discards the first page of encoded
> > extents while reallocating a larger buffer to continue layout commit
> > encoding. So even with this patch you may still notice that some files
> > are not written correctly. But at least the server shouldn't send the
> > badxdr error on a well-formed layout commit.
>
> Eww, we'll need to fix that as well. Would be good to have a
> reproducer for that case as well.
Will prepare and send patches too. The reproducer should be the same
as the one I sent for the server. Just check what fio has written
using the additional option --verify=crc32c.
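For example, with illustrative job parameters (not the exact job from
the server reproducer), something like:

    fio --name=pnfs-verify --filename=/mnt/pnfs/testfile \
        --rw=write --bs=1M --size=4G --verify=crc32c

As for the reallocation bug itself, the pattern that has to be
preserved is roughly the following. This is an untested userspace
sketch, not the actual layoutcommit encoder:

    #include <stdlib.h>
    #include <string.h>

    /* Growing the encode buffer must keep the already-encoded
     * extents; the current bug amounts to losing the first page
     * of them on reallocation. */
    static char *grow_buffer(char *buf, size_t used, size_t newsize)
    {
            char *bigger = malloc(newsize);

            if (!bigger)
                    return NULL;
            memcpy(bigger, buf, used); /* keep what was encoded so far */
            free(buf);
            return bigger;
    }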
> > Thanks for the advice! Yes, we have had issues with XFS corruption,
> > especially when multiple clients were writing to the same file in
> > parallel. Spent some time debugging layout recalls and client fencing
> > to figure out what happened.
>
> Normal operation should not cause that, what did you see there?
I think this is not an NFS implementation issue, but rather a question
of how to properly implement client fencing. In a distributed storage
system there is a delay between the moment the NFS server requests that
writes to a shared volume be blocked for a particular client and the
moment that blocking takes effect. If we choose an optimistic approach
and assume that fencing is done simply by sending the request, without
waiting for the underlying storage system to actually process it, we
can end up in the following situation.

Think of layoutget as a byte-range locking mechanism for writes to a
single file from multiple clients. First of all, a client writing
through the page cache without O_DIRECT will delay processing of a
layout recall for too long if its user-space application really writes
a lot. As a consequence we observe significant performance degradation,
and sometimes the server decides that the client is not responding at
all and tries to fence it so that the second client can take the lock.
At this moment the first client is still writing, while the server has
already released the xfs_lease associated with the first client's
layout. If we are really unlucky, XFS might reallocate those extents,
so the first client will be writing outside the file. And if we are
really, really unlucky, XFS might put some metadata there as well.
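The ordering below is only a toy illustration of the difference; the
function names are hypothetical and do not correspond to real nfsd or
XFS code:

    #include <stdio.h>

    /* Hypothetical stubs, ordering illustration only. */
    static void fence_client(void)         { puts("ask storage to fence"); }
    static void wait_for_fence_ack(void)   { puts("storage confirms fence"); }
    static void release_layout_lease(void) { puts("lease gone, extents reusable"); }

    int main(void)
    {
            /* Optimistic (racy): the lease is dropped while the fenced
             * client may still have writes in flight. */
            fence_client();
            release_layout_lease();

            /* Pessimistic (safe): hold the lease until the fence has
             * actually taken effect on the storage side. */
            fence_client();
            wait_for_fence_ack();
            release_layout_lease();
            return 0;
    }

--
Sergey Bashirov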