On Tue, Jun 10, 2025 at 11:55:09PM -0700, Christoph Hellwig wrote:
> On Tue, Jun 10, 2025 at 06:24:03PM +0300, Sergey Bashirov wrote:
> > the client code also has problems with the block extent array. Currently
> > the client tries to pack all the block extents it needs to commit into
> > one RPC. And if there are too many of them, you will see an
> > "RPC: fragment too large" error on the server side. That's why
> > we set rsize and wsize to 1M for now.
>
> We'll really need to fix the client to split when going over the
> maximum compound size.
Yes, working on patches to send for review.
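Roughly, the idea is to cap the number of extents encoded per
LAYOUTCOMMIT and send the remainder in further RPCs. Below is an
untested userspace sketch of the arithmetic only; the sizes and the
numbers are placeholders, not the real XDR values or client code:

    #include <stdio.h>
    #include <stddef.h>

    /* Placeholder sizes, not the real XDR numbers. */
    #define MAX_COMPOUND_SZ   (1024 * 1024) /* server RPC fragment limit */
    #define COMPOUND_OVERHEAD 512           /* compound header + other ops */
    #define EXTENT_XDR_SZ     44            /* one encoded block extent */

    /* Send 'count' extents in batches that each fit into one compound. */
    static void commit_all(size_t count)
    {
            size_t batch = (MAX_COMPOUND_SZ - COMPOUND_OVERHEAD) /
                           EXTENT_XDR_SZ;

            while (count) {
                    size_t n = count < batch ? count : batch;

                    printf("LAYOUTCOMMIT with %zu extents\n", n);
                    count -= n;
            }
    }

    int main(void)
    {
            commit_all(100000); /* pretend we have 100000 dirty extents */
            return 0;
    }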
> > Another problem is that when the extent array does not fit into a
> > single memory page, the client code discards the first page of encoded
> > extents while reallocating a larger buffer to continue layout commit
> > encoding. So even with this patch you may still notice that some files
> > are not written correctly. But at least the server shouldn't send the
> > badxdr error on a well-formed layout commit.
>
> Eww, we'll need to fix that as well. Would be good to have a
> reproducer for that case as well.
Will prepare and send patches too. The reproducer should be the same
as the one I sent for the server. Just check what fio has written
using the additional option --verify=crc32c.
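For example, with illustrative job parameters (not the exact job from
the server reproducer), something like:

    fio --name=pnfs-verify --filename=/mnt/pnfs/testfile \
        --rw=write --bs=1M --size=4G --verify=crc32c

As for the reallocation bug itself, the pattern that has to be
preserved is roughly the following. This is an untested userspace
sketch, not the actual layoutcommit encoder:

    #include <stdlib.h>
    #include <string.h>

    /* Growing the encode buffer must keep the already-encoded
     * extents; the current bug amounts to losing the first page
     * of them on reallocation. */
    static char *grow_buffer(char *buf, size_t used, size_t newsize)
    {
            char *bigger = malloc(newsize);

            if (!bigger)
                    return NULL;
            memcpy(bigger, buf, used); /* keep what was encoded so far */
            free(buf);
            return bigger;
    }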
> > Thanks for the advice! Yes, we have had issues with XFS corruption,
> > especially when multiple clients were writing to the same file in
> > parallel. Spent some time debugging layout recalls and client fencing
> > to figure out what happened.
>
> Normal operation should not cause that, what did you see there?
I think this is not an NFS implementation issue, but rather a question
of how to properly implement client fencing. In a distributed storage
system there is a delay between the moment the NFS server requests that
writes to a shared volume be blocked for a particular client and the
moment that blocking takes effect. If we choose an optimistic approach
and assume that fencing is done simply by sending the request, without
waiting for the underlying storage system to actually process it, we
can end up in the following situation.

Think of layoutget as a byte-range locking mechanism for writes to a
single file from multiple clients. First of all, a client writing
through the page cache without O_DIRECT will delay processing of a
layout recall for too long if its user-space application really writes
a lot. As a consequence we observe significant performance degradation,
and sometimes the server decides that the client is not responding at
all and tries to fence it so that the second client can take the lock.
At this moment the first client is still writing, while the server has
already released the xfs_lease associated with the first client's
layout. If we are really unlucky, XFS might reallocate those extents,
so the first client will be writing outside the file. And if we are
really, really unlucky, XFS might put some metadata there as well.
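The ordering below is only a toy illustration of the difference; the
function names are hypothetical and do not correspond to real nfsd or
XFS code:

    #include <stdio.h>

    /* Hypothetical stubs, ordering illustration only. */
    static void fence_client(void)         { puts("ask storage to fence"); }
    static void wait_for_fence_ack(void)   { puts("storage confirms fence"); }
    static void release_layout_lease(void) { puts("lease gone, extents reusable"); }

    int main(void)
    {
            /* Optimistic (racy): the lease is dropped while the fenced
             * client may still have writes in flight. */
            fence_client();
            release_layout_lease();

            /* Pessimistic (safe): hold the lease until the fence has
             * actually taken effect on the storage side. */
            fence_client();
            wait_for_fence_ack();
            release_layout_lease();
            return 0;
    }

--
Sergey Bashirov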