On Wed, May 07, 2025 at 09:43:05AM -0400, Chuck Lever wrote:
> On 5/6/25 10:50 PM, Dave Chinner wrote:
> > Ok, so buffered writes (even with RWF_DONTCACHE) are not processed
> > concurrently by XFS - there's an exclusive lock on the inode that
> > will be serialising all the buffered write IO.
> > 
> > Given that most of the work that XFS will be doing during the write
> > will not require releasing the CPU, there is a good chance that
> > there is spin contention on the i_rwsem from the 15 other write
> > waiters.
> 
> This observation echoes my experience with a client pushing 16MB
> writes via 1MB NFS WRITEs to one file. They are serialized on the
> server by the i_rwsem (or a similar generic per-file lock). The first
> NFS WRITE to be emitted by the client is as fast as can be expected,
> but the RTT of the last NFS WRITE to be emitted by the client is
> almost exactly 16 times longer.

Yes, that is the symptom that will be visible if you just batch write
IO 16 at a time. If you allow AIO submission up to a depth of 16
(i.e. first 16 submit in a batch, then submit new IO in completion
batch sizes) then there are always 16 writes on the wire instead of it
trailing off like 16 -> 0, 16 -> 0, 16 -> 0.

This would at least keep the pipeline full, but it does nothing to
address the IO latency of the server side serialisation. There is some
work in progress to allow concurrent buffered writes in XFS, and this
would largely solve this issue for the NFS server...

> I've wanted to drill into this for some time, but unfortunately (for
> me) I always seem to have higher priority issues to deal with.

It's really an XFS thing, not an NFS server problem...

> Comparing performance with a similar patch series that implements
> uncached server-side I/O with O_DIRECT rather than RWF_UNCACHED might
> be illuminating.

Yes, that will directly compare concurrent vs serialised submission,
but O_DIRECT will also include IO completion latency in the write RTT,
so overall write throughput can still go down.

In my experience, improving NFS IO throughput is all about maximising
the number of OTW requests in flight (client side) whilst
simultaneously minimising the latency of individual IO operations
(server side). RWF_DONTCACHE makes the latency of individual
operations somewhat worse; O_DIRECT makes it quite a bit worse.
O_DIRECT, however, can mitigate IO latency via concurrency, but
RWF_DONTCACHE cannot (yet).

Hence it is no surprise to me that, everything else being equal, these
server side options actually reduce throughput rather than improve
it...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
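
For reference, a minimal sketch of the submission pattern described
above: keep 16 writes in flight by topping the ring back up as each
completion arrives, rather than submitting a batch of 16 and letting
it drain to zero. This is an illustrative sketch, not code from the
thread: the liburing calls are real, but the test file name, the 1MB
chunk size and the 16MB total are assumed values matching the example,
and RWF_DONTCACHE is only defined by new enough kernel headers (hence
the #ifdef).

/*
 * Keep QD=16 1MB writes in flight against one file.
 * Error handling omitted for brevity.
 *
 * Build: cc -O2 -o qd16 qd16.c -luring
 */
#include <liburing.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>

#define QD	16
#define CHUNK	(1024 * 1024)		/* 1MB writes, as in the example */
#define NCHUNKS	16			/* 16MB total */

int main(void)
{
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	unsigned int submitted = 0, completed = 0;
	char *buf = aligned_alloc(4096, CHUNK);
	int fd = open("testfile", O_WRONLY | O_CREAT, 0644);

	memset(buf, 'x', CHUNK);
	io_uring_queue_init(QD, &ring, 0);

	/* prime the pipeline with the first QD writes */
	while (submitted < QD && submitted < NCHUNKS) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

		io_uring_prep_write(sqe, fd, buf, CHUNK,
				    (__u64)submitted * CHUNK);
#ifdef RWF_DONTCACHE
		sqe->rw_flags = RWF_DONTCACHE;	/* drop cache after IO */
#endif
		submitted++;
	}
	io_uring_submit(&ring);

	/* top the queue back up as each write completes */
	while (completed < NCHUNKS) {
		io_uring_wait_cqe(&ring, &cqe);
		io_uring_cqe_seen(&ring, cqe);
		completed++;

		if (submitted < NCHUNKS) {
			struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

			io_uring_prep_write(sqe, fd, buf, CHUNK,
					    (__u64)submitted * CHUNK);
#ifdef RWF_DONTCACHE
			sqe->rw_flags = RWF_DONTCACHE;
#endif
			submitted++;
			io_uring_submit(&ring);
		}
	}

	io_uring_queue_exit(&ring);
	free(buf);
	return 0;
}

Submitting all 16 and then waiting for every completion before the
next batch is what produces the trailing-off behaviour described
above; resubmitting per completion is what keeps the wire full.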