On Wed, May 07, 2025 at 09:43:05AM -0400, Chuck Lever wrote:
> On 5/6/25 10:50 PM, Dave Chinner wrote:
> > Ok, so buffered writes (even with RWF_DONTCACHE) are not processed
> > concurrently by XFS - there's an exclusive lock on the inode that
> > will be serialising all the buffered write IO.
> > 
> > Given that most of the work that XFS will be doing during the write
> > will not require releasing the CPU, there is a good chance that
> > there is spin contention on the i_rwsem from the 15 other write
> > waiters.
> 
> This observation echoes my experience with a client pushing 16MB
> writes via 1MB NFS WRITEs to one file. They are serialized on the
> server by the i_rwsem (or a similar generic per-file lock). The first
> NFS WRITE to be emitted by the client is as fast as can be expected,
> but the RTT of the last NFS WRITE to be emitted by the client is
> almost exactly 16 times longer.

Yes, that is the symptom that will be visible if you just batch write
IO 16 at a time. If you allow AIO submission up to a depth of 16
(i.e. first 16 submit in a batch, then submit new IO in completion
batch sizes) then there are always 16 writes on the wire instead of it
trailing off like 16 -> 0, 16 -> 0, 16 -> 0.

This would at least keep the pipeline full, but it does nothing to
address the IO latency of the server side serialisation. There is some
work in progress to allow concurrent buffered writes in XFS, and this
would largely solve this issue for the NFS server...

> I've wanted to drill into this for some time, but unfortunately (for
> me) I always seem to have higher priority issues to deal with.

It's really an XFS thing, not an NFS server problem...

> Comparing performance with a similar patch series that implements
> uncached server-side I/O with O_DIRECT rather than RWF_UNCACHED might
> be illuminating.

Yes, that will directly compare concurrent vs serialised submission,
but O_DIRECT will also include IO completion latency in the write RTT,
so overall write throughput can still go down.

In my experience, improving NFS IO throughput is all about maximising
the number of OTW requests in flight (client side) whilst
simultaneously minimising the latency of individual IO operations
(server side). RWF_DONTCACHE makes the latency of individual
operations somewhat worse; O_DIRECT makes it quite a bit worse.
O_DIRECT, however, can mitigate IO latency via concurrency, but
RWF_DONTCACHE cannot (yet).

Hence it is no surprise to me that, everything else being equal, these
server side options actually reduce throughput rather than improve
it...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
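
For reference, a minimal sketch of the submission pattern described
above: keep 16 writes in flight by topping the ring back up as each
completion arrives, rather than submitting a batch of 16 and letting
it drain to zero. This is an illustrative sketch, not code from the
thread: the liburing calls are real, but the test file name, the 1MB
chunk size and the 16MB total are assumed values matching the example,
and RWF_DONTCACHE is only defined by new enough kernel headers (hence
the #ifdef).

/*
 * Keep QD=16 1MB writes in flight against one file.
 * Error handling omitted for brevity.
 *
 * Build: cc -O2 -o qd16 qd16.c -luring
 */
#include <liburing.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>

#define QD	16
#define CHUNK	(1024 * 1024)		/* 1MB writes, as in the example */
#define NCHUNKS	16			/* 16MB total */

int main(void)
{
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	unsigned int submitted = 0, completed = 0;
	char *buf = aligned_alloc(4096, CHUNK);
	int fd = open("testfile", O_WRONLY | O_CREAT, 0644);

	memset(buf, 'x', CHUNK);
	io_uring_queue_init(QD, &ring, 0);

	/* prime the pipeline with the first QD writes */
	while (submitted < QD && submitted < NCHUNKS) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

		io_uring_prep_write(sqe, fd, buf, CHUNK,
				    (__u64)submitted * CHUNK);
#ifdef RWF_DONTCACHE
		sqe->rw_flags = RWF_DONTCACHE;	/* drop cache after IO */
#endif
		submitted++;
	}
	io_uring_submit(&ring);

	/* top the queue back up as each write completes */
	while (completed < NCHUNKS) {
		io_uring_wait_cqe(&ring, &cqe);
		io_uring_cqe_seen(&ring, cqe);
		completed++;

		if (submitted < NCHUNKS) {
			struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

			io_uring_prep_write(sqe, fd, buf, CHUNK,
					    (__u64)submitted * CHUNK);
#ifdef RWF_DONTCACHE
			sqe->rw_flags = RWF_DONTCACHE;
#endif
			submitted++;
			io_uring_submit(&ring);
		}
	}

	io_uring_queue_exit(&ring);
	free(buf);
	return 0;
}

Submitting all 16 and then waiting for every completion before the
next batch is what produces the trailing-off behaviour described
above; resubmitting per completion is what keeps the wire full.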