Write:

IO mode: access bw(MiB/s) IOPS   Latency(s) block(KiB) xfer(KiB) open(s)  wr/rd(s) close(s) total(s) iter
-------- ------ --------- ----   ---------- ---------- --------- -------- -------- -------- -------- ----
O_DIRECT write  348132    348133 0.002035   278921216  1024.00   0.040978 600.89   384.04   600.90   0
CACHED   write  295579    295579 0.002416   278921216  1024.00   0.051602 707.73   355.27   707.73   0

Read:

IO mode: access bw(MiB/s) IOPS   Latency(s) block(KiB) xfer(KiB) open(s)  wr/rd(s) close(s) total(s) iter
-------- ------ --------- ----   ---------- ---------- --------- -------- -------- -------- -------- ----
O_DIRECT read   347971    347973 0.001928   278921216  1024.00   0.017612 601.17   421.30   601.17   0
CACHED   read   60653     60653  0.006894   278921216  1024.00   0.017279 3448.99  2975.23  3448.99  0

> > - ability to support more IO threads per client system (from 48 to 64)
>
> This last item: how do you measure the "ability to support more
> threads"? Is there a latency curve that is flatter? Do you see changes
> in the latency distribution and the number of latency outliers?

Mainly in the context of the IOR benchmark results: we can see that
increasing PPN becomes detrimental because the score stops improving, or
gets worse.

> My general comment here is kind of in the "related or future work"
> category. This is not an objection, just thinking out loud.
>
> But, can we get more insight into specifically where the CPU
> utilization reduction comes from? Is it lock contention? Is it
> inefficient data structure traversal? Any improvement here benefits
> everyone, so that should be a focus of some study.

Buffered IO simply consumes more resources than O_DIRECT for workloads
whose working set exceeds system memory. Each of the 6 servers has 1 TiB
of memory. For the above 6-client, 128 PPN IOR "easy" run, each client
thread writes and then reads 266 GiB. That creates an aggregate working
set of 6 clients * 128 threads * 266 GiB ~= 199.50 TiB, which dwarfs the
servers' aggregate 6 TiB of memory.

Being able to drive each of the 8 NVMe devices in each server as
efficiently as possible is therefore critical, and as the NVMe results
above show, O_DIRECT is best.

> If the memory utilization is a problem, that sounds like an issue with
> kernel systems outside of NFSD, or perhaps some system tuning can be
> done to improve matters. Again, drilling into this and trying to improve
> it will benefit everyone.

Yeah, there is an extensive, iceberg-scale issue with buffered IO and MM
(reclaim's use of kswapd and kcompactd to find free pages) that
underpins the justification for RWF_DONTCACHE being developed and
merged. I'm not the best person to speak to all the long-standing
challenges (Willy, Dave, Jens and others would be better).

> These results do point to some problems, clearly. Whether NFSD using
> direct I/O is the best solution is not obvious to me yet.

All solutions are on the table; O_DIRECT just happens to be the most
straightforward to work through at this point. Dave Chinner's feeling
that O_DIRECT is a much better solution than RWF_DONTCACHE for NFSD
certainly helped narrow my focus too, from:
https://lore.kernel.org/linux-nfs/aBrKbOoj4dgUvz8f@xxxxxxxxxxxxxxxxxxx/

"The nfs client largely aligns all of the page cache based IO, so I'd
think that O_DIRECT on the server side would be much more performant
than RWF_DONTCACHE. Especially as XFS will do concurrent O_DIRECT
writes all the way down to the storage....."
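To make the alignment point concrete, here is a minimal userspace sketch
(my illustration, not NFSD code; the path, 4 KiB block size, and 1 MiB
transfer size are assumptions) of the constraint O_DIRECT imposes: the
buffer address, the file offset, and the IO length must all be aligned
to the device's logical block size, or the kernel rejects the IO with
EINVAL.

/* o_direct_align.c: why page/block alignment matters for O_DIRECT */
#define _GNU_SOURCE	/* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const size_t align = 4096;	/* assumed 4 KiB logical block size */
	const size_t len = 1024 * 1024;	/* one 1 MiB transfer, as in the runs above */
	void *buf;

	int fd = open("/mnt/test/file", O_WRONLY | O_CREAT | O_DIRECT, 0644);
	if (fd < 0)
		return 1;

	/* posix_memalign() returns a block-aligned buffer; a misaligned
	 * buffer (e.g. a payload landing in the middle of a page, as with
	 * SUNRPC TCP receive) would make the pwrite() below fail with
	 * EINVAL. */
	if (posix_memalign(&buf, align, len))
		return 1;
	memset(buf, 0, len);

	/* offset 0 is aligned; a misaligned offset would also be rejected */
	ssize_t ret = pwrite(fd, buf, len, 0);

	free(buf);
	close(fd);
	return ret == (ssize_t)len ? 0 : 1;
}

That alignment is the guarantee NFSD would need for every WRITE payload
before it could issue O_DIRECT down to the filesystem.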
(Dave would be correct about NFSD's page alignment if RDMA is used, but
that is obviously not the case with TCP, because SUNRPC TCP receives the
WRITE payload into misaligned pages.)

Thanks,
Mike