On Thu, Jun 12, 2025 at 09:46:12AM -0400, Chuck Lever wrote:
> On 6/10/25 4:57 PM, Mike Snitzer wrote:
> > The O_DIRECT performance win is pretty fantastic thanks to reduced CPU
> > and memory use, particularly for workloads with a working set that far
> > exceeds the available memory of a given server. This patchset's
> > changes (through patch 5; patch 6 wasn't written until after the
> > benchmarking was performed) enabled Hammerspace to improve its
> > IO500.org benchmark result (as submitted for this week's ISC 2025 in
> > Hamburg, Germany) by 25%.
> >
> > That 25% improvement on IO500 is owed to NFS servers seeing:
> > - reduced CPU usage from 100% to ~50%

Apples: 10 servers, 10 clients, 64 PPN (Processes Per Node):

> >   O_DIRECT:
> >     write: 51% idle, 25% system, 14% IO wait, 2% IRQ
> >     read: 55% idle, 9% system, 32.5% IO wait, 1.5% IRQ

Oranges: 6 servers, 6 clients, 128 PPN:

> >   buffered:
> >     write: 17.8% idle, 67.5% system, 8% IO wait, 2% IRQ
> >     read: 3.29% idle, 94.2% system, 2.5% IO wait, 1% IRQ
>
> The IO wait and IRQ numbers for the buffered results appear to be
> significantly better than for O_DIRECT. Can you help us understand
> that? Is device utilization better or worse with O_DIRECT?

It was a post-mortem data analysis failure: when I worked with others
(Jon and Keith, cc'd) to collect performance data for use in my 0th
header (above), we didn't have buffered IO performance data from the
full IO500-scale benchmark (10 servers, 10 clients). I missed that the
above married apples and oranges until you noticed something off...
Sorry about that.

We don't currently have the full 10 nodes available, so Keith re-ran
IOR "easy" testing with 6 nodes to collect new data. NOTE: his run used
a much larger PPN (128 instead of 64) coupled with a reduction in the
number of both client and server nodes (from 10 to 6).

Here is CPU usage for one of the server nodes while running IOR "easy"
with 128 PPN on each of 6 clients, against 6 servers:

- reduced CPU usage from 100% to ~56% for read and ~71.6% for write
  O_DIRECT:
    write: 28.4% idle, 50% system, 13% IO wait, 2% IRQ
    read: 44% idle, 11% system, 39% IO wait, 1.8% IRQ
  buffered:
    write: 17% idle, 68.4% system, 8.5% IO wait, 2% IRQ
    read: 3.51% idle, 94.5% system, 0% IO wait, 0.6% IRQ

And the associated NVMe performance:

- increased NVMe throughput when comparing O_DIRECT vs buffered:
  O_DIRECT: 10.5 GB/s for writes, 11.6 GB/s for reads
  buffered: 7.75-8 GB/s for writes; 4 GB/s for reads before tip-over,
  but only 800 MB/s after

("tip-over" is when reclaim starts to dominate because free pages can
no longer be found efficiently, so kswapd and kcompactd burn a lot of
resources.)

Illustrative sketches of what O_DIRECT requires of callers, and of how
to watch reclaim for this tip-over via /proc/vmstat, are appended at
the end of this mail.

And again, here is the associated 6-node IOR "easy" NVMe performance in
graph form:
https://original.art/NFSD_direct_vs_buffered_IO.jpg

> > - reduced memory usage from just under 100% (987GiB for reads, 978GiB
> >   for writes) to only ~244 MB for cache+buffer use (for both reads
> >   and writes).
> > - buffered would tip-over due to kswapd and kcompactd struggling to
> >   find free memory during reclaim.

This memory usage data still holds with the 6-server testbed.

> > - increased NVMe throughput when comparing O_DIRECT vs buffered:
> >   O_DIRECT: 8-10 GB/s for writes, 9-11.8 GB/s for reads
> >   buffered: 8 GB/s for writes, 4-5 GB/s for reads

And again, here is the end result for the IOR easy benchmark:
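
As an aside, here is a minimal userspace sketch (hypothetical, not part
of this patchset) of what the O_DIRECT path requires of a caller; the
file path and the 4K alignment below are illustrative assumptions. The
aligned buffer is transferred without being staged in the page cache,
which is where the CPU and memory savings above come from:

	/*
	 * Hypothetical O_DIRECT writer sketch; path and sizes are
	 * illustrative only.
	 */
	#define _GNU_SOURCE		/* for O_DIRECT */
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	#define ALIGNMENT	4096		/* assumed logical block size */
	#define IO_SIZE		(1 << 20)	/* 1 MiB per write */

	int main(void)
	{
		void *buf;
		int fd;

		/* O_DIRECT requires a suitably aligned user buffer */
		if (posix_memalign(&buf, ALIGNMENT, IO_SIZE))
			return 1;
		memset(buf, 0xab, IO_SIZE);

		/* path is illustrative only */
		fd = open("/mnt/nfs/testfile",
			  O_WRONLY | O_CREAT | O_DIRECT, 0644);
		if (fd < 0) {
			perror("open");
			return 1;
		}

		/* offset and length must also be block-aligned */
		if (pwrite(fd, buf, IO_SIZE, 0) != IO_SIZE)
			perror("pwrite");

		close(fd);
		free(buf);
		return 0;
	}

If the buffer, offset, or length were not block-aligned, most
filesystems would fail the write with EINVAL rather than fall back to
buffered IO.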
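
And here is a rough sketch (again just an illustration, not from the
patchset) of how one could watch for the buffered-IO tip-over described
above by sampling reclaim/compaction counters from /proc/vmstat; which
counters best signal trouble for a given workload is an assumption on
my part:

	/*
	 * Sample a few kswapd/kcompactd related counters from
	 * /proc/vmstat and print their per-second deltas.
	 */
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	static const char *keys[] = {
		"pgscan_kswapd", "pgsteal_kswapd",
		"compact_stall", "compact_fail",
	};
	#define NKEYS (sizeof(keys) / sizeof(keys[0]))

	static void sample(unsigned long vals[NKEYS])
	{
		char name[64];
		unsigned long val;
		FILE *f = fopen("/proc/vmstat", "r");

		if (!f)
			return;
		/* each line of /proc/vmstat is "<name> <value>" */
		while (fscanf(f, "%63s %lu", name, &val) == 2) {
			for (size_t i = 0; i < NKEYS; i++)
				if (!strcmp(name, keys[i]))
					vals[i] = val;
		}
		fclose(f);
	}

	int main(void)
	{
		unsigned long prev[NKEYS] = {0}, cur[NKEYS] = {0};

		sample(prev);
		for (;;) {
			sleep(5);
			sample(cur);
			for (size_t i = 0; i < NKEYS; i++) {
				printf("%s/s: %lu  ", keys[i],
				       (cur[i] - prev[i]) / 5);
				prev[i] = cur[i];
			}
			printf("\n");
		}
		return 0;
	}

Sharply climbing deltas alongside falling throughput would be
consistent with the reclaim-dominated behaviour described above.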