Write:

IO mode: access bw(MiB/s) IOPS   Latency(s) block(KiB) xfer(KiB) open(s)  wr/rd(s) close(s) total(s) iter
-------- ------ --------- ----   ---------- ---------- --------- -------- -------- -------- -------- ----
O_DIRECT write  348132    348133 0.002035   278921216  1024.00   0.040978 600.89   384.04   600.90   0
CACHED   write  295579    295579 0.002416   278921216  1024.00   0.051602 707.73   355.27   707.73   0

Read:

IO mode: access bw(MiB/s) IOPS   Latency(s) block(KiB) xfer(KiB) open(s)  wr/rd(s) close(s) total(s) iter
-------- ------ --------- ----   ---------- ---------- --------- -------- -------- -------- -------- ----
O_DIRECT read   347971    347973 0.001928   278921216  1024.00   0.017612 601.17   421.30   601.17   0
CACHED   read   60653     60653  0.006894   278921216  1024.00   0.017279 3448.99  2975.23  3448.99  0

> > - ability to support more IO threads per client system (from 48 to 64)
>
> This last item: how do you measure the "ability to support more
> threads"? Is there a latency curve that is flatter? Do you see changes
> in the latency distribution and the number of latency outliers?

Mainly in the context of the IOR benchmark results: we can see that
increasing PPN becomes detrimental because the score stops improving, or
gets worse.

> My general comment here is kind of in the "related or future work"
> category. This is not an objection, just thinking out loud.
>
> But, can we get more insight into specifically where the CPU
> utilization reduction comes from? Is it lock contention? Is it
> inefficient data structure traversal? Any improvement here benefits
> everyone, so that should be a focus of some study.

Buffered IO simply consumes more resources than O_DIRECT for workloads
whose working set exceeds system memory. Each of the 6 servers has 1 TiB
of memory. For the above 6-client, 128 PPN IOR "easy" run, each client
thread writes and then reads 266 GiB. That creates an aggregate working
set of 6 clients * 128 threads * 266 GiB ~= 199.50 TiB, which dwarfs the
servers' aggregate 6 TiB of memory.

Being able to drive each of the 8 NVMe devices in each server as
efficiently as possible is therefore critical, and as the NVMe results
above show, O_DIRECT is best.

> If the memory utilization is a problem, that sounds like an issue with
> kernel systems outside of NFSD, or perhaps some system tuning can be
> done to improve matters. Again, drilling into this and trying to improve
> it will benefit everyone.

Yeah, there is an extensive, iceberg-scale issue with buffered IO and MM
(reclaim's use of kswapd and kcompactd to find free pages) that
underpins the justification for RWF_DONTCACHE being developed and
merged. I'm not the best person to speak to all the long-standing
challenges (Willy, Dave, Jens and others would be better).

> These results do point to some problems, clearly. Whether NFSD using
> direct I/O is the best solution is not obvious to me yet.

All solutions are on the table; O_DIRECT just happens to be the most
straightforward to work through at this point. Dave Chinner's feeling
that O_DIRECT is a much better solution than RWF_DONTCACHE for NFSD
certainly helped narrow my focus too, from:
https://lore.kernel.org/linux-nfs/aBrKbOoj4dgUvz8f@xxxxxxxxxxxxxxxxxxx/

"The nfs client largely aligns all of the page cache based IO, so I'd
think that O_DIRECT on the server side would be much more performant
than RWF_DONTCACHE. Especially as XFS will do concurrent O_DIRECT
writes all the way down to the storage....."
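To make the alignment point concrete, here is a minimal userspace sketch
(my illustration, not NFSD code; the path, 4 KiB block size, and 1 MiB
transfer size are assumptions) of the constraint O_DIRECT imposes: the
buffer address, the file offset, and the IO length must all be aligned
to the device's logical block size, or the kernel rejects the IO with
EINVAL.

/* o_direct_align.c: why page/block alignment matters for O_DIRECT */
#define _GNU_SOURCE	/* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const size_t align = 4096;	/* assumed 4 KiB logical block size */
	const size_t len = 1024 * 1024;	/* one 1 MiB transfer, as in the runs above */
	void *buf;

	int fd = open("/mnt/test/file", O_WRONLY | O_CREAT | O_DIRECT, 0644);
	if (fd < 0)
		return 1;

	/* posix_memalign() returns a block-aligned buffer; a misaligned
	 * buffer (e.g. a payload landing in the middle of a page, as with
	 * SUNRPC TCP receive) would make the pwrite() below fail with
	 * EINVAL. */
	if (posix_memalign(&buf, align, len))
		return 1;
	memset(buf, 0, len);

	/* offset 0 is aligned; a misaligned offset would also be rejected */
	ssize_t ret = pwrite(fd, buf, len, 0);

	free(buf);
	close(fd);
	return ret == (ssize_t)len ? 0 : 1;
}

That alignment is the guarantee NFSD would need for every WRITE payload
before it could issue O_DIRECT down to the filesystem.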
(Dave would be correct about NFSD's page alignment if RDMA is used, but
that is obviously not the case with TCP, because SUNRPC TCP receives the
WRITE payload into misaligned pages.)

Thanks,
Mike