On 6/10/25 4:57 PM, Mike Snitzer wrote:
> The O_DIRECT performance win is pretty fantastic thanks to reduced CPU
> and memory use, particularly for workloads with a working set that far
> exceeds the available memory of a given server. This patchset's
> changes (through patch 5; patch 6 wasn't written until after the
> benchmarking was performed) enabled Hammerspace to improve its
> IO500.org benchmark result (as submitted for this week's ISC 2025 in
> Hamburg, Germany) by 25%.
>
> That 25% improvement on IO500 is owed to NFS servers seeing:
> - reduced CPU usage from 100% to ~50%
>     O_DIRECT:
>       write: 51% idle, 25% system, 14% IO wait, 2% IRQ
>       read:  55% idle,  9% system, 32.5% IO wait, 1.5% IRQ
>     buffered:
>       write: 17.8% idle, 67.5% system, 8% IO wait, 2% IRQ
>       read:   3.29% idle, 94.2% system, 2.5% IO wait, 1% IRQ

The IO wait and IRQ numbers for the buffered results appear to be
significantly better than for O_DIRECT. Can you help us understand
that? Is device utilization better or worse with O_DIRECT?

> - reduced memory usage from just under 100% (987GiB for reads, 978GiB
>   for writes) to only ~244 MB for cache+buffer use (for both reads and
>   writes).
>   - buffered would tip over due to kswapd and kcompactd struggling to
>     find free memory during reclaim.
>
> - increased NVMe throughput when comparing O_DIRECT vs buffered:
>     O_DIRECT: 8-10 GB/s for writes, 9-11.8 GB/s for reads
>     buffered: 8 GB/s for writes, 4-5 GB/s for reads
>
> - ability to support more IO threads per client system (from 48 to 64)

This last item: how do you measure the "ability to support more
threads"? Is there a flatter latency curve? Do you see changes in the
latency distribution and in the number of latency outliers?

My general comment here is in the "related or future work" category.
This is not an objection, just thinking out loud. But can we get more
insight into specifically where the CPU utilization reduction comes
from? Is it lock contention? Is it inefficient data structure
traversal? Any improvement here benefits everyone, so that should be a
focus of some study.

If the memory utilization is a problem, that sounds like an issue with
kernel systems outside of NFSD, or perhaps some system tuning can be
done to improve matters. Again, drilling into this and trying to
improve it will benefit everyone.

These results do point to some problems, clearly. Whether NFSD using
direct I/O is the best solution is not obvious to me yet.

--
Chuck Lever
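
One way to make the "more threads" question concrete is to sweep the
client thread count and record latency percentiles rather than only
throughput. A minimal fio sketch (the job parameters, target directory,
and thread counts are illustrative assumptions, not values taken from
the IO500 runs):

  # Illustrative only: sweep thread counts, recording latency percentiles
  for jobs in 48 56 64; do
      fio --name=nfs-dio --directory=/mnt/nfs --rw=write --bs=1M \
          --direct=1 --ioengine=libaio --iodepth=16 --size=4G \
          --numjobs=$jobs --time_based --runtime=60 --group_reporting \
          --lat_percentiles=1 --percentile_list=50:90:99:99.9:99.99 \
          --output-format=json --output=lat-${jobs}threads.json
  done

Comparing p99/p99.9 across thread counts, with the server running the
buffered path versus the patched O_DIRECT path, would show whether the
tail of the latency distribution stays flat as threads are added, which
is one concrete reading of "supports more threads".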
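
On the question of where the CPU reduction comes from, a system-wide
profile taken on the server during the buffered run would help separate
lock contention from data structure traversal cost. A sketch using perf
(durations are placeholders; perf lock needs a kernel with the lock
tracepoints enabled):

  # Whole-system CPU profile while the workload is running
  perf record -a -g -- sleep 30
  perf report --no-children

  # Kernel lock contention over a similar window
  perf lock record -a -- sleep 30
  perf lock report

If the buffered profile is dominated by page cache and reclaim paths
rather than by nfsd itself, that would support the reading that the
issue lies outside NFSD.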
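
On the reclaim side, some of the kswapd/kcompactd pressure described
above can be influenced from user space before concluding that buffered
I/O cannot keep up. A sketch of knobs to experiment with (the values
are placeholders, not recommendations):

  # Placeholders only; appropriate values depend on the machine
  sysctl -w vm.min_free_kbytes=1048576               # keep more free-memory headroom
  sysctl -w vm.watermark_scale_factor=200            # wake kswapd earlier
  sysctl -w vm.dirty_background_bytes=$((4 << 30))   # start writeback sooner
  sysctl -w vm.dirty_bytes=$((16 << 30))             # cap dirty page buildup

Whether any of this closes the gap with O_DIRECT is an open question,
but it would help separate "buffered I/O is inherently too expensive
here" from "the reclaim and writeback defaults are wrong for this
workload".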