On Wed, May 07, 2025 at 05:50:14PM -0400, Mike Snitzer wrote:
> Hey Dave,
>
> Thanks for providing your thoughts on all this. More inlined below.
>
> On Wed, May 07, 2025 at 12:50:20PM +1000, Dave Chinner wrote:
> > On Tue, May 06, 2025 at 08:06:51PM -0400, Jeff Layton wrote:
> > > On Wed, 2025-05-07 at 08:31 +1000, Dave Chinner wrote:
> > > > What are the fio control parameters of the IO you are doing? (e.g.
> > > > is this single threaded IO, does it use the psync, libaio or iouring
> > > > engine, etc)
> > >
> > > ; fio-seq-write.job for fiotest
> > >
> > > [global]
> > > name=fio-seq-write
> > > filename=fio-seq-write
> > > rw=write
> > > bs=256K
> > > direct=0
> > > numjobs=1
> > > time_based
> > > runtime=900
> > >
> > > [file1]
> > > size=10G
> > > ioengine=libaio
> > > iodepth=16
> >
> > Ok, so we are doing AIO writes on the client side, so we have ~16
> > writes on the wire from the client at any given time.
>
> Jeff's workload is really underwhelming given he is operating well
> within available memory (so avoiding reclaim, etc). As such this test
> is really not testing what RWF_DONTCACHE is meant to address (and to
> answer Chuck's question of "what do you hope to get from
> RWF_DONTCACHE?"): the ability to reach steady state where even if
> memory is oversubscribed the network pipes and NVMe devices are as
> close to 100% utilization as possible.

Right. However, one of the things that has to be kept in mind is that
we don't have 100% of the CPU dedicated to servicing RWF_DONTCACHE IO
like the fio microbenchmarks have. Applications are going to take a
chunk of CPU time to create/marshall/process the data that we are
doing IO on, so any time we spend on doing IO is less time that the
applications have to do their work.

If you can saturate the storage without saturating CPUs, then
RWF_DONTCACHE should allow that steady state to be maintained
indefinitely.

However, RWF_DONTCACHE does not remove the data copy overhead of
buffered IO, whilst it adds IO submission overhead to each IO. Hence
it will require more CPU time to saturate the storage devices than
normal buffered IO. If you've got CPU to spare, great. If you don't,
then overall performance will be reduced.

> > This also means they are likely not being received by the NFS server
> > in sequential order, and the NFS server is going to be processing
> > roughly 16 write RPCs to the same file concurrently using
> > RWF_DONTCACHE IO.
> >
> > These are not going to be exactly sequential - the server side IO
> > pattern to the filesystem is quasi-sequential, with random IOs being
> > out of order and leaving temporary holes in the file until the OO
> > write is processed.
> >
> > XFS should handle this fine via the speculative preallocation beyond
> > EOF that is triggered by extending writes (it was designed to
> > mitigate the fragmentation this NFS behaviour causes). However, we
> > should always keep in mind that while client side IO is sequential,
> > what the server is doing to the underlying filesystem needs to be
> > treated as "concurrent IO to a single file" rather than "sequential
> > IO".
>
> Hammerspace has definitely seen that 1MB IO coming off the wire is
> fragmented by the time XFS issues it to underlying storage; so much
> so that IOPs bound devices (e.g. AWS devices that are capped at ~10K
> IOPs) are choking due to all the small IO.

That should not happen in the general case. Can you start a separate
thread to triage the issue so we can try to understand why that is
happening?
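One way to gather data for that triage would be to count how many
extents each file ends up with once writeback completes - essentially
what filefrag reports via the FIEMAP ioctl. The program below is only
an illustrative sketch (minimal error handling, file path supplied on
the command line, and it prints just the extent count rather than the
full mapping):

/*
 * Illustrative only: count the mapped extents of a file via
 * FS_IOC_FIEMAP, which is roughly what filefrag does.
 */
#include <fcntl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * With fm_extent_count = 0 the kernel only reports how many
	 * extents would be returned, which is all we need here.
	 */
	struct fiemap fm;
	memset(&fm, 0, sizeof(fm));
	fm.fm_start = 0;
	fm.fm_length = FIEMAP_MAX_OFFSET;
	fm.fm_flags = FIEMAP_FLAG_SYNC;	/* flush dirty data first */
	fm.fm_extent_count = 0;

	if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0) {
		perror("FS_IOC_FIEMAP");
		close(fd);
		return 1;
	}

	printf("%s: %u extents\n", argv[1], fm.fm_mapped_extents);
	close(fd);
	return 0;
}

Running something like this (or filefrag -v for the full extent list)
against the files the NFS clients wrote, before and after any change,
would show whether the quasi-sequential write pattern is really what
is producing all the small extents.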
> So yeah, minimizing the fragmentation is critical (and largely *not*
> solved at this point... hacks like sync mount from NFS client or
> using O_DIRECT at the client, which sets the sync bit, help reduce
> the fragmentation, but as soon as you go full buffered the N=16+ IOs
> on the wire will fragment each other).

Fragmentation mitigation for NFS server IO is generally only
addressable at the filesystem level - it's not really something you
can mitigate at the NFS server or client.

> Do you recommend any particular tuning to help XFS's speculative
> preallocation work for many competing "sequential" IO threads?

I can suggest lots of things, but without knowing the IO pattern, the
fragmentation pattern, the filesystem state, what triggers the
fragmentation, etc, I'd just be guessing as to which knob might make
the problem go away (hence the request to separate that out).

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
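For reference, the per-IO drop-behind behaviour discussed in this
thread looks roughly like the following from userspace. This is only
a minimal sketch, not the nfsd code path: it assumes a kernel where
pwritev2() accepts RWF_DONTCACHE for buffered writes (the fallback
#define should match the uapi value if the libc headers lack it), and
the filename and buffer size simply mirror the fio job quoted above.

/*
 * Minimal userspace sketch of a buffered write with a per-IO
 * drop-behind hint.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_DONTCACHE
#define RWF_DONTCACHE 0x00000080
#endif

int main(void)
{
	int fd = open("fio-seq-write", O_WRONLY | O_CREAT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	size_t len = 256 * 1024;	/* bs=256K, as in the fio job */
	char *buf = malloc(len);
	if (!buf) {
		close(fd);
		return 1;
	}
	memset(buf, 0xa5, len);

	struct iovec iov = { .iov_base = buf, .iov_len = len };

	/*
	 * Still buffered IO: the data is copied into the page cache
	 * and written back as usual, but the pages are dropped once
	 * clean instead of hanging around until reclaim pushes them
	 * out.
	 */
	ssize_t ret = pwritev2(fd, &iov, 1, 0, RWF_DONTCACHE);
	if (ret < 0)
		perror("pwritev2(RWF_DONTCACHE)");

	free(buf);
	close(fd);
	return ret < 0;
}

The copy into the page cache still happens and the flag is applied on
every submission, which is the CPU trade-off described above.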