On Wed, 2025-05-07 at 17:50 -0400, Mike Snitzer wrote:
> Hey Dave,
>
> Thanks for providing your thoughts on all this. More inlined below.
>
> On Wed, May 07, 2025 at 12:50:20PM +1000, Dave Chinner wrote:
> > On Tue, May 06, 2025 at 08:06:51PM -0400, Jeff Layton wrote:
> > > On Wed, 2025-05-07 at 08:31 +1000, Dave Chinner wrote:
> > > > On Tue, May 06, 2025 at 01:40:35PM -0400, Jeff Layton wrote:
> > > > > FYI I decided to try and get some numbers with Mike's RWF_DONTCACHE patches for nfsd [1]. Those add a module param that makes all reads and writes use RWF_DONTCACHE.
> > > > >
> > > > > I had one host that was running knfsd with an XFS export, and a second that was acting as NFS client. Both machines have tons of memory, so pagecache utilization is irrelevant for this test.
> > > >
> > > > Does RWF_DONTCACHE result in server side STABLE write requests from the NFS client, or are they still unstable and require a post-write completion COMMIT operation from the client to trigger server side writeback before the client can discard the page cache?
> > >
> > > The latter. I didn't change the client at all here (other than to allow it to do bigger writes on the wire). It's just doing bog-standard buffered I/O. nfsd is adding RWF_DONTCACHE to every write via Mike's patch.
> >
> > Ok, that wasn't clear that it was only server side RWF_DONTCACHE.
> >
> > I have some more context from a different (internal) discussion thread about how poorly the NFSD read side performs with RWF_DONTCACHE compared to O_DIRECT. This is because there is massive page allocator spin lock contention due to all the concurrent reads being serviced.
>
> That discussion started with: it's a very chaotic workload "read a bunch of large files that cause memory to be oversubscribed 2.5x across 8 servers". Many knfsd threads (~240) per server handling 1MB IO to 8 XFS filesystems on NVMe (so 8 servers, each with 8 NVMe devices).
>
> For others' benefit here is the flamegraph for this heavy nfsd.nfsd_dontcache=Y read workload as seen on 1 of the 8 servers:
> https://original.art/dontcache_read.svg
>
> Dave offered this additional analysis:
> "the flame graph indicates massive lock contention in the page allocator (i.e. on the page free lists). There's a chunk of time in data copying (copy_page_to_iter), but 70% of the CPU usage looks to be page allocator spinlock contention."
>
> All this causes RWF_DONTCACHE reads to be considerably slower than normal buffered reads (only getting 40-66% of normal buffered read throughput; the worse read performance occurs when the system is less loaded). How knfsd is handling the IO seems to be contributing to the 100% CPU usage. If fio is used (with pvsync2 and uncached=1) directly to a single XFS filesystem then CPU is ~50%.
>
> (Jeff: not following why you were seeing EOPNOTSUPP for RWF_DONTCACHE reads, is that somehow due to the rsize/wsize patches from Chuck? RWF_DONTCACHE reads work with my patch you quoted as "[1]").

Possibly. I'm not sure either. I hit that error on reads with RWF_DONTCACHE enabled and decided to focus on writes for the moment. I'll run it down when I get a chance.

> > The buffered write path locking is different, but I suspect something similar is occurring and I'm going to ask you to confirm it...

I started collecting perf traces today, but I'm having trouble getting meaningful reports out of it. So, I'm working on it, but stay tuned.
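
As an aside, the direct-to-XFS comparison Mike describes above should be easy for others to reproduce with a fio job along these lines. The pvsync2 engine and uncached=1 are the options Mike mentions; everything else (target directory, sizes, thread count, runtime) is just a guess at a reasonable setup, not the job Hammerspace actually ran:

; dontcache-read.fio: rough sketch, not the actual Hammerspace job
; assumes an XFS filesystem mounted at /mnt/test; adjust to taste
[global]
name=dontcache-read
directory=/mnt/test
rw=read
bs=1M
direct=0
; pvsync2 + uncached=1 issues buffered reads with RWF_DONTCACHE
ioengine=pvsync2
uncached=1
numjobs=8
time_based
runtime=300

[file1]
size=10G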

> With knfsd to XFS on NVMe, the favorable difference for RWF_DONTCACHE writes is that despite also seeing 100% CPU usage due to lock contention et al, RWF_DONTCACHE does perform 0-54% better compared to normal buffered writes that exceed the system's memory by 2.5x (largest gains seen with the most extreme load).
>
> Without RWF_DONTCACHE the system gets pushed to reclaim and the associated work really hurts.

That makes total sense. The boxes I've been testing on have gobs of memory. The system never gets pushed into reclaim. It sounds like I need to do some testing with small memory sizes (maybe in VMs).

> As tested with knfsd we've been generally unable to see the reduced CPU usage that is documented in Jens' commit headers:
> for reads: https://git.kernel.org/linus/8026e49bff9b
> for writes: https://git.kernel.org/linus/d47c670061b5
> But as mentioned above, eliminating knfsd and testing XFS directly with fio does generally reflect what Jens documented.
>
> So more work is needed to address knfsd's RWF_DONTCACHE inefficiencies.

Agreed.

> > > > > I tested sequential writes using the fio-seq-write.fio test, both with and without the module param enabled.
> > > > >
> > > > > These numbers are from one run each, but they were pretty stable over several runs:
> > > > >
> > > > > # fio /usr/share/doc/fio/examples/fio-seq-write.fio
> > > >
> > > > $ cat /usr/share/doc/fio/examples/fio-seq-write.fio
> > > > cat: /usr/share/doc/fio/examples/fio-seq-write.fio: No such file or directory
> > > > $
> > > >
> > > > What are the fio control parameters of the IO you are doing? (e.g. is this single threaded IO, does it use the psync, libaio or iouring engine, etc)
> > >
> > > ; fio-seq-write.job for fiotest
> > >
> > > [global]
> > > name=fio-seq-write
> > > filename=fio-seq-write
> > > rw=write
> > > bs=256K
> > > direct=0
> > > numjobs=1
> > > time_based
> > > runtime=900
> > >
> > > [file1]
> > > size=10G
> > > ioengine=libaio
> > > iodepth=16
> >
> > Ok, so we are doing AIO writes on the client side, so we have ~16 writes on the wire from the client at any given time.
>
> Jeff's workload is really underwhelming given he is operating well within available memory (so avoiding reclaim, etc). As such this test is really not testing what RWF_DONTCACHE is meant to address (and to answer Chuck's question of "what do you hope to get from RWF_DONTCACHE?"): the ability to reach steady state where even if memory is oversubscribed the network pipes and NVMe devices are as close to 100% utilization as possible.

I'll see about setting up something more memory-constrained on the server side. That would be more interesting for sure.

> > This also means they are likely not being received by the NFS server in sequential order, and the NFS server is going to be processing roughly 16 write RPCs to the same file concurrently using RWF_DONTCACHE IO.
> >
> > These are not going to be exactly sequential - the server side IO pattern to the filesystem is quasi-sequential, with random IOs being out of order and leaving temporary holes in the file until the out-of-order write is processed.
> >
> > XFS should handle this fine via the speculative preallocation beyond EOF that is triggered by extending writes (it was designed to mitigate the fragmentation this NFS behaviour causes).
> > However, we should always keep in mind that while client side IO is sequential, what the server is doing to the underlying filesystem needs to be treated as "concurrent IO to a single file" rather than "sequential IO".
>
> Hammerspace has definitely seen that 1MB IO coming off the wire is fragmented by the time XFS issues it to underlying storage; so much so that IOPS-bound devices (e.g. AWS devices that are capped at ~10K IOPS) are choking due to all the small IO.
>
> So yeah, minimizing the fragmentation is critical (and largely *not* solved at this point... hacks like a sync mount on the NFS client, or using O_DIRECT at the client, which sets the sync bit, help reduce the fragmentation, but as soon as you go full buffered the N=16+ IOs on the wire will fragment each other).
>
> Do you recommend any particular tuning to help XFS's speculative preallocation work for many competing "sequential" IO threads? Like would having 32 AGs allow for 32 speculative preallocation engines? Or is it only possible to split across AGs for different inodes?
> (Sorry, I really do aim to get more well-versed with XFS... it's only been ~17 years that it has featured in IO stacks I've had to engineer, ugh...).
>
> > > > > wsize=1M:
> > > > >
> > > > > Normal: WRITE: bw=1034MiB/s (1084MB/s), 1034MiB/s-1034MiB/s (1084MB/s-1084MB/s), io=910GiB (977GB), run=901326-901326msec
> > > > > DONTCACHE: WRITE: bw=649MiB/s (681MB/s), 649MiB/s-649MiB/s (681MB/s-681MB/s), io=571GiB (613GB), run=900001-900001msec
> > > > >
> > > > > DONTCACHE with a 1M wsize vs. recent (v6.14-ish) knfsd was about 30% slower. Memory consumption was down, but these boxes have oodles of memory, so I didn't notice much change there.
> > > >
> > > > So what is the IO pattern that the NFSD is sending to the underlying XFS filesystem?
> > > >
> > > > Is it sending 1M RWF_DONTCACHE buffered IOs to XFS as well (i.e. one buffered write IO per NFS client write request), or is DONTCACHE only being used on the NFS client side?
> > >
> > > It should be sequential I/O, though the writes would be coming in from different nfsd threads. nfsd just does standard buffered I/O. The WRITE handler calls nfsd_vfs_write(), which calls vfs_write_iter(). With the module parameter enabled, it also adds RWF_DONTCACHE.
> >
> > Ok, so buffered writes (even with RWF_DONTCACHE) are not processed concurrently by XFS - there's an exclusive lock on the inode that will be serialising all the buffered write IO.
> >
> > Given that most of the work that XFS will be doing during the write will not require releasing the CPU, there is a good chance that there is spin contention on the i_rwsem from the 15 other write waiters.
> >
> > That may be a contributing factor to poor performance, so kernel profiles from the NFS server for both the normal buffered write path and the RWF_DONTCACHE buffered write path would be useful. Having some idea of the total CPU usage of the nfsds during the workload would also be useful.
> >
> > > DONTCACHE is only being used on the server side. To be clear, the protocol doesn't support that flag (yet), so we have no way to project DONTCACHE from the client to the server (yet). This is just early exploration to see whether DONTCACHE offers any benefit to this workload.
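
(As an aside for anyone following along who wants to see what a DONTCACHE write looks like without involving nfsd at all: the module parameter essentially just makes nfsd's buffered writes behave like a pwritev2() call with RWF_DONTCACHE set. Below is a minimal userspace sketch of that, purely illustrative and not Mike's patch; the fallback flag value is an assumption taken from recent uapi headers, and EOPNOTSUPP is what to expect on a kernel or filesystem without DONTCACHE support.)

/*
 * dontcache_write.c: illustration only, not the nfsd patch.
 * Does one 1MB buffered write with the RWF_DONTCACHE hint, which asks
 * the kernel to drop the written folios from the page cache once
 * writeback completes.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_DONTCACHE
#define RWF_DONTCACHE 0x00000080	/* assumption: value from recent uapi linux/fs.h */
#endif

int main(int argc, char **argv)
{
	static char buf[1 << 20];	/* 1MB, to mirror the wsize=1M runs */
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	ssize_t ret;
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	memset(buf, 'x', sizeof(buf));

	/* an ordinary buffered write, just hinted with RWF_DONTCACHE */
	ret = pwritev2(fd, &iov, 1, 0, RWF_DONTCACHE);
	if (ret < 0)
		perror("pwritev2(RWF_DONTCACHE)");	/* EOPNOTSUPP if unsupported */

	close(fd);
	return ret < 0 ? 1 : 0;
}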

> > The nfs client largely aligns all of the page cache based IO, so I'd think that O_DIRECT on the server side would be much more performant than RWF_DONTCACHE. Especially as XFS will do concurrent O_DIRECT writes all the way down to the storage.....
>
> Yes. We really need to add full-blown O_DIRECT support to knfsd. And Hammerspace wants me to work on it ASAP. But I welcome all the help I can get, I have ideas but look forward to discussing next week at Bakeathon and/or in this thread...
>
> The first hurdle is coping with the head and/or tail of the IO being misaligned relative to the underlying storage's logical_block_size. We need to cull off the misaligned IO and use RWF_DONTCACHE for it, but use O_DIRECT for the aligned middle.
>
> I aim to deal with that for NFS LOCALIO first (NFS client issues IO direct to XFS, bypassing knfsd) and then reuse it for knfsd's O_DIRECT support.

I'll be interested to hear your thoughts on this!

> > > > > I wonder if we need some heuristic that makes generic_write_sync() only kick off writeback immediately if the whole folio is dirty so we have more time to gather writes before kicking off writeback?
> > > >
> > > > You're doing aligned 1MB IOs - there should be no partially dirty large folios in either the client or the server page caches.
> > >
> > > Interesting. I wonder what accounts for the slowdown with 1M writes? It seems likely to be related to the more aggressive writeback with DONTCACHE enabled, but it'd be good to understand this.
> >
> > What I suspect is that block layer IO submission latency has increased significantly with RWF_DONTCACHE and that is slowing down the rate at which it can service buffered writes to a single file.
> >
> > The difference between normal buffered writes and RWF_DONTCACHE is that the write() context will marshall the dirty folios into bios and submit them to the block layer (via generic_write_sync()). If the underlying device queues are full, then the bio submission will be throttled to wait for IO completion.
> >
> > At this point, all NFSD write processing to that file stalls. All the other nfsds are blocked on the i_rwsem, and that can't be released until the holder is released by the block layer throttling. Hence any time the underlying device queue fills, nfsd processing of incoming writes stalls completely.
> >
> > When doing normal buffered writes, this IO submission stalling does not occur because there is no direct writeback occurring in the write() path.
> >
> > Remember the bad old days of balance_dirty_pages() doing dirty throttling by submitting dirty pages for IO directly in the write() context? And how much better buffered write performance and write() submission latency became when we started deferring that IO to the writeback threads and waiting on completions?
> >
> > We're essentially going back to the bad old days with buffered RWF_DONTCACHE writes. Instead of one nicely formed background writeback stream that can be throttled at the block layer without adversely affecting incoming write throughput, we end up with every write() context submitting IO synchronously and being randomly throttled by the block layer throttle....
> >
> > There are a lot of reasons the current RWF_DONTCACHE implementation is sub-optimal for common workloads.
> > This IO spraying and submission side throttling problem is one of the reasons why I suggested very early on that an async write-behind window (similar in concept to async readahead windows) would likely be a much better generic solution for RWF_DONTCACHE writes. This would retain the "one nicely formed background writeback stream" behaviour that is desirable for buffered writes, but still allow rapid reclaim of DONTCACHE folios as IO cleans them...
>
> I recall you voicing this concern and nobody really seizing on it. Could be that Jens is open to changing the RWF_DONTCACHE implementation if/when more proof of the need is presented?

It does seem like using RWF_DONTCACHE currently leads to a lot of fragmented I/O. I suspect that doing filemap_fdatawrite_range_kick() after every DONTCACHE write is the main problem on the write side. We probably need to come up with a way to make it flush more optimally when there are streaming DONTCACHE writes. An async write-behind window could be a solution. How would we implement that? Some sort of delay before we kick off writeback (and hopefully for larger ranges)?

> > > > That said, this is part of the reason I asked about both whether the client side write is STABLE and whether RWF_DONTCACHE is used on the server side. i.e. using either of those will trigger writeback on the server side immediately; in the case of the former it will also complete before returning to the client and not require a subsequent COMMIT RPC to wait for server side IO completion...
> > >
> > > I need to go back and sniff traffic to be sure, but I'm fairly certain the client is issuing regular UNSTABLE writes and following up with a later COMMIT, at least for most of them. The occasional STABLE write might end up getting through, but that should be fairly rare.
> >
> > Yeah, I don't think that's an issue given that only the server side is using RWF_DONTCACHE. The COMMIT will effectively just be a journal and/or device cache flush as all the dirty data has already been written back to storage....
>
> FYI, most of Hammerspace's RWF_DONTCACHE testing has been using O_DIRECT for client IO and nfsd.nfsd_dontcache=Y on the server.

Good to know. I'll switch my testing to O_DIRECT as well. The client-side pagecache isn't adding any benefit to this.
--
Jeff Layton <jlayton@xxxxxxxxxx>