Re: performance of nfsd with RWF_DONTCACHE and larger wsizes

On Wed, May 07, 2025 at 08:09:33PM -0400, Jeff Layton wrote:
> On Wed, 2025-05-07 at 17:50 -0400, Mike Snitzer wrote:
> > Hey Dave,
> > 
> > Thanks for providing your thoughts on all this.  More inlined below.
> > 
> > On Wed, May 07, 2025 at 12:50:20PM +1000, Dave Chinner wrote:
> > > Remember the bad old days of balance_dirty_pages() doing dirty
> > > throttling by submitting dirty pages for IO directly in the write()
> > > context? And how much better buffered write performance and write()
> > > submission latency became when we started deferring that IO to the
> > > writeback threads and waiting on completions?
> > > 
> > > We're essentially going back to the bad old days with buffered
> > > RWF_DONTCACHE writes. Instead of one nicely formed background
> > > writeback stream that can be throttled at the block layer without
> > > adversely affecting incoming write throughput, we end up with every
> > > write() context submitting IO synchronously and being randomly
> > > throttled by the block layer throttle....
> > > 
> > > There are a lot of reasons the current RWF_DONTCACHE implementation
> > > is sub-optimal for common workloads. This IO spraying and submission
> > > side throttling problem
> > > is one of the reasons why I suggested very early on that an async
> > > write-behind window (similar in concept to async readahead windows)
> > > would likely be a much better generic solution for RWF_DONTCACHE
> > > writes. This would retain the "one nicely formed background
> > > writeback stream" behaviour that is desirable for buffered writes,
> > > but still allow rapid reclaim of DONTCACHE folios as IO cleans
> > > them...
> > 
> > I recall you voicing this concern and nobody really seizing on it.
> > Could be that Jens is open to changing the RWF_DONTCACHE
> > implementation if/when there's more evidence of the need?
> 
> It does seem like using RWF_DONTCACHE currently leads to a lot of
> fragmented I/O. I suspect that doing filemap_fdatawrite_range_kick()
> after every DONTCACHE write is the main problem on the write side. We
> probably need to come up with a way to make it flush more optimally
> when there are streaming DONTCACHE writes.
> 
> An async write-behind window could be a solution. How would we implement
> that? Some sort of delay before we kick off writeback (and hopefully
> for larger ranges)?

My thoughts on this are as follows...

When we mark the inode dirty, we currently put it on the list of
dirty inodes for writeback. We could change how we mark an inode
dirty for RWF_DONTCACHE writes to say "dirty for write-through" and
put it on a new write-through inode list.  Then we can kick an expedited
write-through worker thread that writes back all the dirty
write-through inodes on its list.

In this case, a delay of a few milliseconds is probably long enough
to allow decent write-through IO sizes to build up without causing
excessive page cache memory usage for dirty DONTCACHE folios...
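
Roughly, the marking side could look something like the sketch below.
To be clear, the I_DIRTY_WRITETHROUGH flag, the b_dirty_writethrough
list, the dwork_writethrough worker and the 5ms delay are all names I
just made up for illustration -- none of this exists today:

	/*
	 * Sketch only: hypothetical names throughout, and the
	 * wb->list_lock handling is elided.
	 */
	static void mark_inode_dirty_writethrough(struct inode *inode)
	{
		struct bdi_writeback *wb = inode_to_wb(inode);

		spin_lock(&inode->i_lock);
		inode->i_state |= I_DIRTY_WRITETHROUGH;
		spin_unlock(&inode->i_lock);

		/* Queue on the write-through list rather than b_dirty. */
		list_move(&inode->i_io_list, &wb->b_dirty_writethrough);

		/*
		 * Kick the expedited write-through worker after a few
		 * milliseconds so decent IO sizes can build up first.
		 */
		mod_delayed_work(bdi_wq, &wb->dwork_writethrough,
				 msecs_to_jiffies(5));
	}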

The other thing that this could allow is throttling incoming
RWF_DONTCACHE IOs in balance_dirty_pages_ratelimited. e.g. if more
than 16MB of DONTCACHE folios are built up on a BDI, kick the
write-through worker and wait for the DONTCACHE folio count to drop.
This then gives some control (and potential admin control) over how
much dirty page cache is allowed to accrue for DONTCACHE write
IOs...
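
Again a sketch only -- the WB_DONTCACHE_DIRTY counter, the
dontcache_wait waitqueue and the hard-coded 16MB threshold are all
invented for the example (the threshold would presumably become a
per-bdi tunable):

	static void balance_dontcache_pages(struct bdi_writeback *wb)
	{
		/* 16MB worth of folios; hypothetical limit. */
		unsigned long limit = (16 * SZ_1M) >> PAGE_SHIFT;

		if (wb_stat(wb, WB_DONTCACHE_DIRTY) <= limit)
			return;

		/* Kick the write-through worker immediately... */
		mod_delayed_work(bdi_wq, &wb->dwork_writethrough, 0);

		/* ...and wait for the DONTCACHE folio count to drop. */
		wait_event(wb->dontcache_wait,
			   wb_stat(wb, WB_DONTCACHE_DIRTY) <= limit / 2);
	}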

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx



