On 7/15/25 10:50 AM, Mike Snitzer wrote:
> On Tue, Jul 15, 2025 at 09:59:05AM -0400, Chuck Lever wrote:
>> Hi Mike,
>>
>> There are a lot of speculative claims here. I would prefer that the
>> motivation for this work focus on the workload that is actually
>> suffering from the added layer of cache, rather than making some
>> claim that "hey, this change is good for all taxpayers!" ;-)
>
> Really not sure what you're referring to. I didn't make any
> speculative claims...
>
>> On 7/14/25 6:42 PM, Mike Snitzer wrote:
>>> Hi,
>>>
>>> Summary (by Jeff Layton [0]):
>>> "The basic problem is that the pagecache is pretty useless for
>>> satisfying READs from nfsd.
>>
>> A bold claim like this needs to be backed up with careful benchmark
>> results.
>>
>> But really, the actual problem that you are trying to address is that,
>> for /your/ workloads, the server's page cache is not useful and can be
>> counterproductive when the server's working set is larger than its RAM.
>>
>> So, I would replace this sentence.
>
> Oh, you are referring to Jeff's previous summary. Noted! ;)
>
>>> Most NFS workloads don't involve I/O to
>>> the same files from multiple clients. The client ends up having most
>>> of the data in its cache already and only very rarely do we need to
>>> revisit the data on the server.
>>
>> Maybe it would be better to say:
>>
>> "Common NFS workloads do not involve shared files, and client working
>> sets can comfortably fit in each client's page cache."
>>
>> And then add a description of the workload you are trying to optimize.
>
> Sure, certainly can/will do for v4 (if/when a v4 is needed).
>
>>> At the same time, it's really easy to overwhelm the storage with
>>> pagecache writeback with modern memory sizes.
>>
>> Again, perhaps this isn't quite accurate? The problem is not only the
>> server's memory size; it's that the server doesn't start writeback soon
>> enough, writes back without parallelism, and does not handle thrashing
>> very well. This is very likely due to the traditional Linux design
>> that makes writeback lazy (in the computer science sense of "lazy"),
>> assuming that if the working set does not fit in memory, then you
>> should simply purchase more RAM.
>>
>>> Having nfsd bypass the
>>> pagecache altogether is potentially a huge performance win, if it can
>>> be made to work safely."
>>
>> Then finally, "Therefore, we provide the option to make I/O avoid the
>> NFS server's page cache, as an experiment." Which I hope is somewhat
>> less alarming to folks who still rely on the server's page cache.
>
> I can tighten it up, respecting/including your feedback. 0th patch
> header aside, are you wanting this included somewhere in Documentation?

Nothing in a fixed Documentation file, at least until we start nailing
down the new per-export administrative interfaces.

> (If it were to be part of Documentation you'd then be welcome to
> refine it as you see needed, but I can take a stab at laying down a
> starting point.)

You are in full control of the cover letter, of course. I wanted to
point out where I thought the purpose of this work might differ a
little from what is advertised in this cover letter, which is currently
the only record of the rationale for the series.
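For anyone following the thread who has not worked with O_DIRECT:
bypassing the page cache comes with alignment constraints that the
filesystem advertises through statx(STATX_DIOALIGN). The sketch below
is illustration only (none of it is code from this series, and it
assumes reasonably current kernel headers); it queries those limits
and issues a single aligned direct read.

/*
 * Stand-alone demo: ask the filesystem what O_DIRECT requires via
 * statx(STATX_DIOALIGN), then read one aligned block without going
 * through the page cache.  Illustration only; not part of the series.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	struct statx stx;
	size_t align, len;
	void *buf;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	if (statx(AT_FDCWD, argv[1], 0, STATX_DIOALIGN, &stx) < 0) {
		perror("statx");
		return 1;
	}
	if (!(stx.stx_mask & STATX_DIOALIGN) || !stx.stx_dio_offset_align) {
		fprintf(stderr, "%s: O_DIRECT not supported\n", argv[1]);
		return 1;
	}
	printf("memory alignment %u, offset/length alignment %u\n",
	       stx.stx_dio_mem_align, stx.stx_dio_offset_align);

	fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open(O_DIRECT)");
		return 1;
	}

	/* Buffer address, file offset, and length must all honor the
	 * reported alignment or the read is likely to fail with EINVAL. */
	align = stx.stx_dio_mem_align;
	if (align < sizeof(void *))
		align = sizeof(void *);	/* posix_memalign() minimum */
	len = stx.stx_dio_offset_align;

	if (posix_memalign(&buf, align, len)) {
		close(fd);
		return 1;
	}
	if (pread(fd, buf, len, 0) < 0)
		perror("pread");

	free(buf);
	close(fd);
	return 0;
}

A READ request arriving from a client has no reason to respect those
limits, which is why the last patch in the series needs extra handling
for misaligned READs.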
>>> The performance win associated with using NFSD DIRECT was previously
>>> summarized here:
>>> https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@xxxxxxxxxx/
>>> This picture offers a nice summary of performance gains:
>>> https://original.art/NFSD_direct_vs_buffered_IO.jpg
>>>
>>> This v3 series was developed on top of Chuck's nfsd-testing, which has
>>> 2 patches that saw fh_getattr() moved, etc. (v2 of this series included
>>> those patches, but since they got review during v2 and Chuck already
>>> has them staged in nfsd-testing I didn't think it made sense to keep
>>> them included in this v3).
>>>
>>> Changes since v2 include:
>>> - explored suggestion to use a string-based interface (e.g. "direct"
>>>   instead of 3) but debugfs seems to only support numeric values.
>>> - shifted numeric values for the debugfs interface from 0-2 to 1-3 and
>>>   made 0 UNSPECIFIED (which is the default)
>>> - if the user specifies an io_cache_read or io_cache_write mode other
>>>   than 1, 2 or 3 (via debugfs) they will get an error message
>>> - pass a data structure to nfsd_analyze_read_dio rather than so many
>>>   in/out params
>>> - improved comments as requested (e.g. "Must remove first
>>>   start_extra_page from rqstp->rq_bvec" was reworked)
>>> - use memmove instead of an open-coded shift in
>>>   nfsd_complete_misaligned_read_dio
>>> - dropped the still very important "lib/iov_iter: remove piecewise
>>>   bvec length checking in iov_iter_aligned_bvec" patch because it
>>>   needs to be handled separately.
>>> - various other changes to improve code
>>>
>>> Thanks,
>>> Mike
>>>
>>> [0]: https://lore.kernel.org/linux-nfs/b1accdad470f19614f9d3865bb3a4c69958e5800.camel@xxxxxxxxxx/
>>>
>>> Mike Snitzer (5):
>>>   NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
>>>   NFSD: pass nfsd_file to nfsd_iter_read()
>>>   NFSD: add io_cache_read controls to debugfs interface
>>>   NFSD: add io_cache_write controls to debugfs interface
>>>   NFSD: issue READs using O_DIRECT even if IO is misaligned
>>>
>>>  fs/nfsd/debugfs.c          | 102 +++++++++++++++++++
>>>  fs/nfsd/filecache.c        |  32 ++++++
>>>  fs/nfsd/filecache.h        |   4 +
>>>  fs/nfsd/nfs4xdr.c          |   8 +-
>>>  fs/nfsd/nfsd.h             |  10 ++
>>>  fs/nfsd/nfsfh.c            |   4 +
>>>  fs/nfsd/trace.h            |  37 +++++++
>>>  fs/nfsd/vfs.c              | 197 ++++++++++++++++++++++++++++++++++---
>>>  fs/nfsd/vfs.h              |   2 +-
>>>  include/linux/sunrpc/svc.h |   5 +-
>>>  10 files changed, 383 insertions(+), 18 deletions(-)
>>>
>>
>> The series is beginning to look clean to me, and we have introduced
>> several simple but effective clean-ups along the way.
>
> Thanks.
>
>> My only concern is that we're making the read path more complex rather
>> than less. (This isn't a new concern; I have wanted to make reads
>> simpler by, say, removing splice support, for quite a while, as you
>> know.) I'm hoping that, once the experiment has "concluded", we find
>> ways of simplifying the code and the administrative interface. (That
>> is not an objection; call it a Future Work comment.)
>
> Yeah, the READ path does get more complex, but less so than before my
> having factored code out to a couple of methods... open to any cleanup
> suggestions to run with as "Future Work". I think the pivot from
> debugfs to per-export controls will be a perfect opportunity to polish.
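A side note on the first changelog item above: the simple debugfs
attribute helpers (DEFINE_DEBUGFS_ATTRIBUTE and friends) really are
numeric, which is why "direct" ends up being written as 3 rather than
as a string. Below is a rough, self-contained sketch of a knob with
that shape: it defaults to 0 (unspecified), accepts only 1-3, and
rejects everything else. All names here are placeholders; this is not
the code in fs/nfsd/debugfs.c, and the only mode meaning taken from
the thread is that 3 selects O_DIRECT.

/*
 * Illustration only; placeholder names, not fs/nfsd/debugfs.c.
 * A numeric debugfs knob that defaults to 0 ("unspecified") and
 * accepts only the values 1-3, as described in the changelog.
 */
#include <linux/debugfs.h>
#include <linux/module.h>

enum {
	DEMO_IO_UNSPECIFIED	= 0,	/* default */
	DEMO_IO_MODE_1		= 1,
	DEMO_IO_MODE_2		= 2,
	DEMO_IO_DIRECT		= 3,	/* "direct" in the thread */
};

static u64 demo_io_cache_read = DEMO_IO_UNSPECIFIED;

static int demo_io_cache_read_get(void *data, u64 *val)
{
	*val = demo_io_cache_read;
	return 0;
}

static int demo_io_cache_read_set(void *data, u64 val)
{
	/* Anything outside 1-3 is refused, so the writer sees an error. */
	if (val < DEMO_IO_MODE_1 || val > DEMO_IO_DIRECT)
		return -EINVAL;
	demo_io_cache_read = val;
	return 0;
}

DEFINE_DEBUGFS_ATTRIBUTE(demo_io_cache_read_fops, demo_io_cache_read_get,
			 demo_io_cache_read_set, "%llu\n");

static struct dentry *demo_dir;

static int __init demo_debugfs_init(void)
{
	demo_dir = debugfs_create_dir("demo_nfsd", NULL);
	debugfs_create_file_unsafe("io_cache_read", 0644, demo_dir, NULL,
				   &demo_io_cache_read_fops);
	return 0;
}

static void __exit demo_debugfs_exit(void)
{
	debugfs_remove(demo_dir);
}

module_init(demo_debugfs_init);
module_exit(demo_debugfs_exit);
MODULE_DESCRIPTION("debugfs numeric-knob demo");
MODULE_LICENSE("GPL");

With something like that loaded, "echo 3 > /sys/kernel/debug/demo_nfsd/io_cache_read"
selects the direct mode, and a write of anything outside 1-3 fails with
EINVAL, which is the error behavior the changelog describes.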
>> Also, a remaining open question is how we want to deal with READ_PLUS
>> and holes.
>
> Hmm, not familiar with this... I'll have a look. But if you have
> anything further on this point please share.

Currently I don't think we need to deal with it in this patch set.
But note that NFSv4.2 READ_PLUS can return a map of unallocated areas
in a file. We should think a little about whether additional logic is
needed when using O_DIRECT READs.

--
Chuck Lever