On Thu, 2025-08-28 at 04:09 -0400, Mike Snitzer wrote:
> On Wed, Aug 27, 2025 at 09:57:39PM -0400, Chuck Lever wrote:
> > On 8/27/25 7:15 PM, Mike Snitzer wrote:
> > > On Wed, Aug 27, 2025 at 04:56:08PM -0400, Chuck Lever wrote:
> > > > On 8/27/25 3:41 PM, Mike Snitzer wrote:
> > > > > Is your suggestion to, rather than allocate a disjoint single page, borrow the extra page from the end of rq_pages? Just map it into the bvec instead of my extra page?
> > > >
> > > > Yes, the extra page needs to come from rq_pages. But I don't see why it should come from the /end/ of rq_pages.
> > > >
> > > > - Extend the start of the byte range back to make it align with the file's DIO alignment constraint
> > > >
> > > > - Extend the end of the byte range forward to make it align with the file's DIO alignment constraint
> > >
> > > nfsd_analyze_read_dio() does that (start_extra and end_extra).
> > >
> > > > - Fill in the sink buffer's bvec using pages from rq_pages, as usual
> > > >
> > > > - When the I/O is complete, adjust the offset in the first bvec entry forward by setting a non-zero page offset, and adjust the returned count downward to match the requested byte count from the client
> > >
> > > Tried it long ago; such bvec manipulation only works when not using RDMA. When the memory is remote, twiddling a local bvec isn't going to ensure the correct pages have the correct data upon return to the client.
> > >
> > > RDMA is why the pages must be used in-place, and RDMA is also why the extra page needed by this patch (for use as throwaway front-pad for an expanded misaligned DIO READ) must either be allocated _or_ hopefully it can come from rq_pages (after the end of the client-requested READ payload).
> > >
> > > Or am I wrong and simply need to keep learning about NFSD's IO path?
> >
> > You're wrong, not to put a fine point on it.
>
> You didn't even understand me... but firmly believe I'm wrong?
>
> > There's nothing I can think of in the RDMA or RPC/RDMA protocols that mandates that the first page offset must always be zero. Moving data at one address on the server to an entirely different address and alignment on the client is exactly what RDMA is supposed to do.
> >
> > It sounds like an implementation omission because the server's upper layers have never needed it before now. If TCP already handles it, I'm guessing it's going to be straightforward to fix.
>
> I never said that the first page offset must be zero. I said that I already did what you suggested and it didn't work with RDMA. This is recollection from too many months ago now, but: the client will see the correct READ payload _except_ IIRC it is offset by whatever front-pad was added to expand the misaligned DIO, no matter whether rqstp->rq_bvec is updated when the IO completes.
>
> But I'll revisit it again.
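For illustration, the expand-and-trim being described above boils down to arithmetic along these lines. This is just a throwaway userspace sketch: the example values and the names (dio_align and the rest) are placeholders of mine, not code from the patch or from nfsd_analyze_read_dio():

/*
 * Throwaway userspace sketch of the expand-and-trim arithmetic for a
 * misaligned DIO READ.  Example values and names (dio_align,
 * start_extra, end_extra) are placeholders, not NFSD code.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint64_t offset    = 4604;      /* client-requested READ offset */
        uint64_t count     = 8192;      /* client-requested READ length */
        uint64_t dio_align = 4096;      /* file/device DIO alignment */

        /* Expand the requested range outward to alignment boundaries. */
        uint64_t dio_start = offset & ~(dio_align - 1);
        uint64_t dio_end   = (offset + count + dio_align - 1) & ~(dio_align - 1);

        uint64_t start_extra = offset - dio_start;              /* front pad */
        uint64_t end_extra   = dio_end - (offset + count);      /* tail pad */

        printf("issue DIO read: offset=%llu len=%llu (start_extra=%llu, end_extra=%llu)\n",
               (unsigned long long)dio_start,
               (unsigned long long)(dio_end - dio_start),
               (unsigned long long)start_extra,
               (unsigned long long)end_extra);

        /*
         * On completion, the reply would skip the front pad by bumping
         * the first bvec entry's offset and return only the bytes the
         * client asked for.
         */
        printf("trim for reply: first bvec offset += %llu, returned count = %llu\n",
               (unsigned long long)start_extra, (unsigned long long)count);
        return 0;
}

The arithmetic isn't the contentious part; the question in this sub-thread is whether the RPC/RDMA side honors a non-zero offset in that first bvec entry the way the TCP path apparently does.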
> > > > > NFSD using DIO is optional. I thought the point was to get it as an available option so that _others_ could experiment and help categorize the benefits/pitfalls further?
> > > >
> > > > Yes, that is the point. But such experiments lose value if there is no data collection plan to go with them.
> > >
> > > Each user runs something they care about performing well and they measure the result.
> >
> > That assumes the user will continue to use the debug interfaces, and the particular implementation you've proposed, for the rest of time. And that's not my plan at all.
> >
> > If we, in the community, cannot reproduce that result, or cannot understand what has been measured, or the measurement misses part or most of the picture, of what value is that for us to decide whether and how to proceed with promoting the mechanism from debug feature to something with a long-term support lifetime and a documented ABI-stable user interface?
>
> I'll work to put a finer point on how to reproduce, and enumerate the things to look for (representative flamegraphs showing the issue, which I already did at the last Bakeathon).
>
> But I have repeatedly offered that the pathological worst case is a client doing sequential write IO of a file that is 3-4x larger than the NFS server's system memory.
>
> Large-memory systems with 8 or more NVMe devices and fast networks that allow for huge data-ingest capabilities: these are the platforms that showcase MM's dirty-writeback limitations when large sequential IO is initiated from the NFS client and it's able to overrun the NFS server.
>
> In addition, in general DIO requires significantly less memory and CPU; so platforms that have more limited resources (and may have historically struggled) could have a new lease on life if they switch NFSD from buffered to DIO mode.
>
> > > Literally the same thing as has been done for anything in Linux since it all started. Nothing unicorn or bespoke here.
> >
> > So let me ask this another way: what do we need users to measure to give us good-quality information about the page cache behavior and system thrashing behavior you reported?
>
> IO throughput, CPU and memory usage should be monitored over time.
>
> > For example: I can enable direct I/O on NFSD, but my workload is mostly one or two clients doing kernel builds. The latency of NFS READs goes up, but since a kernel build is not I/O bound and the client page caches hide most of the increase, there is very little to show a measured change.
> >
> > So how should I assess and report the impact of NFSD doing direct I/O?
>
> Your underwhelming usage isn't what this patchset is meant to help.
>
> > See -- users are not the only ones who are involved in this experiment; and they will need guidance because we're not providing any documentation for this feature.
>
> Users are not created equal. Major companies like Oracle and Meta _should_ be aware of NFSD's problems with buffered IO. They have internal and external stakeholders that are power users.
>
> Jeff, does Meta ever see NFSD struggle to consistently use NVMe devices? Lumpy performance? Full-blown IO stalls? Lots of NFSD threads hung in D state?

Yes. We're particularly interested in this work for that reason. A lot of the workload is large, streaming writes at the application layer that are only rarely ever read, and quite a bit later when that does happen. This means that the pagecache is pretty useless. My _guess_ is that DIO will help that significantly, though I do still have some concerns about using buffered I/O for the edges of unaligned WRITEs.
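To be concrete about "edges": as I understand the approach, an unaligned WRITE gets carved into a buffered head and/or tail around a DIO-aligned middle, roughly like the throwaway sketch below (the names and the 4k alignment are placeholders of mine, not taken from the patchset):

/*
 * Throwaway sketch of carving an unaligned WRITE into a buffered head,
 * a DIO-aligned middle, and a buffered tail.  Names and the 4k
 * alignment are placeholders, not code from the patchset.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint64_t offset = 1000;         /* client WRITE offset */
        uint64_t count  = 1048576;      /* client WRITE length */
        uint64_t align  = 4096;         /* DIO alignment */

        uint64_t end       = offset + count;
        uint64_t mid_start = (offset + align - 1) & ~(align - 1);
        uint64_t mid_end   = end & ~(align - 1);

        if (mid_start >= mid_end) {
                /* Too small or too misaligned to carve out a DIO middle. */
                printf("buffered:      [%llu, %llu)\n",
                       (unsigned long long)offset, (unsigned long long)end);
                return 0;
        }

        if (offset < mid_start)         /* misaligned head stays buffered */
                printf("buffered head: [%llu, %llu)\n",
                       (unsigned long long)offset, (unsigned long long)mid_start);

        printf("DIO middle:    [%llu, %llu)\n",
               (unsigned long long)mid_start, (unsigned long long)mid_end);

        if (mid_end < end)              /* misaligned tail stays buffered */
                printf("buffered tail: [%llu, %llu)\n",
                       (unsigned long long)mid_end, (unsigned long long)end);
        return 0;
}

Those head and tail pieces are the parts that would still go through the page cache, hence the concern.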
> > > > If you would rather make this drive-by, then you'll have to realize that you are requesting more than simple review from us. You'll have to be content with the pace at which us overloaded maintainers can get to the work.
> > >
> > > I think I just experienced the mailing-list equivalent of the Detroit definition of "drive-by". Good/bad news: you're a terrible shot.
> >
> > The term "drive-by contribution" has a well-understood meaning in the kernel community. If you are unfamiliar with it, I invite you to review the mailing list archives. As always, no one is shooting at you. If anything, the drive-by contribution is aimed at me.
>
> It is a blatant miscategorization here. That you just doubled down on it having relevance in this instance is flagrantly wrong.
>
> Whatever compels you to belittle me and my contributions, just know it is extremely hard to take. Highly unproductive and unprofessional.
>
> Boom, done.

-- 
Jeff Layton <jlayton@xxxxxxxxxx>