On Thu, 2025-08-28 at 04:09 -0400, Mike Snitzer wrote:
> On Wed, Aug 27, 2025 at 09:57:39PM -0400, Chuck Lever wrote:
> > On 8/27/25 7:15 PM, Mike Snitzer wrote:
> > > On Wed, Aug 27, 2025 at 04:56:08PM -0400, Chuck Lever wrote:
> > > > On 8/27/25 3:41 PM, Mike Snitzer wrote:
> > > > > Is your suggestion to, rather than allocate a disjoint single page, borrow the extra page from the end of rq_pages? Just map it into the bvec instead of my extra page?
> > > >
> > > > Yes, the extra page needs to come from rq_pages. But I don't see why it should come from the /end/ of rq_pages.
> > > >
> > > > - Extend the start of the byte range back to make it align with the file's DIO alignment constraint
> > > >
> > > > - Extend the end of the byte range forward to make it align with the file's DIO alignment constraint
> > >
> > > nfsd_analyze_read_dio() does that (start_extra and end_extra).
> > >
> > > > - Fill in the sink buffer's bvec using pages from rq_pages, as usual
> > > >
> > > > - When the I/O is complete, adjust the offset in the first bvec entry forward by setting a non-zero page offset, and adjust the returned count downward to match the requested byte count from the client
> > >
> > > Tried it long ago; such bvec manipulation only works when not using RDMA. When the memory is remote, twiddling a local bvec isn't going to ensure the correct pages have the correct data upon return to the client.
> > >
> > > RDMA is why the pages must be used in-place, and RDMA is also why the extra page needed by this patch (for use as throwaway front-pad for an expanded misaligned DIO READ) must either be allocated _or_ hopefully it can come from rq_pages (after the end of the client-requested READ payload).
> > >
> > > Or am I wrong and simply need to keep learning about NFSD's IO path?
> >
> > You're wrong, not to put a fine point on it.
>
> You didn't even understand me... but firmly believe I'm wrong?
>
> > There's nothing I can think of in the RDMA or RPC/RDMA protocols that mandates that the first page offset must always be zero. Moving data at one address on the server to an entirely different address and alignment on the client is exactly what RDMA is supposed to do.
> >
> > It sounds like an implementation omission because the server's upper layers have never needed it before now. If TCP already handles it, I'm guessing it's going to be straightforward to fix.
>
> I never said that the first page offset must be zero. I said that I already did what you suggested and it didn't work with RDMA. This is recollection from too many months ago now, but: the client will see the correct READ payload _except_ IIRC it is offset by whatever front-pad was added to expand the misaligned DIO, no matter whether rqstp->rq_bvec is updated when the IO completes.
>
> But I'll revisit it again.
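For illustration, the expand-and-trim being described above boils down to arithmetic along these lines. This is just a throwaway userspace sketch: the example values and the names (dio_align and the rest) are placeholders of mine, not code from the patch or from nfsd_analyze_read_dio():

/*
 * Throwaway userspace sketch of the expand-and-trim arithmetic for a
 * misaligned DIO READ.  Example values and names (dio_align,
 * start_extra, end_extra) are placeholders, not NFSD code.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint64_t offset    = 4604;      /* client-requested READ offset */
        uint64_t count     = 8192;      /* client-requested READ length */
        uint64_t dio_align = 4096;      /* file/device DIO alignment */

        /* Expand the requested range outward to alignment boundaries. */
        uint64_t dio_start = offset & ~(dio_align - 1);
        uint64_t dio_end   = (offset + count + dio_align - 1) & ~(dio_align - 1);

        uint64_t start_extra = offset - dio_start;              /* front pad */
        uint64_t end_extra   = dio_end - (offset + count);      /* tail pad */

        printf("issue DIO read: offset=%llu len=%llu (start_extra=%llu, end_extra=%llu)\n",
               (unsigned long long)dio_start,
               (unsigned long long)(dio_end - dio_start),
               (unsigned long long)start_extra,
               (unsigned long long)end_extra);

        /*
         * On completion, the reply would skip the front pad by bumping
         * the first bvec entry's offset and return only the bytes the
         * client asked for.
         */
        printf("trim for reply: first bvec offset += %llu, returned count = %llu\n",
               (unsigned long long)start_extra, (unsigned long long)count);
        return 0;
}

The arithmetic isn't the contentious part; the question in this sub-thread is whether the RPC/RDMA side honors a non-zero offset in that first bvec entry the way the TCP path apparently does.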
> > > > > NFSD using DIO is optional. I thought the point was to get it as an available option so that _others_ could experiment and help categorize the benefits/pitfalls further?
> > > >
> > > > Yes, that is the point. But such experiments lose value if there is no data collection plan to go with them.
> > >
> > > Each user runs something they care about performing well and they measure the result.
> >
> > That assumes the user will continue to use the debug interfaces, and the particular implementation you've proposed, for the rest of time. And that's not my plan at all.
> >
> > If we, in the community, cannot reproduce that result, or cannot understand what has been measured, or the measurement misses part or most of the picture, of what value is that for us to decide whether and how to proceed with promoting the mechanism from debug feature to something with a long-term support lifetime and a documented ABI-stable user interface?
>
> I'll work to put a finer point on how to reproduce, and enumerate the things to look for (representative flamegraphs showing the issue, which I already did at the last Bakeathon).
>
> But I have repeatedly offered that the pathological worst case is a client doing sequential write IO of a file that is 3-4x larger than the NFS server's system memory.
>
> Large-memory systems with 8 or more NVMe devices and fast networks that allow for huge data-ingest capabilities: these are the platforms that showcase MM's dirty-writeback limitations when large sequential IO is initiated from the NFS client and it's able to overrun the NFS server.
>
> In addition, in general DIO requires significantly less memory and CPU; so platforms that have more limited resources (and may have historically struggled) could have a new lease on life if they switch NFSD from buffered to DIO mode.
>
> > > Literally the same thing as has been done for anything in Linux since it all started. Nothing unicorn or bespoke here.
> >
> > So let me ask this another way: what do we need users to measure to give us good-quality information about the page cache behavior and system thrashing behavior you reported?
>
> IO throughput, CPU and memory usage should be monitored over time.
>
> > For example: I can enable direct I/O on NFSD, but my workload is mostly one or two clients doing kernel builds. The latency of NFS READs goes up, but since a kernel build is not I/O bound and the client page caches hide most of the increase, there is very little to show a measured change.
> >
> > So how should I assess and report the impact of NFSD doing direct I/O?
>
> Your underwhelming usage isn't what this patchset is meant to help.
>
> > See -- users are not the only ones who are involved in this experiment; and they will need guidance because we're not providing any documentation for this feature.
>
> Users are not created equal. Major companies like Oracle and Meta _should_ be aware of NFSD's problems with buffered IO. They have internal and external stakeholders that are power users.
>
> Jeff, does Meta ever see NFSD struggle to consistently use NVMe devices? Lumpy performance? Full-blown IO stalls? Lots of NFSD threads hung in D state?

Yes. We're particularly interested in this work for that reason. A lot of the workload is large, streaming writes at the application layer that are only rarely ever read, and quite a bit later when that does happen. This means that the pagecache is pretty useless. My _guess_ is that DIO will help that significantly, though I do still have some concerns about using buffered I/O for the edges of unaligned WRITEs.
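To be concrete about "edges": as I understand the approach, an unaligned WRITE gets carved into a buffered head and/or tail around a DIO-aligned middle, roughly like the throwaway sketch below (the names and the 4k alignment are placeholders of mine, not taken from the patchset):

/*
 * Throwaway sketch of carving an unaligned WRITE into a buffered head,
 * a DIO-aligned middle, and a buffered tail.  Names and the 4k
 * alignment are placeholders, not code from the patchset.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint64_t offset = 1000;         /* client WRITE offset */
        uint64_t count  = 1048576;      /* client WRITE length */
        uint64_t align  = 4096;         /* DIO alignment */

        uint64_t end       = offset + count;
        uint64_t mid_start = (offset + align - 1) & ~(align - 1);
        uint64_t mid_end   = end & ~(align - 1);

        if (mid_start >= mid_end) {
                /* Too small or too misaligned to carve out a DIO middle. */
                printf("buffered:      [%llu, %llu)\n",
                       (unsigned long long)offset, (unsigned long long)end);
                return 0;
        }

        if (offset < mid_start)         /* misaligned head stays buffered */
                printf("buffered head: [%llu, %llu)\n",
                       (unsigned long long)offset, (unsigned long long)mid_start);

        printf("DIO middle:    [%llu, %llu)\n",
               (unsigned long long)mid_start, (unsigned long long)mid_end);

        if (mid_end < end)              /* misaligned tail stays buffered */
                printf("buffered tail: [%llu, %llu)\n",
                       (unsigned long long)mid_end, (unsigned long long)end);
        return 0;
}

Those head and tail pieces are the parts that would still go through the page cache, hence the concern.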
> > > > If you would rather make this drive-by, then you'll have to realize that you are requesting more than simple review from us. You'll have to be content with the pace at which us overloaded maintainers can get to the work.
> > >
> > > I think I just experienced the mailing-list equivalent of the Detroit definition of "drive-by". Good/bad news: you're a terrible shot.
> >
> > The term "drive-by contribution" has a well-understood meaning in the kernel community. If you are unfamiliar with it, I invite you to review the mailing list archives. As always, no one is shooting at you. If anything, the drive-by contribution is aimed at me.
>
> It is a blatant miscategorization here. That you just doubled down on it having relevance in this instance is flagrantly wrong.
>
> Whatever compels you to belittle me and my contributions, just know it is extremely hard to take. Highly unproductive and unprofessional.
>
> Boom, done.

-- 
Jeff Layton <jlayton@xxxxxxxxxx>