On 8/28/25 4:09 AM, Mike Snitzer wrote:
> On Wed, Aug 27, 2025 at 09:57:39PM -0400, Chuck Lever wrote:
>> On 8/27/25 7:15 PM, Mike Snitzer wrote:
>>> On Wed, Aug 27, 2025 at 04:56:08PM -0400, Chuck Lever wrote:
>>>> On 8/27/25 3:41 PM, Mike Snitzer wrote:
>>>>> Is your suggestion to, rather than allocate a disjoint single page,
>>>>> borrow the extra page from the end of rq_pages? Just map it into the
>>>>> bvec instead of my extra page?
>>>>
>>>> Yes, the extra page needs to come from rq_pages. But I don't see why
>>>> it should come from the /end/ of rq_pages.
>>>>
>>>> - Extend the start of the byte range back to make it align with the
>>>>   file's DIO alignment constraint
>>>>
>>>> - Extend the end of the byte range forward to make it align with the
>>>>   file's DIO alignment constraint
>>>
>>> nfsd_analyze_read_dio() does that (start_extra and end_extra).
>>>
>>>> - Fill in the sink buffer's bvec using pages from rq_pages, as usual
>>>>
>>>> - When the I/O is complete, adjust the offset in the first bvec entry
>>>>   forward by setting a non-zero page offset, and adjust the returned
>>>>   count downward to match the requested byte count from the client
>>>
>>> Tried it long ago; such bvec manipulation only works when not using
>>> RDMA. When the memory is remote, twiddling a local bvec isn't going
>>> to ensure the correct pages have the correct data upon return to the
>>> client.
>>>
>>> RDMA is why the pages must be used in-place, and RDMA is also why the
>>> extra page needed by this patch (for use as a throwaway front-pad for
>>> an expanded misaligned DIO READ) must either be allocated _or_,
>>> hopefully, come from rq_pages (after the end of the client-requested
>>> READ payload).
>>>
>>> Or am I wrong and simply need to keep learning about NFSD's IO path?
>>
>> You're wrong, not to put a fine point on it.
>
> You didn't even understand me... but firmly believe I'm wrong?
>
>> There's nothing I can think of in the RDMA or RPC/RDMA protocols that
>> mandates that the first page offset must always be zero. Moving data
>> at one address on the server to an entirely different address and
>> alignment on the client is exactly what RDMA is supposed to do.
>>
>> It sounds like an implementation omission, because the server's upper
>> layers have never needed it before now. If TCP already handles it, I'm
>> guessing it's going to be straightforward to fix.
>
> I never said that the first page offset must be zero. I said that I
> already did what you suggested and it didn't work with RDMA. This is
> recall of too many months ago now, but: the client will see the
> correct READ payload _except_, IIRC, it is offset by whatever
> front-pad was added to expand the misaligned DIO, no matter whether
> rqstp->rq_bvec is updated when the IO completes.
>
> But I'll revisit it again.

For the record, this email thread is the very first time I've heard that you tried the simple approach and that it worked with TCP and not with RDMA. I wish I had known that a while ago.
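To restate what I'm suggesting, here is a rough, untested sketch, not against any particular tree. nfsd_expand_read_dio() and nfsd_trim_read_dio() are hypothetical names; start_extra/end_extra are borrowed loosely from your nfsd_analyze_read_dio() naming. I'm assuming the DIO alignment is a power of two and that the front-pad fits within the first bvec entry:

#include <linux/bvec.h>
#include <linux/math.h>

/* Widen [offset, offset + count) to the file's DIO alignment. */
static void nfsd_expand_read_dio(loff_t offset, size_t count,
                                 unsigned int dio_align,
                                 size_t *start_extra, size_t *end_extra)
{
        /* Extend the start of the byte range back to alignment. */
        *start_extra = offset & (dio_align - 1);

        /* Extend the end of the byte range forward to alignment. */
        *end_extra = round_up(offset + count, dio_align) -
                     (offset + count);
}

/* After the widened DIO read completes, hide the front-pad in place. */
static void nfsd_trim_read_dio(struct bio_vec *first, size_t *ret_count,
                               size_t count, size_t start_extra)
{
        /* Same page, non-zero page offset: no data is copied or moved. */
        first->bv_offset += start_extra;
        first->bv_len -= start_extra;

        /* A short read that ends inside the pad returns nothing. */
        if (*ret_count <= start_extra) {
                *ret_count = 0;
                return;
        }

        /* Return no more than the byte count the client requested. */
        *ret_count -= start_extra;
        if (*ret_count > count)
                *ret_count = count;
}

If the pad can cross a page boundary, the same adjustment would have to walk forward through the bvec array, but the principle is identical.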
>>>>> NFSD using DIO is optional. I thought the point was to get it as an
>>>>> available option so that _others_ could experiment and help
>>>>> categorize the benefits/pitfalls further?
>>>>
>>>> Yes, that is the point. But such experiments lose value if there is
>>>> no data collection plan to go with them.
>>>
>>> Each user runs something they care about performing well and they
>>> measure the result.
>>
>> That assumes the user will continue to use the debug interfaces, and
>> the particular implementation you've proposed, for the rest of time.
>> And that's not my plan at all.
>>
>> If we, in the community, cannot reproduce that result, or cannot
>> understand what has been measured, or the measurement misses part or
>> most of the picture, of what value is that for us to decide whether
>> and how to proceed with promoting the mechanism from a debug feature
>> to something with a long-term support lifetime and a documented
>> ABI-stable user interface?
>
> I'll work to put a finer point on how to reproduce, and enumerate the
> things to look for (representative flamegraphs showing the issue,
> which I already did at the last Bakeathon).
>
> But I have repeatedly offered that the pathological worst case is a
> client doing sequential write IO to a file that is 3-4x larger than
> the NFS server's system memory.
>
> Large-memory systems with 8 or more NVMe devices, and fast networks
> that allow for huge data ingest capabilities: these are the platforms
> that showcase MM's dirty writeback limitations when large sequential
> IO is initiated from the NFS client and it's able to overrun the NFS
> server.
>
> In addition, DIO in general requires significantly less memory and
> CPU; so platforms that have more limited resources (and may have
> historically struggled) could have a new lease on life if they switch
> NFSD from buffered to DIO mode.
>
>>> Literally the same thing as has been done for anything in Linux since
>>> it all started. Nothing unicorn or bespoke here.
>>
>> So let me ask this another way: What do we need users to measure to
>> give us good-quality information about the page cache behavior and
>> system thrashing behavior you reported?
>
> IO throughput, CPU and memory usage should be monitored over time.

My point is that users need a recipe for what to monitor, i.e., clear, specific instructions, and that recipe needs to be applied the same way by every experimenter (because, as I say below, we too are collecting data about the users' experience with this feature).

Can you provide a few specific instructions that experimenters can follow on how to monitor and report these metrics? Simply copy-paste what you and your test team have been using to observe system behavior.

>> For example: I can enable direct I/O on NFSD, but my workload is
>> mostly one or two clients doing kernel builds. The latency of NFS
>> READs goes up, but since a kernel build is not I/O bound and the
>> client page caches hide most of the increase, there is very little to
>> show a measured change.
>>
>> So how should I assess and report the impact of NFSD doing direct I/O?
>
> Your underwhelming usage isn't what this patchset is meant to help.

Again, my point is: how are users who try the debug option going to know whether their workload is of interest? We're not providing such documentation. Users are just going to turn this on and try it.

But consider that one of our options is to make direct I/O the default. We need to know how smaller workloads might be impacted by this new default.

The maintainers are part of this experiment as much as potential users are. We need to collect data to understand how to make this work part of a first-class administrative API.

>> See -- users are not the only ones who are involved in this
>> experiment, and they will need guidance because we're not providing
>> any documentation for this feature.
>
> Users are not created equal. Major companies like Oracle and Meta
> _should_ be aware of NFSD's problems with buffered IO. They have
> internal and external stakeholders that are power users.
>
> Jeff, does Meta ever see NFSD struggle to consistently use NVMe
> devices? Lumpy performance? Full-blown IO stalls? Lots of NFSD
> threads hung in D state?

I have no idea why you think I don't understand those facts, but you are still missing my point. NFSD maintainers are part of this experimental setup as much as users are. We need good information to decide how to shape this feature going forward. Let's build a plan to get that information.

>>>> If you would rather make this drive-by, then you'll have to realize
>>>> that you are requesting more than simple review from us. You'll have
>>>> to be content with the pace at which we overloaded maintainers can
>>>> get to the work.
>>>
>>> I think I just experienced the mailing-list equivalent of the Detroit
>>> definition of "drive-by". Good/bad news: you're a terrible shot.
>>
>> The term "drive-by contribution" has a well-understood meaning in the
>> kernel community. If you are unfamiliar with it, I invite you to
>> review the mailing list archives. As always, no one is shooting at
>> you. If anything, the drive-by contribution is aimed at me.
>
> It is a blatant miscategorization here. That you just doubled down
> on it having relevance in this instance is flagrantly wrong.

First of all, "drive-by contribution" is not meant to be pejorative. It is what it is: someone throws you a patch and then goes away. It is a fact of life for open source maintainers.

I'm simply trying to get a sense of your commitment to finishing this work and staying with it going forward. You keep telling me "well, I have other things to do." What do you think that indicates to a busy maintainer? That you will be around to finish the feature? Or rather that your employer is going to yank you onto other projects, leaving us stuck with the work of finishing it?

You have rejected one or more reasonable review comments and maintainer requests. That doesn't convince me you are willing to collaborate with us on maintaining the feature.

My biggest concern here is how much more work maintaining NFSD will be once this change goes in. It's a major feature and a significant change. It will require a lot of code massage in the short and medium term, just as a start.

Maybe I'm assuming that you, as a kernel maintainer yourself, already understand that this mindset and this calculus are part of my email replies. So, I'm being frank. I'm not trying to offend or belittle.

> Whatever compels you to belittle me and my contributions, just know
> it is extremely hard to take. Highly unproductive and unprofessional.

I can't control how you understand my emails. If you choose to be offended even though I'm trying to have an honest discussion, there's nothing I can do about it.

I can't control /if/ you understand my emails. I'm repeating myself quite a bit here because a lot of what I say sails past you. You read things into what I say that I don't intend, and you miss a lot of what I did intend. There's nothing I can do about that either.

I can't control your unproductive and unprofessional behavior. You keep rejecting valid review comments and maintainer requests with comments like "I don't feel like it" and "your reason for wanting this change is invalid" and "you are wasting my time". Again, this suggests to me that maintaining this feature will be a lot more work than it could be, and that is an ongoing concern.

If there's something that offends you, the professional response on your part is to point it out to me in the moment rather than trying to spit back.
Because most likely, I wasn't intending to offend at all. If you keep that in mind while reading my emails, you might have an easier time of it.

--
Chuck Lever