Re: [RFC PATCH 1/2] NFSD: fix misaligned DIO READ to not use a start_extra_page, exposes rpcrdma bug?

On 9/4/25 12:33 PM, Mike Snitzer wrote:
> On Thu, Sep 04, 2025 at 12:10:00PM -0400, Chuck Lever wrote:
>> On 9/4/25 10:42 AM, Mike Snitzer wrote:
>>> On Tue, Sep 02, 2025 at 05:27:11PM -0400, Mike Snitzer wrote:
>>>> On Tue, Sep 02, 2025 at 05:16:10PM -0400, Chuck Lever wrote:
>>>>>
>>>>> I am testing with a physically separate client and server, so I believe
>>>>> that LOCALIO is not in play. I do see WRITEs. And other workloads (in
>>>>> particular "fsx -Z <fname>") show READ traffic and I'm getting the
>>>>> new trace point to fire quite a bit, and it is showing misaligned
>>>>> READ requests. So it has something to do with dt.
>>>>
>>>> OK, yeah I figured you weren't doing loopback mount, only thing that
>>>> came to mind for you not seeing READ like expected.  I haven't had any
>>>> problems with dt not driving READs to NFSD...
>>>>
>>>> You'll certainly need to see READs in order for NFSD's new misaligned
>>>> DIO READ handling to get tested.
>>>
>>> I was doing some additional testing of the v9 changes last night and
>>> realized why you weren't seeing any READs come through to NFSD:
>>> "flags=direct" must be added to the dt commandline. Otherwise it'll
>>> use buffered IO at the client and the READ will be serviced by the
>>> client's page cache.
>>>
>>> But like I said in another reply: when I just use v3 and RDMA (without
>>> the intermediary of flexfiles at the client) I'm not able to see the
>>> data mismatch with dt...
>>>
>>> So while its unlikely: does adding "flags=direct" cause dt to fail
>>> when NFSD handles the misaligned DIO READ?
>> Applied v9.
>>
>> Multiple successful runs, no failures after adding "flags=direct".
>> Some excerpts from the last run show the server is seeing NFS
>> READs now:
>>
>> Filesystem options:
>>   rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,
>>   fatal_neterrors=none,proto=rdma,port=20049,timeo=600,retrans=2,
>>   sec=sys,mountaddr=192.168.2.55,mountvers=3,mountproto=tcp,
>>   local_lock=none,addr=192.168.2.55
>>
>> nfsd-1342  [004]   463.832928: nfsd_analyze_read_dio: xid=0x89784d89
>> fh_hash=0x024204eb offset=0 len=47008 start=0+0 middle=0+47008 end=47008+96
>> nfsd-1342  [004]   463.833105: nfsd_analyze_read_dio: xid=0x8a784d89
>> fh_hash=0x024204eb offset=47008 len=47008 start=46592+416
>> middle=47008+47008 end=94016+192
>> nfsd-1342  [004]   463.833185: nfsd_analyze_read_dio: xid=0x8b784d89
>> fh_hash=0x024204eb offset=94016 len=47008 start=93696+320
>> middle=94016+47008 end=141024+288
> 
> OK, thanks for testing!
> 
> So yeah, patch 9/9 of v9 does workaround the problem relative to
> flexfiles+RDMA (though patch header should really be updated to add
> "flags=direct" to the dt command line):
> https://lore.kernel.org/linux-nfs/20250903205121.41380-10-snitzer@xxxxxxxxxx/
> 
> Is it a tolerable intermediate workaround you'd be OK with?  To be
> clear, I'm continuing to work the problem (and will be discussing it
> with Trond)... but its a tricky one for sure.

1/9 through 4/9 are merge-ready. Though I'm thinking maybe the DIRECT
support should remain "ENOTSUPP" for the moment -- just add DONTCACHE
and BUFFERED for now.

For 5/9, I would like to continue improving that code. It will be easier
and less risky if we do that before there are non-developer users of
that code (i.e., before it is merged). I will spend some time on it
to give some detailed feedback.

6/9, as we've discussed, is risky until we can gain more confidence that
managing the unaligned ends via a buffered write is not going to result
in corruption. So, not merge-ready.

7/9: I think we need to be smarter about the trace points. There are
some exceptions (like where NFSD_IO_DIRECT is turned off for an I/O)
that need either a trace point or a counter. The code paths are likely
to change anyway as they are polished. So, I don't plan to merge at this
time.

8/9 will need to be rewritten as the code evolves. We can wait to merge
that.

9/9: I would rather wait for thorough root cause analysis. It doesn't
make sense to me that picking the end page rather than the first page
should make any difference at all. I like to have a little more meat on
the rationale bone before merging fixes.

And whatever is found, it needs to be squashed into 5/9.

The "dt" reproducer is very low profile -- less than 20 operations on
the wire for the non-pNFS case. IMO grabbing a network capture (on
RoCE) would be helpful.
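
For anyone reading the nfsd_analyze_read_dio excerpts quoted above, the
start/middle/end fields can be reproduced with simple alignment
arithmetic. The following is only a userspace sketch, not the NFSD code:
the 512-byte alignment is an assumption inferred from the trace values,
the field reading ("offset+bytes" per segment) is my interpretation of
the output, and the function names are hypothetical.

```python
# Userspace illustration only; not taken from the NFSD patches.
# ASSUMPTION: DIO alignment is 512 bytes (inferred from the trace values).
DIO_ALIGN = 512


def split_misaligned_read(offset, length, align=DIO_ALIGN):
    """Split a READ into (start, middle, end) segments.

    Each segment is an (offset, byte_count) pair:
      start  - aligned-down offset and the extra bytes before the request
      middle - the range the client actually asked for
      end    - end of the request and the padding up to the next boundary
    """
    start_off = offset - (offset % align)   # round the start down
    start_pad = offset - start_off          # bytes read ahead of the request
    end_off = offset + length               # first byte past the request
    end_pad = (-end_off) % align            # bytes needed to reach alignment
    return (start_off, start_pad), (offset, length), (end_off, end_pad)


def format_like_trace(offset, length):
    """Render the segments in the same 'off+len' style as the trace lines."""
    start, middle, end = split_misaligned_read(offset, length)
    return "start=%d+%d middle=%d+%d end=%d+%d" % (start + middle + end)


if __name__ == "__main__":
    # The three READs shown in the quoted trace output.
    for off in (0, 47008, 94016):
        print("offset=%d len=47008 %s" % (off, format_like_trace(off, 47008)))
```

With those assumptions, the three 47008-byte READs at offsets 0, 47008
and 94016 yield the same start/middle/end values as the trace lines,
which suggests the expanded range is always a whole number of 512-byte
blocks.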


-- 
Chuck Lever



