On 9/4/25 12:33 PM, Mike Snitzer wrote:
> On Thu, Sep 04, 2025 at 12:10:00PM -0400, Chuck Lever wrote:
>> On 9/4/25 10:42 AM, Mike Snitzer wrote:
>>> On Tue, Sep 02, 2025 at 05:27:11PM -0400, Mike Snitzer wrote:
>>>> On Tue, Sep 02, 2025 at 05:16:10PM -0400, Chuck Lever wrote:
>>>>>
>>>>> I am testing with a physically separate client and server, so I believe
>>>>> that LOCALIO is not in play. I do see WRITEs. And other workloads (in
>>>>> particular "fsx -Z <fname>") show READ traffic and I'm getting the
>>>>> new trace point to fire quite a bit, and it is showing misaligned
>>>>> READ requests. So it has something to do with dt.
>>>>
>>>> OK, yeah I figured you weren't doing loopback mount, only thing that
>>>> came to mind for you not seeing READ like expected. I haven't had any
>>>> problems with dt not driving READs to NFSD...
>>>>
>>>> You'll certainly need to see READs in order for NFSD's new misaligned
>>>> DIO READ handling to get tested.
>>>
>>> I was doing some additional testing of the v9 changes last night and
>>> realized why you weren't seeing any READs come through to NFSD:
>>> "flags=direct" must be added to the dt commandline. Otherwise it'll
>>> use buffered IO at the client and the READ will be serviced by the
>>> client's page cache.
>>>
>>> But like I said in another reply: when I just use v3 and RDMA (without
>>> the intermediary of flexfiles at the client) I'm not able to see the
>>> data mismatch with dt...
>>>
>>> So while its unlikely: does adding "flags=direct" cause dt to fail
>>> when NFSD handles the misaligned DIO READ?
>> Applied v9.
>>
>> Multiple successful runs, no failures after adding "flags=direct".
>> Some excerpts from the last run show the server is seeing NFS
>> READs now:
>>
>> Filesystem options:
>> rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,
>> fatal_neterrors=none,proto=rdma,port=20049,timeo=600,retrans=2,
>> sec=sys,mountaddr=192.168.2.55,mountvers=3,mountproto=tcp,
>> local_lock=none,addr=192.168.2.55
>>
>> nfsd-1342 [004] 463.832928: nfsd_analyze_read_dio: xid=0x89784d89
>> fh_hash=0x024204eb offset=0 len=47008 start=0+0 middle=0+47008 end=47008+96
>> nfsd-1342 [004] 463.833105: nfsd_analyze_read_dio: xid=0x8a784d89
>> fh_hash=0x024204eb offset=47008 len=47008 start=46592+416
>> middle=47008+47008 end=94016+192
>> nfsd-1342 [004] 463.833185: nfsd_analyze_read_dio: xid=0x8b784d89
>> fh_hash=0x024204eb offset=94016 len=47008 start=93696+320
>> middle=94016+47008 end=141024+288
>
> OK, thanks for testing!
>
> So yeah, patch 9/9 of v9 does workaround the problem relative to
> flexfiles+RDMA (though patch header should really be updated to add
> "flags=direct" to the dt command line):
> https://lore.kernel.org/linux-nfs/20250903205121.41380-10-snitzer@xxxxxxxxxx/
>
> Is it a tolerable intermediate workaround you'd be OK with? To be
> clear, I'm continuing to work the problem (and will be discussing it
> with Trond)... but its a tricky one for sure.

1/9 through 4/9 are merge-ready. Though I'm thinking maybe the DIRECT
support should remain "ENOTSUPP" for the moment -- just add DONTCACHE
and BUFFERED for now.

For 5/9, I would like to continue improving that code. It will be
easier and less risky if we do that before there are non-developer
users of that code (ie, done before it is merged). I will spend some
time on it to give some detailed feedback.

6/9, as we've discussed, is risky until we can gain more confidence
that managing the unaligned ends via a buffered write is not going to
result in corruption. So, not merge-ready.

7/9: I think we need to be smarter about the trace points.
There are some exceptions (like where NFSD_IO_DIRECT is turned off for
an I/O) that need either a trace point or a counter. The code paths
are likely to change anyway as they are polished. So, I don't plan to
merge at this time.

8/9 will need to be rewritten as the code evolves. We can wait to
merge that.

9/9: I would rather wait for thorough root cause analysis. It doesn't
make sense to me that picking the end page rather than the first page
should make any difference at all. I like to have a little more meat
on the rationale bone before merging fixes. And whatever is found, it
needs to be squashed into 5/9.

The "dt" reproducer is very low profile -- less than 20 operations on
the wire for the non-pNFS case. IMO grabbing a network capture (on
RoCE) would be helpful.

-- 
Chuck Lever
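
For anyone reading the nfsd_analyze_read_dio lines quoted above, here is
a rough user-space sketch of the arithmetic they appear to show. This is
not code from the series. The struct and function names are hypothetical,
the field meanings are inferred from the numbers in the trace, and the
512-byte alignment is an assumption chosen because it reproduces the
quoted values; the real alignment comes from the underlying file system
and device.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical illustration type; not the structure used by the patches. */
struct dio_segments {
	uint64_t start_offset;  /* requested offset rounded down to the alignment */
	uint64_t start_pad;     /* extra bytes read before the requested offset */
	uint64_t middle_offset; /* requested offset, as sent by the client */
	uint64_t middle_len;    /* requested length, as sent by the client */
	uint64_t end_offset;    /* first byte past the requested range */
	uint64_t end_pad;       /* extra bytes read to reach the next boundary */
};

/* Split a possibly misaligned READ into start/middle/end pieces.
 * 'align' must be a power of two (assumed to be 512 in the usage note). */
struct dio_segments split_misaligned_read(uint64_t offset, uint64_t len,
					  uint64_t align)
{
	struct dio_segments s;
	uint64_t mask, end;

	assert(align != 0 && (align & (align - 1)) == 0);
	mask = align - 1;
	end = offset + len;

	s.start_offset = offset & ~mask;
	s.start_pad = offset - s.start_offset;
	s.middle_offset = offset;
	s.middle_len = len;
	s.end_offset = end;
	s.end_pad = ((end + mask) & ~mask) - end;
	return s;
}

/* Print the pieces in roughly the same layout as the quoted trace lines. */
void print_segments(const struct dio_segments *s)
{
	printf("offset=%" PRIu64 " len=%" PRIu64
	       " start=%" PRIu64 "+%" PRIu64
	       " middle=%" PRIu64 "+%" PRIu64
	       " end=%" PRIu64 "+%" PRIu64 "\n",
	       s->middle_offset, s->middle_len,
	       s->start_offset, s->start_pad,
	       s->middle_offset, s->middle_len,
	       s->end_offset, s->end_pad);
}
```

Under those assumptions, split_misaligned_read(47008, 47008, 512) gives
start=46592+416 and end=94016+192, which matches the second trace line.
The other two quoted lines work out the same way.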