On Wed, Jun 11, 2025 at 04:29:58PM -0400, Jeff Layton wrote: > On Wed, 2025-06-11 at 15:18 -0400, Mike Snitzer wrote: > > On Wed, Jun 11, 2025 at 10:31:20AM -0400, Chuck Lever wrote: > > > On 6/10/25 4:57 PM, Mike Snitzer wrote: > > > > Add 'enable-dontcache' to NFSD's debugfs interface so that: Any data > > > > read or written by NFSD will either not be cached (thanks to O_DIRECT) > > > > or will be removed from the page cache upon completion (DONTCACHE). > > > > > > I thought we were going to do two switches: One for reads and one for > > > writes? I could be misremembering. > > > > We did discuss the possibility of doing that. Still can-do if that's > > what you'd prefer. > > > > Having them as separate controls in debugfs is fine for > experimentation's sake, but I imagine we'll need to be all-in one way > or the other with a real interface. > > I think if we can crack the problem of receiving WRITE payloads into an > already-aligned buffer, then that becomes much more feasible. I think > that's a solveable problem. You'd immediately be my hero! Let's get into it: In a previously reply to this thread you aptly detailed what I found out the hard way (with too much xdr_buf code review and tracing): On Wed, Jun 11, 2025 at 08:55:20AM -0400, Jeff Layton wrote: > > > > NFSD will also set RWF_DIRECT if a WRITE's IO is aligned relative to > > DIO alignment (both page and disk alignment). This works quite well > > for aligned WRITE IO with SUNRPC's RDMA transport as-is, because it > > maps the WRITE payload into aligned pages. But more work is needed to > > be able to leverage O_DIRECT when SUNRPC's regular TCP transport is > > used. I spent quite a bit of time analyzing the existing xdr_buf code > > and NFSD's use of it. Unfortunately, the WRITE payload gets stored in > > misaligned pages such that O_DIRECT isn't possible without a copy > > (completely defeating the point). I'll reply to this cover letter to > > start a subthread to discuss how best to deal with misaligned write > > IO (by association with Hammerspace, I'm most interested in NFS v3). > > > > Tricky problem. svc_tcp_recvfrom() just slurps the whole RPC into the > rq_pages array. To get alignment right, you'd probably have to do the > receive in a much more piecemeal way. > > Basically, you'd need to decode as you receive chunks of the message, > and look out for WRITEs, and then set it up so that their payloads are > received with proper alignment. 1) Yes, and while I arrived at the same exact conclusion I was left with dread about the potential for "breaking too many eggs to make that tasty omelette". If you (or others) see a way forward to have SUNRPC TCP's XDR receive "inline" decode (rather than have the 2 stage process you covered above) that'd be fantastic. Seems like really old tech-debt in SUNRPC from a time when such care about alignment of WRITE payload pages was completely off engineers' collective radar (owed to NFSD only using buffered IO I assume?). 2) One hack that I verified to work for READ and WRITE IO on my particular TCP testbed was to front-pad the first "head" page of the xdr_buf such that the WRITE payload started at the 2nd page of rq_pages. So that looked like this hack for my usage: diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c index 8fc5b2b2d806..cf082a265261 100644 --- a/net/sunrpc/svc_xprt.c +++ b/net/sunrpc/svc_xprt.c @@ -676,7 +676,9 @@ static bool svc_alloc_arg(struct svc_rqst *rqstp) /* Make arg->head point to first page and arg->pages point to rest */ arg->head[0].iov_base = page_address(rqstp->rq_pages[0]); - arg->head[0].iov_len = PAGE_SIZE; + // FIXME: front-pad optimized to align TCP's WRITE payload + // but may not be enough for other operations? + arg->head[0].iov_len = 148; arg->pages = rqstp->rq_pages + 1; arg->page_base = 0; /* save at least one page for response */ That gut "but may not be enough for other operations?" comment proved to be prophetic. Sadly it went on to fail spectacularly for other ops (specifically READDIR and READDIRPLUS, probably others would too) because xdr_inline_decode() _really_ doesn't like going beyond the end of the xdr_buf's inline "head" page. It could be that even if xdr_inline_decode() et al was "fixed" (which isn't for the faint of heart given xdr_buf's more complex nature) there will likely be other mole(s) that pop up. And in addition, we'd be wasting space in the xdr_buf's head page (PAGE_SIZE-frontpad). So I moved on from trying to see this frontpad hack through to completion. 3) Lastly, for completeness, I also mentioned briefly in a previous recent reply: On Wed, Jun 11, 2025 at 04:51:03PM -0400, Mike Snitzer wrote: > On Wed, Jun 11, 2025 at 11:44:29AM -0400, Jeff Layton wrote: > > > In any case, for now at least, unless you're using RDMA, it's going to > > end up falling back to buffered writes everywhere. The data is almost > > never going to be properly aligned coming in off the wire. That might > > be fixable though. > > Ben Coddington mentioned to me that soft-iwarp would allow use of RDMA > over TCP to workaround SUNRPC TCP's XDR handling always storing the > write payload in misaligned IO. But that's purely a stop-gap > workaround, which needs testing (to see if soft-iwap negates the win > of using O_DIRECT, etc). (Ab)using soft-iwarp as the basis for easily getting page aligned TCP WRITE payloads seems pretty gross given we are chasing utmost performance, etc. All said, I welcome your sage advice and help on this effort to DIO-align SUNRPC TCP's WRITE payload pages. Thanks, Mike