On 12 Jun 2025, at 9:28, Chuck Lever wrote:

> On 6/12/25 6:28 AM, Jeff Layton wrote:
>> On Wed, 2025-06-11 at 17:36 -0400, Mike Snitzer wrote:
>>> On Wed, Jun 11, 2025 at 04:29:58PM -0400, Jeff Layton wrote:
>>>> On Wed, 2025-06-11 at 15:18 -0400, Mike Snitzer wrote:
>>>>> On Wed, Jun 11, 2025 at 10:31:20AM -0400, Chuck Lever wrote:
>>>>>> On 6/10/25 4:57 PM, Mike Snitzer wrote:
>>>>>>> Add 'enable-dontcache' to NFSD's debugfs interface so that any data
>>>>>>> read or written by NFSD will either not be cached (thanks to O_DIRECT)
>>>>>>> or will be removed from the page cache upon completion (DONTCACHE).
>>>>>>
>>>>>> I thought we were going to do two switches: one for reads and one for
>>>>>> writes? I could be misremembering.
>>>>>
>>>>> We did discuss the possibility of doing that. Still can-do if that's
>>>>> what you'd prefer.
>>>>>
>>>>
>>>> Having them as separate controls in debugfs is fine for
>>>> experimentation's sake, but I imagine we'll need to be all-in one way
>>>> or the other with a real interface.
>>>>
>>>> I think if we can crack the problem of receiving WRITE payloads into an
>>>> already-aligned buffer, then that becomes much more feasible. I think
>>>> that's a solvable problem.
>>>
>>> You'd immediately be my hero! Let's get into it:
>>>
>>> In a previous reply to this thread you aptly detailed what I found
>>> out the hard way (with too much xdr_buf code review and tracing):
>>>
>>> On Wed, Jun 11, 2025 at 08:55:20AM -0400, Jeff Layton wrote:
>>>>>
>>>>> NFSD will also set RWF_DIRECT if a WRITE's IO is aligned relative to
>>>>> DIO alignment (both page and disk alignment). This works quite well
>>>>> for aligned WRITE IO with SUNRPC's RDMA transport as-is, because it
>>>>> maps the WRITE payload into aligned pages. But more work is needed to
>>>>> be able to leverage O_DIRECT when SUNRPC's regular TCP transport is
>>>>> used. I spent quite a bit of time analyzing the existing xdr_buf code
>>>>> and NFSD's use of it.
>>>>> Unfortunately, the WRITE payload gets stored in
>>>>> misaligned pages such that O_DIRECT isn't possible without a copy
>>>>> (completely defeating the point). I'll reply to this cover letter to
>>>>> start a subthread to discuss how best to deal with misaligned write
>>>>> IO (by association with Hammerspace, I'm most interested in NFS v3).
>>>>>
>>>>
>>>> Tricky problem. svc_tcp_recvfrom() just slurps the whole RPC into the
>>>> rq_pages array. To get alignment right, you'd probably have to do the
>>>> receive in a much more piecemeal way.
>>>>
>>>> Basically, you'd need to decode as you receive chunks of the message,
>>>> look out for WRITEs, and then set it up so that their payloads are
>>>> received with proper alignment.
>>>
>>> 1)
>>> Yes, and while I arrived at the same exact conclusion, I was left with
>>> dread about the potential for "breaking too many eggs to make that
>>> tasty omelette".
>>>
>>> If you (or others) see a way forward to have SUNRPC TCP's XDR receive
>>> decode "inline" (rather than the 2-stage process you covered
>>> above) that'd be fantastic. It seems like really old tech debt in SUNRPC
>>> from a time when such care about the alignment of WRITE payload pages was
>>> completely off engineers' collective radar (owed to NFSD only using
>>> buffered IO, I assume?).
>>>
>>> 2)
>>> One hack that I verified to work for READ and WRITE IO on my
>>> particular TCP testbed was to front-pad the first "head" page of the
>>> xdr_buf such that the WRITE payload started at the 2nd page of
>>> rq_pages.
>>> So that looked like this hack for my usage:
>>>
>>> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
>>> index 8fc5b2b2d806..cf082a265261 100644
>>> --- a/net/sunrpc/svc_xprt.c
>>> +++ b/net/sunrpc/svc_xprt.c
>>> @@ -676,7 +676,9 @@ static bool svc_alloc_arg(struct svc_rqst *rqstp)
>>>
>>>  	/* Make arg->head point to first page and arg->pages point to rest */
>>>  	arg->head[0].iov_base = page_address(rqstp->rq_pages[0]);
>>> -	arg->head[0].iov_len = PAGE_SIZE;
>>> +	// FIXME: front-pad optimized to align TCP's WRITE payload
>>> +	// but may not be enough for other operations?
>>> +	arg->head[0].iov_len = 148;
>>>  	arg->pages = rqstp->rq_pages + 1;
>>>  	arg->page_base = 0;
>>>  	/* save at least one page for response */
>>>
>>> That gut "but may not be enough for other operations?" comment proved
>>> to be prophetic.
>>>
>>> Sadly it went on to fail spectacularly for other ops (specifically
>>> READDIR and READDIRPLUS; probably others would too) because
>>> xdr_inline_decode() _really_ doesn't like going beyond the end of the
>>> xdr_buf's inline "head" page. Even if xdr_inline_decode() et al.
>>> were "fixed" (which isn't for the faint of heart given xdr_buf's more
>>> complex nature), other moles would likely pop up. In addition, we'd
>>> be wasting space in the xdr_buf's head page (PAGE_SIZE - frontpad).
>>> So I moved on from trying to see this frontpad hack through to
>>> completion.
>>>
>>> 3)
>>> Lastly, for completeness, I also mentioned briefly in a recent reply:
>>>
>>> On Wed, Jun 11, 2025 at 04:51:03PM -0400, Mike Snitzer wrote:
>>>> On Wed, Jun 11, 2025 at 11:44:29AM -0400, Jeff Layton wrote:
>>>>
>>>>> In any case, for now at least, unless you're using RDMA, it's going to
>>>>> end up falling back to buffered writes everywhere. The data is almost
>>>>> never going to be properly aligned coming in off the wire. That might
>>>>> be fixable though.
>>>>
>>>> Ben Coddington mentioned to me that soft-iwarp would allow use of RDMA
>>>> over TCP to work around SUNRPC TCP's XDR handling always storing the
>>>> write payload in misaligned pages. But that's purely a stop-gap
>>>> workaround, which needs testing (to see if soft-iwarp negates the win
>>>> of using O_DIRECT, etc).
>>>
>>> (Ab)using soft-iwarp as the basis for easily getting page-aligned TCP
>>> WRITE payloads seems pretty gross given we are chasing utmost
>>> performance, etc.
>>>
>>> All said, I welcome your sage advice and help on this effort to
>>> DIO-align SUNRPC TCP's WRITE payload pages.
>>>
>>> Thanks,
>>> Mike
>>
>> (Sent this to Mike only by accident yesterday -- resending to the full
>> list now)
>>
>> I've been looking over the code today. Basically, I think we need to
>> have svc_tcp_recvfrom() receive in phases. At a high level:
>>
>> 1/ Receive the record marker (just like it does today).
>>
>> 2/ Receive enough for the RPC header and then decode it.
>>
>> 3/ Use the RPC program and version from the decoded header to look up
>> the svc_program. Add an optional pg_tcp_recvfrom callback to that
>> structure that will receive the rest of the data into the buffer. If
>> pg_tcp_recvfrom isn't set, then just call svc_tcp_read_msg() like we do
>> today.
>
> The layering violations here are mind-blowing.

What's already been mentioned elsewhere, but not yet here: the
transmitter could always just tell the receiver where the data is.
We'd need an NFS v3.1, and an extension for v4.2?

Pot stirred,
Ben