On Wed, 2025-06-11 at 17:36 -0400, Mike Snitzer wrote:
> On Wed, Jun 11, 2025 at 04:29:58PM -0400, Jeff Layton wrote:
> > On Wed, 2025-06-11 at 15:18 -0400, Mike Snitzer wrote:
> > > On Wed, Jun 11, 2025 at 10:31:20AM -0400, Chuck Lever wrote:
> > > > On 6/10/25 4:57 PM, Mike Snitzer wrote:
> > > > > Add 'enable-dontcache' to NFSD's debugfs interface so that: Any data
> > > > > read or written by NFSD will either not be cached (thanks to O_DIRECT)
> > > > > or will be removed from the page cache upon completion (DONTCACHE).
> > > >
> > > > I thought we were going to do two switches: One for reads and one for
> > > > writes? I could be misremembering.
> > >
> > > We did discuss the possibility of doing that. Still can-do if that's
> > > what you'd prefer.
> >
> > Having them as separate controls in debugfs is fine for
> > experimentation's sake, but I imagine we'll need to be all-in one way
> > or the other with a real interface.
> >
> > I think if we can crack the problem of receiving WRITE payloads into an
> > already-aligned buffer, then that becomes much more feasible. I think
> > that's a solvable problem.
>
> You'd immediately be my hero! Let's get into it:
>
> In a previous reply to this thread you aptly detailed what I found
> out the hard way (with too much xdr_buf code review and tracing):
>
> On Wed, Jun 11, 2025 at 08:55:20AM -0400, Jeff Layton wrote:
> > > NFSD will also set RWF_DIRECT if a WRITE's IO is aligned relative to
> > > DIO alignment (both page and disk alignment). This works quite well
> > > for aligned WRITE IO with SUNRPC's RDMA transport as-is, because it
> > > maps the WRITE payload into aligned pages. But more work is needed to
> > > be able to leverage O_DIRECT when SUNRPC's regular TCP transport is
> > > used. I spent quite a bit of time analyzing the existing xdr_buf code
> > > and NFSD's use of it. Unfortunately, the WRITE payload gets stored in
> > > misaligned pages such that O_DIRECT isn't possible without a copy
> > > (completely defeating the point). I'll reply to this cover letter to
> > > start a subthread to discuss how best to deal with misaligned write
> > > IO (by association with Hammerspace, I'm most interested in NFS v3).
> >
> > Tricky problem. svc_tcp_recvfrom() just slurps the whole RPC into the
> > rq_pages array. To get alignment right, you'd probably have to do the
> > receive in a much more piecemeal way.
> >
> > Basically, you'd need to decode as you receive chunks of the message,
> > and look out for WRITEs, and then set it up so that their payloads are
> > received with proper alignment.
>
> 1)
> Yes, and while I arrived at the same exact conclusion I was left with
> dread about the potential for "breaking too many eggs to make that
> tasty omelette".
>
> If you (or others) see a way forward to have SUNRPC TCP's XDR receive
> "inline" decode (rather than have the 2 stage process you covered
> above) that'd be fantastic. Seems like really old tech-debt in SUNRPC
> from a time when such care about alignment of WRITE payload pages was
> completely off engineers' collective radar (owed to NFSD only using
> buffered IO I assume?).
>
> 2)
> One hack that I verified to work for READ and WRITE IO on my
> particular TCP testbed was to front-pad the first "head" page of the
> xdr_buf such that the WRITE payload started at the 2nd page of
> rq_pages.
> So that looked like this hack for my usage:
>
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index 8fc5b2b2d806..cf082a265261 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -676,7 +676,9 @@ static bool svc_alloc_arg(struct svc_rqst *rqstp)
>
>  	/* Make arg->head point to first page and arg->pages point to rest */
>  	arg->head[0].iov_base = page_address(rqstp->rq_pages[0]);
> -	arg->head[0].iov_len = PAGE_SIZE;
> +	// FIXME: front-pad optimized to align TCP's WRITE payload
> +	// but may not be enough for other operations?
> +	arg->head[0].iov_len = 148;
>  	arg->pages = rqstp->rq_pages + 1;
>  	arg->page_base = 0;
>  	/* save at least one page for response */
>
> That gut "but may not be enough for other operations?" comment proved
> to be prophetic.
>
> Sadly it went on to fail spectacularly for other ops (specifically
> READDIR and READDIRPLUS, probably others would too) because
> xdr_inline_decode() _really_ doesn't like going beyond the end of the
> xdr_buf's inline "head" page. It could be that even if
> xdr_inline_decode() et al were "fixed" (which isn't for the faint of
> heart given xdr_buf's more complex nature) there would likely be other
> mole(s) that pop up. And in addition, we'd be wasting space in the
> xdr_buf's head page (PAGE_SIZE - frontpad). So I moved on from trying
> to see this frontpad hack through to completion.
>
> 3)
> Lastly, for completeness, I also mentioned briefly in a previous
> recent reply:
>
> On Wed, Jun 11, 2025 at 04:51:03PM -0400, Mike Snitzer wrote:
> > On Wed, Jun 11, 2025 at 11:44:29AM -0400, Jeff Layton wrote:
> > > In any case, for now at least, unless you're using RDMA, it's going to
> > > end up falling back to buffered writes everywhere. The data is almost
> > > never going to be properly aligned coming in off the wire. That might
> > > be fixable though.
> >
> > Ben Coddington mentioned to me that soft-iwarp would allow use of RDMA
> > over TCP to work around SUNRPC TCP's XDR handling always storing the
> > write payload in misaligned IO. But that's purely a stop-gap
> > workaround, which needs testing (to see if soft-iwarp negates the win
> > of using O_DIRECT, etc).
>
> (Ab)using soft-iwarp as the basis for easily getting page-aligned TCP
> WRITE payloads seems pretty gross given we are chasing utmost
> performance, etc.
>
> All said, I welcome your sage advice and help on this effort to
> DIO-align SUNRPC TCP's WRITE payload pages.
>
> Thanks,
> Mike

(Sent this to Mike only by accident yesterday -- resending to the full
list now)

I've been looking over the code today. Basically, I think we need to have
svc_tcp_recvfrom() receive in phases. At a high level:

1/ Receive the record marker (just like it does today).

2/ Receive enough for the RPC header and then decode it.

3/ Use the RPC program and version from the decoded header to look up the
   svc_program. Add an optional pg_tcp_recvfrom callback to that structure
   that will receive the rest of the data into the buffer. If
   pg_tcp_recvfrom isn't set, then just call svc_tcp_read_msg() like we do
   today.

For NFSv3, pg_tcp_recvfrom can just look at the procedure. If it's
anything but a WRITE, we'll just do what we do today
(svc_tcp_read_msg()). For a WRITE, we'll receive the first part of the
WRITE3args (everything but the data) into rq_pages, and decode it. We can
then use that info to figure out the alignment. Advance to the next page
in rq_pages, and then to the point where the data is properly aligned. Do
the receive into that spot.
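To make the placement arithmetic concrete, here is a rough user-space
sketch of that last step (illustrative only, not kernel code: MY_PAGE_SIZE
and the example offsets are made up, and the real thing would of course
operate on rq_pages):

/*
 * Sketch: once the RPC header and the fixed part of the WRITE3args have
 * been received and decoded, we know the file offset of the payload.
 * Pick a spot on a fresh page so that the payload's in-memory alignment
 * mirrors its file-offset alignment; a DIO-aligned WRITE then lands in a
 * DIO-aligned buffer and can be submitted with O_DIRECT without a copy.
 */
#include <stdio.h>
#include <stdint.h>

#define MY_PAGE_SIZE 4096UL     /* stand-in for the kernel's PAGE_SIZE */

static unsigned long payload_page_offset(uint64_t write_file_offset)
{
        /*
         * Mirror the file offset within the page.  A WRITE whose file
         * offset isn't DIO-aligned falls back to buffered IO anyway, so
         * its placement doesn't matter.
         */
        return (unsigned long)(write_file_offset % MY_PAGE_SIZE);
}

int main(void)
{
        /* Hypothetical WRITEs: one 4K-aligned, one 512-byte-aligned. */
        printf("file offset 0x10000 -> receive payload at page offset %lu\n",
               payload_page_offset(0x10000));   /* 0   */
        printf("file offset 0x10200 -> receive payload at page offset %lu\n",
               payload_page_offset(0x10200));   /* 512 */
        return 0;
}

The point is just that a WRITE whose file offset is suitably aligned ends
up in a suitably aligned buffer, so it can go out via O_DIRECT without a
copy; everything else keeps falling back to buffered IO as it does today.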
Then we just add an RQ_ALIGNED_DATA flag to rqstp->rq_flags, and teach
nfsd3_proc_write how to find the data and do a DIO write when it's set.

Unaligned writes are still a problem though. If two WRITE RPCs come in for
different parts of the same block at the same time, then you could end up
losing the result of the first write. I don't see a way to make that
non-racy.

NFSv4 will also be a bit of a challenge. We'll need to receive the whole
compound one operation at a time. If we hit a WRITE, then we can just do
the same thing that we do for v3 to align the data.

I'd probably aim to start with an implementation for v3, and then add v4
support in a second phase.

I'm interested in working on this. It'll be a fair bit of work though.
I'll need to think about how to break this up into manageable pieces.

-- 
Jeff Layton <jlayton@xxxxxxxxxx>