On Thu, Mar 20, 2025 at 09:16:15AM -0400, Chuck Lever wrote:
> On 3/19/25 1:02 PM, Nikhil Jha via B4 Relay wrote:
> > When the client retransmits an operation (for example, because the
> > server is slow to respond), a new GSS sequence number is associated with
> > the XID. In the current kernel code the original sequence number is
> > discarded. Subsequently, if a response to the original request is
> > received there will be a GSS sequence number mismatch. A mismatch will
> > trigger another retransmit, possibly repeating the cycle, and after some
> > number of failed retries EACCES is returned.
> >
> > RFC2203, section 5.3.3.1 suggests a possible solution... "cache the
> > RPCSEC_GSS sequence number of each request it sends" and "compute the
> > checksum of each sequence number in the cache to try to match the
> > checksum in the reply's verifier." This is what FreeBSD's implementation
> > does (rpc_gss_validate in sys/rpc/rpcsec_gss/rpcsec_gss.c).
> >
> > However, even with this cache, retransmits directly caused by a seqno
> > mismatch can still cause a bad message interleaving that results in this
> > bug. The RFC already suggests ignoring incorrect seqnos on the server
> > side, and this seems symmetric, so this patchset also applies that
> > behavior to the client.
> >
> > These two patches are *not* dependent on each other. I tested them by
> > delaying packets with a Python script hooked up to NFQUEUE. If it would
> > be helpful I can send this script along as well.
> >
> > Signed-off-by: Nikhil Jha <njha@xxxxxxxxxxxxxx>
> > ---
> > Changes since v1:
> > * Maintain the invariant that the first seqno is always first in
> >   rq_seqnos, so that it doesn't need to be stored twice.
> > * Minor formatting, and resending with proper mailing-list headers so the
> >   patches are easier to work with.
> >
> > ---
> > Nikhil Jha (2):
> >       sunrpc: implement rfc2203 rpcsec_gss seqnum cache
> >       sunrpc: don't immediately retransmit on seqno miss
> >
> >  include/linux/sunrpc/xprt.h    | 17 +++++++++++-
> >  include/trace/events/rpcgss.h  |  4 +--
> >  include/trace/events/sunrpc.h  |  2 +-
> >  net/sunrpc/auth_gss/auth_gss.c | 59 ++++++++++++++++++++++++++----------------
> >  net/sunrpc/clnt.c              |  9 +++++--
> >  net/sunrpc/xprt.c              |  3 ++-
> >  6 files changed, 64 insertions(+), 30 deletions(-)
> > ---
> > base-commit: 7eb172143d5508b4da468ed59ee857c6e5e01da6
> > change-id: 20250314-rfc2203-seqnum-cache-52389d14f567
> >
> > Best regards,
>
> This seems like a sensible thing to do to me.
>
> Acked-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
>
> --
> Chuck Lever

Hi,

We've been running this patch for a while now and noticed a (very silly
in hindsight) bug.

	maj_stat = gss_validate_seqno_mic(ctx, task->tk_rqstp->rq_seqnos[i], seq, p, len);

needs to be

	maj_stat = gss_validate_seqno_mic(ctx, task->tk_rqstp->rq_seqnos[i++], seq, p, len);

Or the kernel gets stuck in a loop when you have more than two retries.

I can resend this patch but I noticed it's already made its way into
quite a few trees. Should this be a separate patch instead?

- Nikhil
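
For anyone following along, here is a rough user-space sketch of why the
missing increment matters. It is not the kernel code: mic_matches() and
find_matching_seqno() are hypothetical stand-ins invented for illustration,
and the "MIC check" is reduced to a plain integer comparison. The point is
only that a loop walking the cached seqnos has to advance its index on
every attempt, otherwise it keeps retesting the same entry.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the real MIC validation: here a candidate
 * "validates" when it equals the seqno the reply was signed with.
 */
static bool mic_matches(uint32_t candidate, uint32_t signed_seqno)
{
	return candidate == signed_seqno;
}

/*
 * Try each cached sequence number in order, as RFC 2203 section 5.3.3.1
 * suggests. Returns the index of the matching entry, or -1 if none match.
 */
static int find_matching_seqno(const uint32_t *seqnos, size_t count,
			       uint32_t signed_seqno)
{
	size_t i = 0;

	while (i < count) {
		/*
		 * The post-increment moves on to the next cached seqno on
		 * every attempt. With seqnos[i] instead, a mismatch would
		 * leave i unchanged and the loop would never terminate.
		 */
		if (mic_matches(seqnos[i++], signed_seqno))
			return (int)(i - 1);
	}
	return -1;
}
```

With a cache of {10, 11, 12, 13} and a reply signed with 12, the sketch
returns index 2; a reply signed with a seqno that is not in the cache
returns -1 rather than spinning.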