[PATCH AUTOSEL 6.6 08/13] sunrpc: don't immediately retransmit on seqno miss

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



From: Nikhil Jha <njha@xxxxxxxxxxxxxx>

[ Upstream commit fadc0f3bb2de8c570ced6d9c1f97222213d93140 ]

RFC2203 requires that retransmitted messages use a new gss sequence
number, but the same XID. This means that if the server is just slow
(e.x. overloaded), the client might receive a response using an older
seqno than the one it has recorded.

Currently, Linux's client immediately retransmits in this case. However,
this leads to a lot of wasted retransmits until the server eventually
responds faster than the client can resend.

Client -> SEQ 1 -> Server
Client -> SEQ 2 -> Server
Client <- SEQ 1 <- Server (misses, expecting seqno = 2)
Client -> SEQ 3 -> Server (immediate retransmission on miss)
Client <- SEQ 2 <- Server (misses, expecting seqno = 3)
Client -> SEQ 4 -> Server (immediate retransmission on miss)
... and so on ...

This commit makes it so that we ignore messages with bad checksums
due to seqnum mismatch, and rely on the usual timeout behavior for
retransmission instead of doing so immediately.

Signed-off-by: Nikhil Jha <njha@xxxxxxxxxxxxxx>
Acked-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
Signed-off-by: Anna Schumaker <anna.schumaker@xxxxxxxxxx>
Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>
---

Based on my analysis of the commit and the similar backported commits as
reference, here is my assessment:

**YES**

This commit should be backported to stable kernel trees. Here's my
detailed analysis:

## Key Analysis Points:

### 1. **Bug Fix Nature**:
This commit addresses a clear performance bug in SUNRPC's RPCSEC_GSS
sequence number handling. The current code immediately retransmits when
it receives a response with an older sequence number than expected,
leading to excessive and wasteful retransmissions when the server is
simply slow.

### 2. **Alignment with RFC2203**:
The commit correctly implements RFC2203 requirements, which state that
retransmitted messages should use new GSS sequence numbers but keep the
same XID. The current behavior creates a retransmission storm that
violates the spirit of this requirement.

### 3. **Code Changes Analysis**:
The modification in `rpc_decode_header()` at the `out_verifier` label is
minimal and targeted:

**Before** (lines 2773-2776):
```c
case -EACCES:
    /* Re-encode with a fresh cred */
    fallthrough;
default:
    goto out_garbage;
```

**After**:
```c
case -EACCES:
    /* possible RPCSEC_GSS out-of-sequence event (RFC2203),
     - reset recv state and keep waiting, don't retransmit
     */
    task->tk_rqstp->rq_reply_bytes_recvd = 0;
    task->tk_status = xprt_request_enqueue_receive(task);
    task->tk_action = call_transmit_status;
    return -EBADMSG;
default:
    goto out_garbage;
```

### 4. **Risk Assessment - LOW**:
- **Scope**: Changes are confined to a specific error handling path in
  SUNRPC client code
- **Behavioral change**: Instead of immediate retransmission, it now
  waits for normal timeout-based retransmission
- **Fallback**: The normal timeout mechanism still provides
  retransmission if needed
- **Error handling**: Uses existing infrastructure
  (`xprt_request_enqueue_receive`, `call_transmit_status`)

### 5. **Comparison with Similar Backported Commits**:
This commit shares characteristics with successful backports:
- **Similar Commit #1**: Also fixed SUNRPC verifier handling by changing
  `goto out_err` to `goto out_garbage`
- **Similar Commit #2**: Fixed incomplete RPC message handling with
  proper state management
- **Similar Commit #5**: Fixed RPCSEC_GSS sequence number limit handling

All these were backported because they fixed protocol compliance issues
with minimal risk.

### 6. **User Impact**:
- **Performance improvement**: Significantly reduces unnecessary
  retransmissions under server load
- **Network efficiency**: Reduces wasted bandwidth and server load
- **No breaking changes**: Maintains compatibility while improving
  behavior

### 7. **Stability Criteria Compliance**:
✅ **Important bugfix**: Fixes performance degradation and protocol
compliance
✅ **Minimal risk**: Small, localized change to error handling path
✅ **No new features**: Pure bugfix, no architectural changes
✅ **Subsystem-contained**: Changes only affect SUNRPC client verifier
handling
✅ **Clear benefit**: Reduces retransmission storms in production
environments

The commit follows the stable tree rules perfectly: it's a targeted fix
for a real-world performance problem with minimal risk of regression,
affecting only the specific error case of RPCSEC_GSS sequence number
mismatches.

 net/sunrpc/clnt.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 4ffb2bcaf3648..63756607f6327 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -2733,8 +2733,13 @@ rpc_decode_header(struct rpc_task *task, struct xdr_stream *xdr)
 	case -EPROTONOSUPPORT:
 		goto out_err;
 	case -EACCES:
-		/* Re-encode with a fresh cred */
-		fallthrough;
+		/* possible RPCSEC_GSS out-of-sequence event (RFC2203),
+		 * reset recv state and keep waiting, don't retransmit
+		 */
+		task->tk_rqstp->rq_reply_bytes_recvd = 0;
+		task->tk_status = xprt_request_enqueue_receive(task);
+		task->tk_action = call_transmit_status;
+		return -EBADMSG;
 	default:
 		goto out_garbage;
 	}
-- 
2.39.5





[Index of Archives]     [Linux Filesystem Development]     [Linux USB Development]     [Linux Media Development]     [Video for Linux]     [Linux NILFS]     [Linux Audio Users]     [Yosemite Info]     [Linux SCSI]

  Powered by Linux