On 8/4/25 12:21 PM, Trond Myklebust wrote:
On Mon, 2025-08-04 at 12:08 -0700, Dai Ngo wrote:
Currently, when an RPC connection times out during the connect phase,
the task is retried by placing it back on the pending queue and
waiting
again. In some cases, the timeout occurs because TCP is unable to
send
the SYN packet. This situation most often arises on bare metal
systems
at boot time, when the NFS mount is attempted while the network link
appears to be up but is not yet stable.
This patch addresses the issue by updating call_connect_status to
destroy
the transport on ETIMEDOUT error before retrying the connection. This
ensures that subsequent connection attempts use a fresh transport,
reducing the likelihood of repeated failures due to lingering network
issues.
Signed-off-by: Dai Ngo <dai.ngo@xxxxxxxxxx>
---
net/sunrpc/clnt.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 21426c3049d3..701b742750c5 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -2215,6 +2215,7 @@ call_connect_status(struct rpc_task *task)
case -EHOSTUNREACH:
case -EPIPE:
case -EPROTO:
+ case -ETIMEDOUT:
xprt_conditional_disconnect(task->tk_rqstp->rq_xprt,
task->tk_rqstp-
rq_connect_cookie);
if (RPC_IS_SOFTCONN(task))
@@ -2225,7 +2226,6 @@ call_connect_status(struct rpc_task *task)
case -EADDRINUSE:
case -ENOTCONN:
case -EAGAIN:
- case -ETIMEDOUT:
if (!(task->tk_flags & RPC_TASK_NO_ROUND_ROBIN) &&
(task->tk_flags & RPC_TASK_MOVEABLE) &&
test_bit(XPRT_REMOVE, &xprt->state)) {
Why is this needed? The ETIMEDOUT is supposed to be a task level error,
not a connection level thing.
If TCP was not able to sent the SYN out on due to the transient error
with the link status and the task just turns around a wait again, since
TCP does not retry the SYN, eventually systemd times out and stops the
mount.
Below is the snippet from the system log with rpcdebug enabled:
Jun 20 10:23:01 qa-i360-odi03 kernel: i40e 0000:98:00.0 eth1: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None
Jun 20 10:23:09 qa-i360-odi03 NetworkManager[1511]: <info> [1750414989.6033] manager: startup complete
... <NFS mount starts>
Jun 20 10:23:09 qa-i360-odi03 systemd[1]: Mounting /odi...
...
Jun 20 10:23:09 qa-i360-odi03 kernel: RPC: 1 added to queue 0000000093f481cd "xprt_pending"
Jun 20 10:23:09 qa-i360-odi03 kernel: RPC: 1 setting alarm for 60000 ms
... <link status down & up>
Jun 20 10:23:10 qa-i360-odi03 kernel: tg3 0000:04:00.0 em1: Link is up at 1000 Mbps, full duplex
Jun 20 10:23:10 qa-i360-odi03 NetworkManager[1511]: <info> [1750414990.6359] device (em1): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Jun 20 10:23:10 qa-i360-odi03 NetworkManager[1511]: <info> [1750414990.6361] device (em1): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Jun 20 10:23:10 qa-i360-odi03 NetworkManager[1511]: <info> [1750414990.6364] device (em1): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
Jun 20 10:23:10 qa-i360-odi03 NetworkManager[1511]: <info> [1750414990.6383] device (em1): Activation: successful, device activated.
... <connect timed out>
Jun 20 10:24:11 qa-i360-odi03 kernel: RPC: 1 timeout
Jun 20 10:24:11 qa-i360-odi03 kernel: RPC: 1 __rpc_wake_up_task (now 4294742016)
Jun 20 10:24:11 qa-i360-odi03 kernel: RPC: 1 disabling timer
Jun 20 10:24:11 qa-i360-odi03 kernel: RPC: 1 removed from queue 0000000093f481cd "xprt_pending"
Jun 20 10:24:11 qa-i360-odi03 kernel: RPC: 1 call_connect_status (status -110)
... <wait again>
Jun 20 10:24:11 qa-i360-odi03 kernel: RPC: 1 sleep_on(queue "xprt_pending" time 4294742016)
Jun 20 10:24:11 qa-i360-odi03 kernel: RPC: 1 added to queue 0000000093f481cd "xprt_pending"
... <systemd timed out>
Jun 20 10:24:39 qa-i360-odi03 systemd[1]: odi.mount: Mounting timed out. Terminating.
Jun 20 10:24:39 qa-i360-odi03 kernel: RPC: 1 got signal
Jun 20 10:24:39 qa-i360-odi03 kernel: RPC: 1 __rpc_wake_up_task (now 4294770229)
Jun 20 10:24:39 qa-i360-odi03 kernel: RPC: 1 disabling timer
Jun 20 10:24:39 qa-i360-odi03 kernel: RPC: 1 removed from queue 0000000093f481cd "xprt_pending"
Jun 20 10:24:39 qa-i360-odi03 kernel: <-- nfs4_try_mount() = -512 [error]
Jun 20 10:24:39 qa-i360-odi03 systemd[1]: odi.mount: Mount process exited, code=killed status=15
Jun 20 10:24:39 qa-i360-odi03 systemd[1]: odi.mount: Failed with result 'timeout'.
This patch forces TCP to send the SYN on ETIMEDOUT error.
-Dai
Oh... Is this because of TLS? If so, then please fix that to use a more
appropriate error.