Re: [PATCH 1/1] SUNRPC: call_connect_status needs to destroy transport on ETIMEDOUT before retry

Dai Ngo <dai.ngo@xxxxxxxxxx> · Tue, 5 Aug 2025 11:25:49 -0700

On 8/5/25 10:41 AM, Trond Myklebust wrote:
On Tue, 2025-08-05 at 10:07 -0700, Dai Ngo wrote:
On 8/5/25 9:41 AM, Trond Myklebust wrote:
On Tue, 2025-08-05 at 08:46 -0700, Dai Ngo wrote:
On 8/4/25 4:55 PM, Trond Myklebust wrote:
On Mon, 2025-08-04 at 13:13 -0700, Dai Ngo wrote:
On 8/4/25 12:21 PM, Trond Myklebust wrote:
On Mon, 2025-08-04 at 12:08 -0700, Dai Ngo wrote:
Currently, when an RPC connection times out during the
connect
phase,
the task is retried by placing it back on the pending
queue
and
waiting
again. In some cases, the timeout occurs because TCP is
unable to
send
the SYN packet. This situation most often arises on bare
metal
systems
at boot time, when the NFS mount is attempted while the
network
link
appears to be up but is not yet stable.

This patch addresses the issue by updating
call_connect_status to
destroy
the transport on ETIMEDOUT error before retrying the
connection.
This
ensures that subsequent connection attempts use a fresh
transport,
reducing the likelihood of repeated failures due to
lingering
network
issues.

Signed-off-by: Dai Ngo <dai.ngo@xxxxxxxxxx>
---
     net/sunrpc/clnt.c | 2 +-
     1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 21426c3049d3..701b742750c5 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -2215,6 +2215,7 @@ call_connect_status(struct rpc_task
*task)
     	case -EHOSTUNREACH:
     	case -EPIPE:
     	case -EPROTO:
+	case -ETIMEDOUT:
     		xprt_conditional_disconnect(task-
tk_rqstp-
rq_xprt,
     					    task-
tk_rqstp-
rq_connect_cookie);
     		if (RPC_IS_SOFTCONN(task))
@@ -2225,7 +2226,6 @@ call_connect_status(struct rpc_task
*task)
     	case -EADDRINUSE:
     	case -ENOTCONN:
     	case -EAGAIN:
-	case -ETIMEDOUT:
     		if (!(task->tk_flags &
RPC_TASK_NO_ROUND_ROBIN)
&&
     		    (task->tk_flags & RPC_TASK_MOVEABLE)
&&
     		    test_bit(XPRT_REMOVE, &xprt->state))
{
Why is this needed? The ETIMEDOUT is supposed to be a task
level
error,
not a connection level thing.
If TCP was not able to sent the SYN out on due to the
transient
error
with the link status and the task just turns around a wait
again,
since
TCP does not retry the SYN, eventually systemd times out and
stops
the
mount.


Below is the snippet from the system log with rpcdebug
enabled:


Jun 20 10:23:01 qa-i360-odi03 kernel: i40e 0000:98:00.0 eth1:
NIC
Link is Up, 10 Gbps Full Duplex, Flow Control: None
Jun 20 10:23:09 qa-i360-odi03 NetworkManager[1511]: <info>
[1750414989.6033] manager: startup complete

... <NFS mount starts>
Jun 20 10:23:09 qa-i360-odi03 systemd[1]: Mounting /odi...
...
Jun 20 10:23:09 qa-i360-odi03 kernel: RPC:     1 added to
queue
0000000093f481cd "xprt_pending"
Jun 20 10:23:09 qa-i360-odi03 kernel: RPC:     1 setting
alarm
for
60000 ms

... <link status down & up>
Jun 20 10:23:10 qa-i360-odi03 kernel: tg3 0000:04:00.0 em1:
Link
is
up at 1000 Mbps, full duplex
Jun 20 10:23:10 qa-i360-odi03 NetworkManager[1511]: <info>
[1750414990.6359] device (em1): state change: disconnected ->
prepare
(reason 'none', sys-iface-state: 'managed')
Jun 20 10:23:10 qa-i360-odi03 NetworkManager[1511]: <info>
[1750414990.6361] device (em1): state change: prepare ->
config
(reason 'none', sys-iface-state: 'managed')
Jun 20 10:23:10 qa-i360-odi03 NetworkManager[1511]: <info>
[1750414990.6364] device (em1): state change: config -> ip-
config
(reason 'none', sys-iface-state: 'managed')
Jun 20 10:23:10 qa-i360-odi03 NetworkManager[1511]: <info>
[1750414990.6383] device (em1): Activation: successful,
device
activated.

... <connect timed out>
Jun 20 10:24:11 qa-i360-odi03 kernel: RPC:     1 timeout
Jun 20 10:24:11 qa-i360-odi03 kernel: RPC:     1
__rpc_wake_up_task
(now 4294742016)
Jun 20 10:24:11 qa-i360-odi03 kernel: RPC:     1 disabling
timer
Jun 20 10:24:11 qa-i360-odi03 kernel: RPC:     1 removed from
queue
0000000093f481cd "xprt_pending"
Jun 20 10:24:11 qa-i360-odi03 kernel: RPC:     1
call_connect_status
(status -110)

... <wait again>
Jun 20 10:24:11 qa-i360-odi03 kernel: RPC:     1
sleep_on(queue
"xprt_pending" time 4294742016)
Jun 20 10:24:11 qa-i360-odi03 kernel: RPC:     1 added to
queue
0000000093f481cd "xprt_pending"

... <systemd timed out>
Jun 20 10:24:39 qa-i360-odi03 systemd[1]: odi.mount: Mounting
timed
out. Terminating.

Jun 20 10:24:39 qa-i360-odi03 kernel: RPC:     1 got signal
Jun 20 10:24:39 qa-i360-odi03 kernel: RPC:     1
__rpc_wake_up_task
(now 4294770229)
Jun 20 10:24:39 qa-i360-odi03 kernel: RPC:     1 disabling
timer
Jun 20 10:24:39 qa-i360-odi03 kernel: RPC:     1 removed from
queue
0000000093f481cd "xprt_pending"

Jun 20 10:24:39 qa-i360-odi03 kernel: <-- nfs4_try_mount() =
-512
[error]
Jun 20 10:24:39 qa-i360-odi03 systemd[1]: odi.mount: Mount
process
exited, code=killed status=15
Jun 20 10:24:39 qa-i360-odi03 systemd[1]: odi.mount: Failed
with
result 'timeout'.

This patch forces TCP to send the SYN on ETIMEDOUT error.

Something is very wrong here...

If this patch is correct, and the call to
xprt_conditional_disconnect()
does indeed cause the socket to close, then something must have
bumped
xprt->connect_cookie.
I'm a little confused here, xprt_conditional_disconnect only
closes
the
socket if the connect cookie is still the same:

           if (cookie != xprt->connect_cookie)
                   goto out;

So in this case the xprt->connect_cookie has not been bumped.
You're right... Sorry...

So does this mean that the socket is still in the SYN-SENT state?
It is
normally supposed to remain in that state for 60 seconds, and
resend
SYN using an exponential back off.
Apparently this was not done by TCP in this scenario:

... mount starts
Jun 20 10:23:09 qa-i360-odi03 systemd[1]: Mounting /odi...
Jun 20 10:23:09 qa-i360-odi03 kernel: RPC:     1 added to queue
0000000093f481cd "xprt_pending"

Jun 20 10:23:10 qa-i360-odi03 NetworkManager[1511]: <info>
[1750414990.6383] device (em1): Activation: successful, device
activated.

... RPC task timed out on connect after 62 secs.  Note that at this
time the network link status was already up for more than 60 secs
Jun 20 10:24:11 qa-i360-odi03 kernel: RPC:     1 timeout
Jun 20 10:24:11 qa-i360-odi03 kernel: RPC:     1 __rpc_wake_up_task
(now 4294742016)

... RPC task goes back and wait again.
Jun 20 10:24:11 qa-i360-odi03 kernel: RPC:     1 sleep_on(queue
"xprt_pending" time 4294742016)
Jun 20 10:24:11 qa-i360-odi03 kernel: RPC:     1 added to queue
0000000093f481cd "xprt_pending"

... nothing happened and systemd timed out after 90 secs and kill the
mount
Jun 20 10:24:39 qa-i360-odi03 systemd[1]: odi.mount: Mounting timed
out. Terminating.
Jun 20 10:24:39 qa-i360-odi03 kernel: <-- nfs4_try_mount() = -512
[error]
Jun 20 10:24:39 qa-i360-odi03 systemd[1]: odi.mount: Mount process
exited, code=killed status=15
Jun 20 10:24:39 qa-i360-odi03 systemd[1]: odi.mount: Failed with
result 'timeout'.

I don't know why TCP does not retry the SYN after the networkk link
is
up again but I though instead of trying to figure out and fix the
problem
at TCP layer, we can make the RPC connect process a bit more robust
by using
a new socket for retry.

We should never be using individual RPC message timeouts as a measure
for whether or not the TCP connection is down.

For the initial connection attempt, we use the TCP_SYNCNT socket option
to decide when to give up.

The problem we're having is the initial connection attempt at boot time,
NFS mount started while the network link has been stable yet. NetworkManager
could also be blamed here, it claims 'startup complete' which allows
systemd to start the NFS mounts.

And for some reasons TCP did not send the SYN under this flaky link condition,
even after the network link is up again; we don't see any SYN in the network
trace. The TCP_SYNCNT option is not in play here.

This seems like a bug in the TCP/network layer and it seems to me that
you prefer to push the responsibility to resolve this problem down to
TCP/network stack.

This problem is reproducible, do you want to collect any additional data
before making the final decision?

-Dai

  Once the connection is established, we set
TCP_USER_TIMEOUT as well as socket keepalive to ensure that the
connection is still alive.

I'm willing to consider further socket-specific timers, but not if the
reason is to work around bugs (assuming that is indeed the case here).