Hi Trond,

We hit a new failure mode with the containerized NFS client teardown
patches. The symptom is that the state recovery thread gets stuck, and
then a flusher thread ends up stuck like this:

#0 context_switch (kernel/sched/core.c:5434:2)
#1 __schedule (kernel/sched/core.c:6799:8)
#2 __schedule_loop (kernel/sched/core.c:6874:3)
#3 schedule (kernel/sched/core.c:6889:2)
#4 nfs_wait_bit_killable (fs/nfs/inode.c:77:2)
#5 __wait_on_bit (kernel/sched/wait_bit.c:49:10)
#6 out_of_line_wait_on_bit (kernel/sched/wait_bit.c:64:9)
#7 wait_on_bit_action (./include/linux/wait_bit.h:156:9)
#8 nfs4_wait_clnt_recover (fs/nfs/nfs4state.c:1329:8)
#9 nfs4_client_recover_expired_lease (fs/nfs/nfs4state.c:1347:9)
#10 pnfs_update_layout (fs/nfs/pnfs.c:2133:17)
#11 ff_layout_pg_get_mirror_count_write (fs/nfs/flexfilelayout/flexfilelayout.c:949:4)
#12 nfs_pageio_setup_mirroring (fs/nfs/pagelist.c:1114:18)
#13 nfs_pageio_add_request (fs/nfs/pagelist.c:1406:2)
#14 nfs_page_async_flush (fs/nfs/write.c:631:7)
#15 nfs_do_writepage (fs/nfs/write.c:657:9)
#16 nfs_writepages_callback (fs/nfs/write.c:684:8)
#17 write_cache_pages (mm/page-writeback.c:2569:11)
#18 nfs_writepages (fs/nfs/write.c:723:9)
#19 do_writepages (mm/page-writeback.c:2612:10)
#20 __writeback_single_inode (fs/fs-writeback.c:1650:8)
#21 writeback_sb_inodes (fs/fs-writeback.c:1942:3)
#22 __writeback_inodes_wb (fs/fs-writeback.c:2013:12)
#23 wb_writeback (fs/fs-writeback.c:2120:15)
#24 wb_check_old_data_flush (fs/fs-writeback.c:2224:10)
#25 wb_do_writeback (fs/fs-writeback.c:2277:11)
#26 wb_workfn (fs/fs-writeback.c:2305:20)
#27 process_one_work (kernel/workqueue.c:3267:2)
#28 process_scheduled_works (kernel/workqueue.c:3348:3)
#29 worker_thread (kernel/workqueue.c:3429:4)
#30 kthread (kernel/kthread.c:388:9)
#31 ret_from_fork (arch/x86/kernel/process.c:147:3)
#32 ret_from_fork_asm+0x11/0x16 (arch/x86/entry/entry_64.S:244)

Turning up some tracepoints, I see this:

kworker/u1028:1-997456 [159] ..... 9027.330396: rpc_task_run_action: task:0000174b@00000000 flags=ASYNC|MOVEABLE|NETUNREACH_FATAL|DYNAMIC|SOFT|TIMEOUT|NORTO runstate=RUNNING|ACTIVE status=0 action=call_reserve [sunrpc]
kworker/u1028:1-997456 [159] ..... 9027.330397: xprt_reserve: task:0000174b@00000000 xid=0x6bfc099c
kworker/u1028:1-997456 [159] ..... 9027.330397: rpc_task_run_action: task:0000174b@00000000 flags=ASYNC|MOVEABLE|NETUNREACH_FATAL|DYNAMIC|SOFT|TIMEOUT|NORTO runstate=RUNNING|ACTIVE status=0 action=call_reserveresult [sunrpc]
kworker/u1028:1-997456 [159] ..... 9027.330397: rpc_task_run_action: task:0000174b@00000000 flags=ASYNC|MOVEABLE|NETUNREACH_FATAL|DYNAMIC|SOFT|TIMEOUT|NORTO runstate=RUNNING|ACTIVE status=0 action=call_refresh [sunrpc]
kworker/u1028:1-997456 [159] ..... 9027.330399: rpc_task_run_action: task:0000174b@00000000 flags=ASYNC|MOVEABLE|NETUNREACH_FATAL|DYNAMIC|SOFT|TIMEOUT|NORTO runstate=RUNNING|ACTIVE status=0 action=call_refreshresult [sunrpc]
kworker/u1028:1-997456 [159] ..... 9027.330399: rpc_task_run_action: task:0000174b@00000000 flags=ASYNC|MOVEABLE|NETUNREACH_FATAL|DYNAMIC|SOFT|TIMEOUT|NORTO runstate=RUNNING|ACTIVE status=0 action=call_allocate [sunrpc]
kworker/u1028:1-997456 [159] ..... 9027.330400: rpc_buf_alloc: task:0000174b@00000000 callsize=368 recvsize=80 status=0
kworker/u1028:1-997456 [159] ..... 9027.330400: rpc_task_run_action: task:0000174b@00000000 flags=ASYNC|MOVEABLE|NETUNREACH_FATAL|DYNAMIC|SOFT|TIMEOUT|NORTO runstate=RUNNING|ACTIVE status=0 action=call_encode [sunrpc]
kworker/u1028:1-997456 [159] ..... 9027.330402: rpc_task_run_action: task:0000174b@00000000 flags=ASYNC|MOVEABLE|NETUNREACH_FATAL|DYNAMIC|SOFT|TIMEOUT|NORTO runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=0 action=call_connect [sunrpc]
kworker/u1028:1-997456 [159] ..... 9027.330403: xprt_reserve_xprt: task:0000174b@00000000 snd_task:0000174b
kworker/u1028:1-997456 [159] ..... 9027.330404: xprt_connect: peer=[2a03:83e4:4002:0:12a::2a]:20492 state=LOCKED|BOUND
kworker/u1028:1-997456 [159] ..... 9027.330405: rpc_task_sleep: task:0000174b@00000000 flags=ASYNC|MOVEABLE|NETUNREACH_FATAL|DYNAMIC|SOFT|TIMEOUT|NORTO runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=0 timeout=0 queue=xprt_pending
kworker/u1028:1-997456 [159] ..... 9027.330442: rpc_socket_connect: error=-22 socket:[15084763] srcaddr=[::]:993 dstaddr=[2a03:83e4:4002:0:12a::2a]:0 state=1 (UNCONNECTED) sk_state=7 (CLOSE)
kworker/u1028:1-997456 [159] ..... 9027.330443: rpc_task_wakeup: task:0000174b@00000000 flags=ASYNC|MOVEABLE|NETUNREACH_FATAL|DYNAMIC|SOFT|TIMEOUT|NORTO runstate=QUEUED|ACTIVE|NEED_XMIT|NEED_RECV status=-22 timeout=60000 queue=xprt_pending
kworker/u1028:1-997456 [159] ..... 9027.330444: xprt_disconnect_force: peer=[2a03:83e4:4002:0:12a::2a]:20492 state=LOCKED|CONNECTING|BOUND|SND_IS_COOKIE
kworker/u1028:1-997456 [159] ..... 9027.330444: xprt_release_xprt: task:ffffffff@ffffffff snd_task:ffffffff
kworker/u1028:1-997456 [159] ..... 9027.330445: rpc_task_run_action: task:0000174b@00000000 flags=ASYNC|MOVEABLE|NETUNREACH_FATAL|DYNAMIC|SOFT|TIMEOUT|NORTO runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=-22 action=call_connect_status [sunrpc]
kworker/u1028:1-997456 [159] ..... 9027.330446: rpc_connect_status: task:0000174b@00000000 status=-22
kworker/u1028:1-997456 [159] ..... 9027.330446: rpc_call_rpcerror: task:0000174b@00000000 tk_status=-22 rpc_status=-22
kworker/u1028:1-997456 [159] ..... 9027.330446: rpc_task_run_action: task:0000174b@00000000 flags=ASYNC|MOVEABLE|NETUNREACH_FATAL|DYNAMIC|SOFT|TIMEOUT|NORTO runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=-22 action=rpc_exit_task [sunrpc]
kworker/u1028:1-997456 [159] ..... 9027.330447: rpc_task_end: task:0000174b@00000000 flags=ASYNC|MOVEABLE|NETUNREACH_FATAL|DYNAMIC|SOFT|TIMEOUT|NORTO runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=-22 action=rpc_exit_task [sunrpc]

There is no traffic on the wire.

So the problem in this situation seems to be that we call call_connect
and get back -EINVAL from the network layer, which doesn't trigger the
-ENETUNREACH handling. Is this a bug in the network layer, or do we
also need to handle -EINVAL in call_connect_status? ...or do we need a
different fix? Thoughts?
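If the answer is that the RPC client should cope with this, the sort of
thing I'm imagining is below. This is just a hand-typed sketch of the
errno switch in call_connect_status() (the surrounding context is from
memory, not a diff against any tree), and whether -EINVAL really
belongs with the unreachable errors is exactly the open question:

	switch (status) {
	case -EINVAL:
		/*
		 * Assumption: a connect attempt from a netns that is
		 * being torn down can surface as -EINVAL. Remap it so
		 * that it takes the same path as an unreachable
		 * network, and so honors RPC_TASK_NETUNREACH_FATAL.
		 */
		status = -ENETUNREACH;
		fallthrough;
	case -ENETDOWN:
	case -ENETUNREACH:
		if (task->tk_flags & RPC_TASK_NETUNREACH_FATAL)
			break;	/* becomes fatal via rpc_call_rpcerror() */
		/* ...existing retry/backoff handling otherwise... */
		break;
	}

That would at least hand callers an error that the existing
-ENETUNREACH handling recognizes instead of a raw -EINVAL. Of course,
if the real bug is the network layer returning -EINVAL here in the
first place, then the above just papers over it.

-- 
Jeff Layton <jlayton@xxxxxxxxxx>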