Re: Fwd: [PATCH][SMB3 client] fix TCP timers deadlock after rmmod

Wang Zhaolong <wangzhaolong1@xxxxxxxxxx> · Tue, 1 Apr 2025 21:54:47 +0800

Hi.

My colleagues and I have been investigating the issue addressed by this patch
and have discovered some significant concerns that require broader discussion.

### Socket Leak Issue

After testing this patch extensively, I've confirmed it introduces a socket leak
when TCP connections don't complete proper termination (e.g., when FIN packets
are dropped). The leak manifests as a continuous increase in TCP slab usage:

I've documented this issue with a reproducer in Bugzilla:

https://bugzilla.kernel.org/show_bug.cgi?id=219972#c0

The key issue appears to stem from the interaction between the SMB client and TCP
socket lifecycle management:

1. Removing `sk->sk_net_refcnt = 1` causes TCP timers to be terminated early in
   `tcp_close()` via the `inet_csk_clear_xmit_timers_sync()` call when
   `!sk->sk_net_refcnt`
2. This early timer termination prevents proper reference counting resolution
   when connections don't complete the 4-way TCP termination handshake
3. The resulting socket references are never fully released, leading to a leak

#### Timeline of Related Changes

1. v4.2-rc1 26abe14379f8 ("net: Modify sk_alloc to not reference count the netns of kernel sockets")
   - Added `sk_net_refcnt` field to distinguish user sockets (=1) from kernel sockets (=0)
   - Kernel sockets don't hold netns references, which can lead to potential UAF issues

2. v6.9-rc2 151c9c724d05: ("tcp: properly terminate timers for kernel sockets")
   - Modified `tcp_close()` to check `sk->sk_net_refcnt` and explicitly terminate timers for kernel sockets (=0)
   - This prevents UAF when netns is destroyed before socket timers complete
   - **Key change**: If `!sk->sk_net_refcnt`, call `inet_csk_clear_xmit_timers_sync()`

3. v6.12-rc7 ef7134c7fc48: ("smb: client: Fix use-after-free of network namespace")
   - Fixed netns UAF in CIFS by manually setting `sk->sk_net_refcnt = 1`
   - Also called `maybe_get_net()` to maintain netns references
   - This effectively made kernel sockets behave like user sockets for reference counting

4. v6.13-rc4 e9f2517a3e18: ("smb: client: fix TCP timers deadlock after rmmod")
   - Problem commit: Removed `sk->sk_net_refcnt = 1` setting
   - Changed to using explicit `get_net()/put_net()` at CIFS layer
   - This change leads to socket leaks because timers are terminated early

### Lockdep Warning Analysis

I've also investigated the lockdep warning mentioned in the patch. My analysis
suggests it may be a false positive rather than an actual deadlock. The crash
actually occurs in the lockdep subsystem itself (null pointer dereference in
`check_wait_context()`), not in the CIFS or networking code directly.

The procedure for the null pointer dereference is as follows:

When lockdep is enabled, the lock class "slock-AF_INET-CIFS" is set when a socket
connection is established.

```
cifs_do_mount
  cifs_mount
    mount_get_conns
      cifs_get_tcp_session
        ip_connect
          generic_ip_connect
            cifs_reclassify_socket4
              sock_lock_init_class_and_name(sk, "slock-AF_INET-CIFS",
                lockdep_init_map
                  lockdep_init_map_wait
                    lockdep_init_map_type
                      lockdep_init_map_type
                        register_lock_class
                          __set_bit(class - lock_classes, lock_classes_in_use);
```

When the module is unloaded, the lock class is cleaned up.

```
free_mod_mem
  lockdep_free_key_range
    __lockdep_free_key_range
      zap_class
        __clear_bit(class - lock_classes, lock_classes_in_use);
```

After the module is uninstalled and the network connection is restored, the
timer is woken up.

```
run_timer_softirq
  run_timer_base
    __run_timers
      call_timer_fn
        tcp_write_timer
          bh_lock_sock
            spin_lock(&((__sk)->sk_lock.slock))
              _raw_spin_lock
                lock_acquire
                  __lock_acquire
                    check_wait_context
                      hlock_class
                       if (!test_bit(class_idx, lock_classes_in_use)) {
                          return NULL;
                      hlock_class(next)->wait_type_inner; // Null pointer dereference
```

The problem lies within lockdep, as Kuniyuki says:

I tried the repro and confirmed it triggers null deref.

It happens in LOCKDEP internal, so for me it looks like a problem in
LOCKDEP rather than CIFS or TCP.

I think LOCKDEP should hold a module reference and prevent related
modules from being unloaded.

Regarding the deadlock issue, it is clear that the locks mentioned in the deadlock warning
do not belong to the same lock instance. A deadlock should not occur.

### Discussion Points

1. API Design Question: Is this fundamentally an issue with how CIFS uses the socket
   API, or is it a networking layer issue that should handle socket cleanup differently?

2. Approach to Resolution: Would it be better to:
   - Revert to the original solution (setting `sk->sk_net_refcnt = 1`) from ef7134c7fc48?
   - Work with networking subsystem maintainers on a more comprehensive solution that
     handles socket cleanup properly?

3.  CVE Process Question: Given that CVE-2024-54680 appears to "fix" a non-existent issue
    while introducing an actual vulnerability, what's the appropriate way to address this?

What's the best path forward?

Best regards,
Wang Zhaolong

Adding fsdevel and networking in case any thoughts on this fix for
network/namespaces refcount issue (that causes rmmod UAF).

Any opinions on Enzo's proposed Fix?

---------- Forwarded message ---------
From: Steve French <smfrench@xxxxxxxxx>
Date: Tue, Dec 17, 2024 at 9:24 PM
Subject: [PATCH][SMB3 client] fix TCP timers deadlock after rmmod
To: CIFS <linux-cifs@xxxxxxxxxxxxxxx>
Cc: Kuniyuki Iwashima <kuniyu@xxxxxxxxxx>, Enzo Matsumiya <ematsumiya@xxxxxxx>

Enzo had an interesting patch, that seems to fix an important problem.

Here was his repro scenario:

      tw:~ # mount.cifs -o credentials=/root/wincreds,echo_interval=10
//someserver/target1 /mnt/test
      tw:~ # ls /mnt/test
      abc  dir1  dir3  target1_file.txt  tsub
      tw:~ # iptables -A INPUT -s someserver -j DROP

Trigger reconnect and wait for 3*echo_interval:

      tw:~ # cat /mnt/test/target1_file.txt
      cat: /mnt/test/target1_file.txt: Host is down

Then umount and rmmod.  Note that rmmod might take several iterations
until it properly tears down everything, so make sure you see the "not
loaded" message before proceeding:

      tw:~ # umount /mnt/*; rmmod cifs
      umount: /mnt/az: not mounted.
      umount: /mnt/dfs: not mounted.
      umount: /mnt/local: not mounted.
      umount: /mnt/scratch: not mounted.
      rmmod: ERROR: Module cifs is in use
      ...
      tw:~ # rmmod cifs
      rmmod: ERROR: Module cifs is not currently loaded

Then kickoff the TCP internals:
      tw:~ # iptables -F

Gets the lockdep warning (requires CONFIG_LOCKDEP=y) + a NULL deref
later on.

Any thoughts on his patch?  See below (and attached)

     Commit ef7134c7fc48 ("smb: client: Fix use-after-free of network
namespace.")
     fixed a netns UAF by manually enabled socket refcounting
     (sk->sk_net_refcnt=1 and sock_inuse_add(net, 1)).

     The reason the patch worked for that bug was because we now hold
     references to the netns (get_net_track() gets a ref internally)
     and they're properly released (internally, on __sk_destruct()),
     but only because sk->sk_net_refcnt was set.

     Problem:
     (this happens regardless of CONFIG_NET_NS_REFCNT_TRACKER and regardless
     if init_net or other)

     Setting sk->sk_net_refcnt=1 *manually* and *after* socket creation is not
     only out of cifs scope, but also technically wrong -- it's set conditionally
     based on user (=1) vs kernel (=0) sockets.  And net/ implementations
     seem to base their user vs kernel space operations on it.

     e.g. upon TCP socket close, the TCP timers are not cleared because
     sk->sk_net_refcnt=1:
     (cf. commit 151c9c724d05 ("tcp: properly terminate timers for
kernel sockets"))

     net/ipv4/tcp.c:
         void tcp_close(struct sock *sk, long timeout)
         {
             lock_sock(sk);
             __tcp_close(sk, timeout);
             release_sock(sk);
             if (!sk->sk_net_refcnt)
                     inet_csk_clear_xmit_timers_sync(sk);
             sock_put(sk);
         }

     Which will throw a lockdep warning and then, as expected, deadlock on
     tcp_write_timer().

     A way to reproduce this is by running the reproducer from ef7134c7fc48
     and then 'rmmod cifs'.  A few seconds later, the deadlock/lockdep
     warning shows up.

     Fix:
     We shouldn't mess with socket internals ourselves, so do not set
     sk_net_refcnt manually.

     Also change __sock_create() to sock_create_kern() for explicitness.

     As for non-init_net network namespaces, we deal with it the best way
     we can -- hold an extra netns reference for server->ssocket and drop it
     when it's released.  This ensures that the netns still exists whenever
     we need to create/destroy server->ssocket, but is not directly tied to
     it.

     Fixes: ef7134c7fc48 ("smb: client: Fix use-after-free of network
namespace.")

--
Thanks,

Steve