(corrected netdev list) From: Wang Zhaolong <wangzhaolong1@xxxxxxxxxx> Date: Tue, 1 Apr 2025 21:54:47 +0800 > Hi. > > My colleagues and I have been investigating the issue addressed by this patch > and have discovered some significant concerns that require broader discussion. > > ### Socket Leak Issue > > After testing this patch extensively, I've confirmed it introduces a socket leak > when TCP connections don't complete proper termination (e.g., when FIN packets > are dropped). The leak manifests as a continuous increase in TCP slab usage: > > I've documented this issue with a reproducer in Bugzilla: > > https://bugzilla.kernel.org/show_bug.cgi?id=219972#c0 > > The key issue appears to stem from the interaction between the SMB client and TCP > socket lifecycle management: > > 1. Removing `sk->sk_net_refcnt = 1` causes TCP timers to be terminated early in > `tcp_close()` via the `inet_csk_clear_xmit_timers_sync()` call when > `!sk->sk_net_refcnt` > 2. This early timer termination prevents proper reference counting resolution > when connections don't complete the 4-way TCP termination handshake > 3. The resulting socket references are never fully released, leading to a leak > > #### Timeline of Related Changes > > 1. v4.2-rc1 26abe14379f8 ("net: Modify sk_alloc to not reference count the netns of kernel sockets") > - Added `sk_net_refcnt` field to distinguish user sockets (=1) from kernel sockets (=0) > - Kernel sockets don't hold netns references, which can lead to potential UAF issues > > 2. v6.9-rc2 151c9c724d05: ("tcp: properly terminate timers for kernel sockets") > - Modified `tcp_close()` to check `sk->sk_net_refcnt` and explicitly terminate timers for kernel sockets (=0) > - This prevents UAF when netns is destroyed before socket timers complete > - **Key change**: If `!sk->sk_net_refcnt`, call `inet_csk_clear_xmit_timers_sync()` > > 3. v6.12-rc7 ef7134c7fc48: ("smb: client: Fix use-after-free of network namespace") > - Fixed netns UAF in CIFS by manually setting `sk->sk_net_refcnt = 1` > - Also called `maybe_get_net()` to maintain netns references > - This effectively made kernel sockets behave like user sockets for reference counting > > 4. v6.13-rc4 e9f2517a3e18: ("smb: client: fix TCP timers deadlock after rmmod") > - Problem commit: Removed `sk->sk_net_refcnt = 1` setting > - Changed to using explicit `get_net()/put_net()` at CIFS layer > - This change leads to socket leaks because timers are terminated early > > ### Lockdep Warning Analysis > > I've also investigated the lockdep warning mentioned in the patch. My analysis > suggests it may be a false positive rather than an actual deadlock. Note that the 'deadlock' in the description is simply wrong as I mentioned before. The 'deadlock' means a lock which belongs to an unloaded module, and not actual deadlock like circular locking etc. https://lore.kernel.org/netdev/20241219084100.33837-1-kuniyu@xxxxxxxxxx/ > The crash > actually occurs in the lockdep subsystem itself (null pointer dereference in > `check_wait_context()`), not in the CIFS or networking code directly. > > The procedure for the null pointer dereference is as follows: > > When lockdep is enabled, the lock class "slock-AF_INET-CIFS" is set when a socket > connection is established. > > ``` > cifs_do_mount > cifs_mount > mount_get_conns > cifs_get_tcp_session > ip_connect > generic_ip_connect > cifs_reclassify_socket4 > sock_lock_init_class_and_name(sk, "slock-AF_INET-CIFS", > lockdep_init_map > lockdep_init_map_wait > lockdep_init_map_type > lockdep_init_map_type > register_lock_class > __set_bit(class - lock_classes, lock_classes_in_use); > ``` > > When the module is unloaded, the lock class is cleaned up. > > ``` > free_mod_mem > lockdep_free_key_range > __lockdep_free_key_range > zap_class > __clear_bit(class - lock_classes, lock_classes_in_use); > ``` > > After the module is uninstalled and the network connection is restored, the > timer is woken up. > > ``` > run_timer_softirq > run_timer_base > __run_timers > call_timer_fn > tcp_write_timer > bh_lock_sock > spin_lock(&((__sk)->sk_lock.slock)) > _raw_spin_lock > lock_acquire > __lock_acquire > check_wait_context > hlock_class > if (!test_bit(class_idx, lock_classes_in_use)) { > return NULL; > hlock_class(next)->wait_type_inner; // Null pointer dereference > ``` > > The problem lies within lockdep, as Kuniyuki says: > > > I tried the repro and confirmed it triggers null deref. > > > > It happens in LOCKDEP internal, so for me it looks like a problem in > > LOCKDEP rather than CIFS or TCP. > > > > I think LOCKDEP should hold a module reference and prevent related > > modules from being unloaded. > > Regarding the deadlock issue, it is clear that the locks mentioned in the deadlock warning > do not belong to the same lock instance. A deadlock should not occur. > > ### Discussion Points > > 1. API Design Question: Is this fundamentally an issue with how CIFS uses the socket > API, or is it a networking layer issue that should handle socket cleanup differently? > > 2. Approach to Resolution: Would it be better to: > - Revert to the original solution (setting `sk->sk_net_refcnt = 1`) from ef7134c7fc48? > - Work with networking subsystem maintainers on a more comprehensive solution that > handles socket cleanup properly? Could you test this patch with e9f2517a3e18 reverted ? ---8<--- diff --git a/include/net/sock.h b/include/net/sock.h index 8daf1b3b12c6..1e15a1209ea6 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -547,6 +547,10 @@ struct sock { struct rcu_head sk_rcu; netns_tracker ns_tracker; struct xarray sk_user_frags; + +#if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(MODULE) + struct module *sk_owner; +#endif }; struct sock_bh_locked { @@ -1583,6 +1587,15 @@ static inline void sk_mem_uncharge(struct sock *sk, int size) sk_mem_reclaim(sk); } +#if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(MODULE) +static inline void sk_set_owner(struct sock *sk, struct module *owner) +{ + sk->sk_owner = module_get(owner); +} +#else +#define sk_set_owner(sk, owner) +#endif + /* * Macro so as to not evaluate some arguments when * lockdep is not enabled. @@ -1592,6 +1605,7 @@ static inline void sk_mem_uncharge(struct sock *sk, int size) */ #define sock_lock_init_class_and_name(sk, sname, skey, name, key) \ do { \ + sk_set_owner(sk, THIS_MODULE); \ sk->sk_lock.owned = 0; \ init_waitqueue_head(&sk->sk_lock.wq); \ spin_lock_init(&(sk)->sk_lock.slock); \ diff --git a/net/core/sock.c b/net/core/sock.c index 323892066def..2d91a92ed26d 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -2324,6 +2324,11 @@ static void __sk_destruct(struct rcu_head *head) __netns_tracker_free(net, &sk->ns_tracker, false); net_passive_dec(net); } + +#if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(MODULE) + if (sk->sk_owner) + module_put(sk->sk_owner); +#endif sk_prot_free(sk->sk_prot_creator, sk); } ---8<--- > > 3. CVE Process Question: Given that CVE-2024-54680 appears to "fix" a non-existent issue > while introducing an actual vulnerability, what's the appropriate way to address this? > > What's the best path forward? > > Best regards, > Wang Zhaolong > > > Adding fsdevel and networking in case any thoughts on this fix for > > network/namespaces refcount issue (that causes rmmod UAF). > > > > Any opinions on Enzo's proposed Fix? > > > > ---------- Forwarded message --------- > > From: Steve French <smfrench@xxxxxxxxx> > > Date: Tue, Dec 17, 2024 at 9:24 PM > > Subject: [PATCH][SMB3 client] fix TCP timers deadlock after rmmod > > To: CIFS <linux-cifs@xxxxxxxxxxxxxxx> > > Cc: Kuniyuki Iwashima <kuniyu@xxxxxxxxxx>, Enzo Matsumiya <ematsumiya@xxxxxxx> > > > > > > Enzo had an interesting patch, that seems to fix an important problem. > > > > Here was his repro scenario: > > > > tw:~ # mount.cifs -o credentials=/root/wincreds,echo_interval=10 > > //someserver/target1 /mnt/test > > tw:~ # ls /mnt/test > > abc dir1 dir3 target1_file.txt tsub > > tw:~ # iptables -A INPUT -s someserver -j DROP > > > > Trigger reconnect and wait for 3*echo_interval: > > > > tw:~ # cat /mnt/test/target1_file.txt > > cat: /mnt/test/target1_file.txt: Host is down > > > > Then umount and rmmod. Note that rmmod might take several iterations > > until it properly tears down everything, so make sure you see the "not > > loaded" message before proceeding: > > > > tw:~ # umount /mnt/*; rmmod cifs > > umount: /mnt/az: not mounted. > > umount: /mnt/dfs: not mounted. > > umount: /mnt/local: not mounted. > > umount: /mnt/scratch: not mounted. > > rmmod: ERROR: Module cifs is in use > > ... > > tw:~ # rmmod cifs > > rmmod: ERROR: Module cifs is not currently loaded > > > > Then kickoff the TCP internals: > > tw:~ # iptables -F > > > > Gets the lockdep warning (requires CONFIG_LOCKDEP=y) + a NULL deref > > later on. > > > > > > Any thoughts on his patch? See below (and attached) > > > > Commit ef7134c7fc48 ("smb: client: Fix use-after-free of network > > namespace.") > > fixed a netns UAF by manually enabled socket refcounting > > (sk->sk_net_refcnt=1 and sock_inuse_add(net, 1)). > > > > The reason the patch worked for that bug was because we now hold > > references to the netns (get_net_track() gets a ref internally) > > and they're properly released (internally, on __sk_destruct()), > > but only because sk->sk_net_refcnt was set. > > > > Problem: > > (this happens regardless of CONFIG_NET_NS_REFCNT_TRACKER and regardless > > if init_net or other) > > > > Setting sk->sk_net_refcnt=1 *manually* and *after* socket creation is not > > only out of cifs scope, but also technically wrong -- it's set conditionally > > based on user (=1) vs kernel (=0) sockets. And net/ implementations > > seem to base their user vs kernel space operations on it. > > > > e.g. upon TCP socket close, the TCP timers are not cleared because > > sk->sk_net_refcnt=1: > > (cf. commit 151c9c724d05 ("tcp: properly terminate timers for > > kernel sockets")) > > > > net/ipv4/tcp.c: > > void tcp_close(struct sock *sk, long timeout) > > { > > lock_sock(sk); > > __tcp_close(sk, timeout); > > release_sock(sk); > > if (!sk->sk_net_refcnt) > > inet_csk_clear_xmit_timers_sync(sk); > > sock_put(sk); > > } > > > > Which will throw a lockdep warning and then, as expected, deadlock on > > tcp_write_timer(). > > > > A way to reproduce this is by running the reproducer from ef7134c7fc48 > > and then 'rmmod cifs'. A few seconds later, the deadlock/lockdep > > warning shows up. > > > > Fix: > > We shouldn't mess with socket internals ourselves, so do not set > > sk_net_refcnt manually. > > > > Also change __sock_create() to sock_create_kern() for explicitness. > > > > As for non-init_net network namespaces, we deal with it the best way > > we can -- hold an extra netns reference for server->ssocket and drop it > > when it's released. This ensures that the netns still exists whenever > > we need to create/destroy server->ssocket, but is not directly tied to > > it. > > > > Fixes: ef7134c7fc48 ("smb: client: Fix use-after-free of network > > namespace.") > > > > > > -- > > Thanks, > > > > Steve