2025/5/23 03:25, "Martin KaFai Lau" <martin.lau@xxxxxxxxx> wrote:

> On 5/16/25 7:17 AM, Jiayuan Chen wrote:
> > The sk->sk_socket is not locked or referenced in backlog thread, and
> > during the call to skb_send_sock(), there is a race condition with
> > the release of sk_socket. All types of sockets(tcp/udp/unix/vsock)
> > will be affected.
> >
> > Race conditions:
> > '''
> > CPU0                               CPU1
> >
> > backlog::skb_send_sock
> >   sendmsg_unlocked
> >     sock_sendmsg
> >       sock_sendmsg_nosec
> >                                    close(fd):
> >                                      ...
> >                                      ops->release() -> sock_map_close()
> >                                      sk_socket->ops = NULL
> >                                      free(socket)
> >       sock->ops->sendmsg
> >             ^
> >             panic here
> > '''
> >
> > The ref of psock become 0 after sock_map_close() executed.
> > '''
> > void sock_map_close()
> > {
> >     ...
> >     if (likely(psock)) {
> >     ...
> >     // !! here we remove psock and the ref of psock become 0
> >     sock_map_remove_links(sk, psock)
> >     psock = sk_psock_get(sk);
> >     if (unlikely(!psock))
> >         goto no_psock; <=== Control jumps here via goto
> >     ...
> >     cancel_delayed_work_sync(&psock->work); <=== not executed
> >     sk_psock_put(sk, psock);
> >     ...
> > }
> > '''
> >
> > Based on the fact that we already wait for the workqueue to finish in
> > sock_map_close() if psock is held, we simply increase the psock
> > reference count to avoid race conditions.
> >
> > With this patch, if the backlog thread is running, sock_map_close() will
> > wait for the backlog thread to complete and cancel all pending work.
> >
> > If no backlog running, any pending work that hasn't started by then will
> > fail when invoked by sk_psock_get(), as the psock reference count have
> > been zeroed, and sk_psock_drop() will cancel all jobs via
> > cancel_delayed_work_sync().
> >
> > In summary, we require synchronization to coordinate the backlog thread
> > and close() thread.
> >
> > The panic I catched:
> > '''
> > Workqueue: events sk_psock_backlog
> > RIP: 0010:sock_sendmsg+0x21d/0x440
> > RAX: 0000000000000000 RBX: ffffc9000521fad8 RCX: 0000000000000001
> > ...
> > Call Trace:
> > <TASK>
> > ? die_addr+0x40/0xa0
> > ? exc_general_protection+0x14c/0x230
> > ? asm_exc_general_protection+0x26/0x30
> > ? sock_sendmsg+0x21d/0x440
> > ? sock_sendmsg+0x3e0/0x440
> > ? __pfx_sock_sendmsg+0x10/0x10
> > __skb_send_sock+0x543/0xb70
> > sk_psock_backlog+0x247/0xb80
> > ...
> > '''
> >
> > Reported-by: Michal Luczaj <mhal@xxxxxxx>
> > Fixes: 4b4647add7d3 ("sock_map: avoid race between sock_map_close and sk_psock_put")
> > Signed-off-by: Jiayuan Chen <jiayuan.chen@xxxxxxxxx>
> >
> > ---
> > V5 -> V6: Use correct "Fixes" tag.
> > V4 -> V5:
> > This patch is extracted from my previous v4 patchset that contained
> > multiple fixes, and it remains unchanged. Since this fix is relatively
> > simple and easy to review, we want to separate it from other fixes to
> > avoid any potential interference.
> > ---
> >  net/core/skmsg.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> >
> > diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> > index 276934673066..34c51eb1a14f 100644
> > --- a/net/core/skmsg.c
> > +++ b/net/core/skmsg.c
> > @@ -656,6 +656,13 @@ static void sk_psock_backlog(struct work_struct *work)
> >  	bool ingress;
> >  	int ret;
> >
> > +	/* Increment the psock refcnt to synchronize with close(fd) path in
> > +	 * sock_map_close(), ensuring we wait for backlog thread completion
> > +	 * before sk_socket freed. If refcnt increment fails, it indicates
> > +	 * sock_map_close() completed with sk_socket potentially already freed.
> > +	 */
> > +	if (!sk_psock_get(psock->sk))
>
> This seems to be the first use case to pass "psock->sk" to "sk_psock_get()".
> I could have missed the sock_map details here. Considering it is racing with
> sock_map_close() which should also do a sock_put(sk) [?],
> could you help to explain what makes it safe to access the psock->sk here?
>
> > +		return;
> >  	mutex_lock(&psock->work_mutex);
> >  	while ((skb = skb_peek(&psock->ingress_skb))) {
> >  		len = skb->len;
> > @@ -708,6 +715,7 @@ static void sk_psock_backlog(struct work_struct *work)
> >  	}
> >  end:
> >  	mutex_unlock(&psock->work_mutex);
> > +	sk_psock_put(psock->sk, psock);
> >  }
> >
> >  struct sk_psock *sk_psock_init(struct sock *sk, int node)

Hi Martin,

Using 'sk_psock_get(psock->sk)' in the workqueue is safe because
sock_map_close() only reduces the reference count of psock to zero, while
the actual memory release is fully handled by the RCU callback:
sk_psock_destroy().

In sk_psock_destroy(), we first call cancel_delayed_work_sync() to wait for
the workqueue to complete, and only then perform sock_put(psock->sk). This
means we already have an explicit synchronization mechanism in place that
guarantees safe access to both psock and psock->sk in the workqueue context.

Thanks.
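
To make the ordering argument concrete, here is a minimal userspace sketch of
the "take a reference only if the count is still non-zero" pattern that the
patch relies on. It is not kernel code: struct model_psock, model_psock_get(),
model_psock_put() and model_backlog() are hypothetical names invented for
illustration, C11 atomics stand in for the kernel's refcount_t, and the RCU
and cancel_delayed_work_sync() steps described above are deliberately left
out.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct sk_psock; not kernel code. */
struct model_psock {
	atomic_int refcnt;
	int work_runs;	/* how many times the backlog body actually ran */
};

/* Take a reference only if the count is still non-zero. This mirrors the
 * idea behind an inc-not-zero refcount helper, modelled with a CAS loop.
 */
static bool model_psock_get(struct model_psock *p)
{
	int old = atomic_load(&p->refcnt);

	while (old != 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&p->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if the caller dropped the last one. */
static bool model_psock_put(struct model_psock *p)
{
	return atomic_fetch_sub(&p->refcnt, 1) == 1;
}

/* Models the patched backlog handler: bail out early if the close()
 * path has already dropped the count to zero.
 */
static void model_backlog(struct model_psock *p)
{
	if (!model_psock_get(p))
		return;
	p->work_runs++;		/* stands in for the skb_send_sock() loop */
	model_psock_put(p);
}
```

In this model, calling model_backlog() while the count is 1 runs the work
once and leaves the count at 1; after a model_psock_put() drops the last
reference (standing in for sock_map_close()), a later model_backlog() call
returns without touching the work body. The real code additionally depends on
sk_psock_destroy() waiting for the work item before sock_put(psock->sk),
which this sketch does not model.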