On 2025-05-24 00:22:19, Jiayuan Chen wrote: > We observed an issue from the latest selftest: sockmap_redir where > sk_psock(psock->sk) != psock in the backlog. The root cause is the special > behavior in sockmap_redir - it frequently performs map_update() and > map_delete() on the same socket. During map_update(), we create a new > psock and during map_delete(), we eventually free the psock via rcu_work > in sk_psock_drop(). However, pending workqueues might still exist and not > be processed yet. If users immediately perform another map_update(), a new > psock will be allocated for the same sk, resulting in two psocks pointing > to the same sk. > > When the pending workqueue is later triggered, it uses the old psock to > access sk for I/O operations, which is incorrect. > > Timing Diagram: > > cpu0 cpu1 > > map_update(sk): > sk->psock = psock1 > psock1->sk = sk > map_delete(sk): > rcu_work_free(psock1) > > map_update(sk): > sk->psock = psock2 > psock2->sk = sk > workqueue: > wakeup with psock1, but the sk of psock1 > doesn't belong to psock1 > rcu_handler: > clean psock1 > free(psock1) > > Previously, we used reference counting to address the concurrency issue > between backlog and sock_map_close(). This logic remains necessary as it > prevents the sk from being freed while processing the backlog. But this > patch prevents pending backlogs from using a psock after it has been > freed. Nit, its not that psock would be freed because we do have the cancel_delayed_work_sync() before the kfree(psock). But this is not a good state with two psocks referenceing the same sk. > > Note: We cannot call cancel_delayed_work_sync() in map_delete() since this > might be invoked in BPF context by BPF helper, and the function may sleep. > > Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") > Signed-off-by: Jiayuan Chen <jiayuan.chen@xxxxxxxxx> > > --- > Thanks to Michal Luczaj for providing the sockmap_redir test case, which > indeed covers almost all sockmap forwarding paths. > --- > include/linux/skmsg.h | 1 + > net/core/skmsg.c | 5 ++++- > 2 files changed, 5 insertions(+), 1 deletion(-) > > diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h > index 0b9095a281b8..b17221eef2f4 100644 > --- a/include/linux/skmsg.h > +++ b/include/linux/skmsg.h > @@ -67,6 +67,7 @@ struct sk_psock_progs { > enum sk_psock_state_bits { > SK_PSOCK_TX_ENABLED, > SK_PSOCK_RX_STRP_ENABLED, > + SK_PSOCK_DROPPED, > }; > > struct sk_psock_link { > diff --git a/net/core/skmsg.c b/net/core/skmsg.c > index 34c51eb1a14f..bd58a693ce9a 100644 > --- a/net/core/skmsg.c > +++ b/net/core/skmsg.c > @@ -656,6 +656,9 @@ static void sk_psock_backlog(struct work_struct *work) > bool ingress; > int ret; > > + if (sk_psock_test_state(psock, SK_PSOCK_DROPPED)) > + return; Could we use the SK_PSOCK_TX_ENABLED bit here? Its already used to ensure we wont requeue work after the psock has started being removed. Seems like we don't need two flags? wdyt? > + > /* Increment the psock refcnt to synchronize with close(fd) path in > * sock_map_close(), ensuring we wait for backlog thread completion > * before sk_socket freed. If refcnt increment fails, it indicates > @@ -867,7 +870,7 @@ void sk_psock_drop(struct sock *sk, struct sk_psock *psock) > write_unlock_bh(&sk->sk_callback_lock); > > sk_psock_stop(psock); Can we add this to sk_psock_stop where we have the TX_ENABLED bit cleared. > - > + sk_psock_set_state(psock, SK_PSOCK_DROPPED); > INIT_RCU_WORK(&psock->rwork, sk_psock_destroy); > queue_rcu_work(system_wq, &psock->rwork); > } > -- > 2.47.1 >