May 29, 2025 at 07:46, "John Fastabend" <john.fastabend@xxxxxxxxx> wrote: > > On 2025-05-24 00:22:19, Jiayuan Chen wrote: > > > > > We observed an issue from the latest selftest: sockmap_redir where > > > > sk_psock(psock->sk) != psock in the backlog. The root cause is the special > > > > behavior in sockmap_redir - it frequently performs map_update() and > > > > map_delete() on the same socket. During map_update(), we create a new > > > > psock and during map_delete(), we eventually free the psock via rcu_work > > > > in sk_psock_drop(). However, pending workqueues might still exist and not > > > > be processed yet. If users immediately perform another map_update(), a new > > > > psock will be allocated for the same sk, resulting in two psocks pointing > > > > to the same sk. > > > > > > > > When the pending workqueue is later triggered, it uses the old psock to > > > > access sk for I/O operations, which is incorrect. > > > > > > > > Timing Diagram: > > > > > > > > cpu0 cpu1 > > > > > > > > map_update(sk): > > > > sk->psock = psock1 > > > > psock1->sk = sk > > > > map_delete(sk): > > > > rcu_work_free(psock1) > > > > > > > > map_update(sk): > > > > sk->psock = psock2 > > > > psock2->sk = sk > > > > workqueue: > > > > wakeup with psock1, but the sk of psock1 > > > > doesn't belong to psock1 > > > > rcu_handler: > > > > clean psock1 > > > > free(psock1) > > > > > > > > Previously, we used reference counting to address the concurrency issue > > > > between backlog and sock_map_close(). This logic remains necessary as it > > > > prevents the sk from being freed while processing the backlog. But this > > > > patch prevents pending backlogs from using a psock after it has been > > > > freed. > > > > Nit, its not that psock would be freed because we do have the > > cancel_delayed_work_sync() before the kfree(psock). But this > > is not a good state with two psocks referenceing the same sk. > > > > > Note: We cannot call cancel_delayed_work_sync() in map_delete() since this > > > > might be invoked in BPF context by BPF helper, and the function may sleep. > > > > > > > > Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") > > > > Signed-off-by: Jiayuan Chen <jiayuan.chen@xxxxxxxxx> > > > > > > > > --- > > > > Thanks to Michal Luczaj for providing the sockmap_redir test case, which > > > > indeed covers almost all sockmap forwarding paths. > > > > --- > > > > include/linux/skmsg.h | 1 + > > > > net/core/skmsg.c | 5 ++++- > > > > 2 files changed, 5 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h > > > > index 0b9095a281b8..b17221eef2f4 100644 > > > > --- a/include/linux/skmsg.h > > > > +++ b/include/linux/skmsg.h > > > > @@ -67,6 +67,7 @@ struct sk_psock_progs { > > > > enum sk_psock_state_bits { > > > > SK_PSOCK_TX_ENABLED, > > > > SK_PSOCK_RX_STRP_ENABLED, > > > > + SK_PSOCK_DROPPED, > > > > }; > > > > > > > > struct sk_psock_link { > > > > diff --git a/net/core/skmsg.c b/net/core/skmsg.c > > > > index 34c51eb1a14f..bd58a693ce9a 100644 > > > > --- a/net/core/skmsg.c > > > > +++ b/net/core/skmsg.c > > > > @@ -656,6 +656,9 @@ static void sk_psock_backlog(struct work_struct *work) > > > > bool ingress; > > > > int ret; > > > > > > > > + if (sk_psock_test_state(psock, SK_PSOCK_DROPPED)) > > > > + return; > > > > Could we use the SK_PSOCK_TX_ENABLED bit here? Its already used to > > ensure we wont requeue work after the psock has started being > > removed. Seems like we don't need two flags? wdyt? > > > > > + > > > > /* Increment the psock refcnt to synchronize with close(fd) path in > > > > * sock_map_close(), ensuring we wait for backlog thread completion > > > > * before sk_socket freed. If refcnt increment fails, it indicates > > > > @@ -867,7 +870,7 @@ void sk_psock_drop(struct sock *sk, struct sk_psock *psock) > > > > write_unlock_bh(&sk->sk_callback_lock); > > > > > > > > sk_psock_stop(psock); > > > > Can we add this to sk_psock_stop where we have the TX_ENABLED bit > > cleared. Thanks, I just add SK_PSOCK_TX_ENABLED checking at the start of sk_psock_backlog(). Every works fine, and truly no more flag needed ! diff --git a/net/core/skmsg.c b/net/core/skmsg.c index 34c51eb1a14f..83c78379932e 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -656,6 +656,13 @@ static void sk_psock_backlog(struct work_struct *work) bool ingress; int ret; + /* If sk is quickly removed from the map and then added back, the old + * psock should not be scheduled, because there are now two psocks + * pointing to the same sk. + */ + if (!sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) + return; + /* Increment the psock refcnt to synchronize with close(fd) path in * sock_map_close(), ensuring we wait for backlog thread completion * before sk_socket freed. If refcnt increment fails, it indicates > > > > - > > > > + sk_psock_set_state(psock, SK_PSOCK_DROPPED); > > > > INIT_RCU_WORK(&psock->rwork, sk_psock_destroy); > > > > queue_rcu_work(system_wq, &psock->rwork); > > > > } > > > > -- > > > > 2.47.1 > > >