On 2025-05-19 13:36:28, Cong Wang wrote: > From: Zijian Zhang <zijianzhang@xxxxxxxxxxxxx> > > The TCP_BPF ingress redirection path currently lacks the message corking > mechanism found in standard TCP. This causes the sender to wake up the > receiver for every message, even when messages are small, resulting in > reduced throughput compared to regular TCP in certain scenarios. > > This change introduces a kernel worker-based intermediate layer to provide > automatic message corking for TCP_BPF. While this adds a slight latency > overhead, it significantly improves overall throughput by reducing > unnecessary wake-ups and reducing the sock lock contention. > > Reviewed-by: Amery Hung <amery.hung@xxxxxxxxxxxxx> > Co-developed-by: Cong Wang <cong.wang@xxxxxxxxxxxxx> > Signed-off-by: Cong Wang <cong.wang@xxxxxxxxxxxxx> > Signed-off-by: Zijian Zhang <zijianzhang@xxxxxxxxxxxxx> > --- > include/linux/skmsg.h | 19 ++++ > net/core/skmsg.c | 139 ++++++++++++++++++++++++++++- > net/ipv4/tcp_bpf.c | 197 ++++++++++++++++++++++++++++++++++++++++-- > 3 files changed, 347 insertions(+), 8 deletions(-) [...] > + /* At this point, the data has been handled well. If one of the > + * following conditions is met, we can notify the peer socket in > + * the context of this system call immediately. > + * 1. If the write buffer has been used up; > + * 2. Or, the message size is larger than TCP_BPF_GSO_SIZE; > + * 3. Or, the ingress queue was empty; > + * 4. Or, the tcp socket is set to no_delay. > + * Otherwise, kick off the backlog work so that we can have some > + * time to wait for any incoming messages before sending a > + * notification to the peer socket. > + */ OK this series looks like it should work to me. See one small comment below. Also from the perf numbers in the cover letter is the latency difference reduced/removed if the socket is set to no_delay? > + nonagle = tcp_sk(sk)->nonagle; > + if (!sk_stream_memory_free(sk) || > + tot_size >= TCP_BPF_GSO_SIZE || ingress_msg_empty || > + (!(nonagle & TCP_NAGLE_CORK) && (nonagle & TCP_NAGLE_OFF))) { > + release_sock(sk); > + psock->backlog_work_delayed = false; > + sk_psock_backlog_msg(psock); > + lock_sock(sk); > + } else { > + sk_psock_run_backlog_work(psock, false); > + } > + > +error: > + sk_psock_put(sk_redir, psock); > + return ret; > +} > + > static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock, > struct sk_msg *msg, int *copied, int flags) > { > @@ -442,18 +619,24 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock, > cork = true; > psock->cork = NULL; > } > - release_sock(sk); > > - origsize = msg->sg.size; > - ret = tcp_bpf_sendmsg_redir(sk_redir, redir_ingress, > - msg, tosend, flags); > - sent = origsize - msg->sg.size; > + if (redir_ingress) { > + ret = tcp_bpf_ingress_backlog(sk, sk_redir, msg, tosend); > + } else { > + release_sock(sk); > + > + origsize = msg->sg.size; > + ret = tcp_bpf_sendmsg_redir(sk_redir, redir_ingress, > + msg, tosend, flags); nit, we can drop redir ingress at this point from tcp_bpf_sendmsg_redir? It no longer handles ingress? A follow up patch would probably be fine. > + sent = origsize - msg->sg.size; > + > + lock_sock(sk); > + sk_mem_uncharge(sk, sent); > + } > > if (eval == __SK_REDIRECT) > sock_put(sk_redir); Thanks.