Re: [Patch bpf-next v3 4/4] tcp_bpf: improve ingress redirection performance with message corking

On 5/30/25 1:07 PM, John Fastabend wrote:
On 2025-05-19 13:36:28, Cong Wang wrote:
From: Zijian Zhang <zijianzhang@xxxxxxxxxxxxx>

The TCP_BPF ingress redirection path currently lacks the message corking
mechanism found in standard TCP. This causes the sender to wake up the
receiver for every message, even when messages are small, resulting in
reduced throughput compared to regular TCP in certain scenarios.

This change introduces a kernel worker-based intermediate layer to provide
automatic message corking for TCP_BPF. While this adds a slight latency
overhead, it significantly improves overall throughput by reducing
unnecessary wake-ups and sock lock contention.

Reviewed-by: Amery Hung <amery.hung@xxxxxxxxxxxxx>
Co-developed-by: Cong Wang <cong.wang@xxxxxxxxxxxxx>
Signed-off-by: Cong Wang <cong.wang@xxxxxxxxxxxxx>
Signed-off-by: Zijian Zhang <zijianzhang@xxxxxxxxxxxxx>
---
  include/linux/skmsg.h |  19 ++++
  net/core/skmsg.c      | 139 ++++++++++++++++++++++++++++-
  net/ipv4/tcp_bpf.c    | 197 ++++++++++++++++++++++++++++++++++++++++--
  3 files changed, 347 insertions(+), 8 deletions(-)

[...]

+	/* At this point, the data has been handled well. If one of the
+	 * following conditions is met, we can notify the peer socket in
+	 * the context of this system call immediately.
+	 * 1. If the write buffer has been used up;
+	 * 2. Or, the message size is larger than TCP_BPF_GSO_SIZE;
+	 * 3. Or, the ingress queue was empty;
+	 * 4. Or, the tcp socket is set to no_delay.
+	 * Otherwise, kick off the backlog work so that we can have some
+	 * time to wait for any incoming messages before sending a
+	 * notification to the peer socket.
+	 */


OK this series looks like it should work to me. See one small comment
below. Also, from the perf numbers in the cover letter, is the latency
difference reduced/removed if the socket is set to no_delay?


Even if the socket is set to no_delay, we still see a minor latency
difference. The main reason is that we now have dynamic skmsg allocation
and a kworker hop in the middle, so the path is more complex than before.
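
To make the decision above easier to follow, here is a minimal userspace
sketch of the same corking idea (build with cc -pthread). It is only an
analogy, not the kernel code, and every name in it (cork_queue,
CORK_GSO_SIZE, notify_peer, the 1ms corking window) is hypothetical: the
sender wakes the peer immediately when the buffer is used up, the message
is large, the queue was empty, or no_delay is set; otherwise the data
stays corked and a worker delivers one batched wakeup later.

/*
 * Userspace analogy of the corking decision in this series -- not the
 * kernel code; every name here (cork_queue, CORK_GSO_SIZE, notify_peer,
 * the 1ms corking window) is made up for illustration.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define CORK_GSO_SIZE	65536			/* stand-in for TCP_BPF_GSO_SIZE */
#define CORK_BUF_SIZE	(4 * CORK_GSO_SIZE)	/* stand-in for the write buffer */

struct cork_queue {
	pthread_mutex_t	lock;
	pthread_cond_t	wake;		/* "notify the peer socket" */
	size_t		queued;		/* bytes in the peer's ingress queue */
	bool		notify_scheduled;
	bool		no_delay;	/* like TCP_NODELAY on the sender */
};

static void notify_peer(struct cork_queue *q)
{
	q->notify_scheduled = false;
	pthread_cond_signal(&q->wake);	/* one wakeup for all corked data */
}

/* Plays the role of the psock backlog kworker: give the sender a short
 * window to append more data before waking the peer.
 */
static void *delayed_notify_worker(void *arg)
{
	struct cork_queue *q = arg;

	usleep(1000);
	pthread_mutex_lock(&q->lock);
	if (q->notify_scheduled && q->queued)
		notify_peer(q);
	pthread_mutex_unlock(&q->lock);
	return NULL;
}

/* The "peer socket": sleeps until notified, then drains its queue. */
static void *peer_recv(void *arg)
{
	struct cork_queue *q = arg;

	pthread_mutex_lock(&q->lock);
	for (;;) {
		while (!q->queued)
			pthread_cond_wait(&q->wake, &q->lock);
		printf("peer woke up, read %zu bytes\n", q->queued);
		q->queued = 0;
	}
	return NULL;
}

static void cork_send(struct cork_queue *q, size_t len)
{
	bool was_empty;

	pthread_mutex_lock(&q->lock);
	was_empty = (q->queued == 0);
	q->queued += len;

	/* Same shape as the check in the patch: wake the peer right away if
	 * the buffer is used up, the message is large, the queue was empty,
	 * or no_delay is set; otherwise cork and let the worker notify.
	 */
	if (q->queued >= CORK_BUF_SIZE || len >= CORK_GSO_SIZE ||
	    was_empty || q->no_delay) {
		notify_peer(q);
	} else if (!q->notify_scheduled) {
		pthread_t worker;

		q->notify_scheduled = true;
		pthread_create(&worker, NULL, delayed_notify_worker, q);
		pthread_detach(worker);
	}
	pthread_mutex_unlock(&q->lock);
}

int main(void)
{
	struct cork_queue q = {
		.lock	= PTHREAD_MUTEX_INITIALIZER,
		.wake	= PTHREAD_COND_INITIALIZER,
	};
	pthread_t peer;

	pthread_create(&peer, NULL, peer_recv, &q);
	cork_send(&q, 512);	/* queue was empty: immediate wakeup */
	cork_send(&q, 512);	/* likely corked: batched into one wakeup */
	cork_send(&q, 512);
	usleep(5000);		/* let the delayed notification run */
	return 0;		/* exiting also tears down the peer thread */
}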

+	nonagle = tcp_sk(sk)->nonagle;
+	if (!sk_stream_memory_free(sk) ||
+	    tot_size >= TCP_BPF_GSO_SIZE || ingress_msg_empty ||
+	    (!(nonagle & TCP_NAGLE_CORK) && (nonagle & TCP_NAGLE_OFF))) {
+		release_sock(sk);
+		psock->backlog_work_delayed = false;
+		sk_psock_backlog_msg(psock);
+		lock_sock(sk);
+	} else {
+		sk_psock_run_backlog_work(psock, false);
+	}
+
+error:
+	sk_psock_put(sk_redir, psock);
+	return ret;
+}
+
  static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
  				struct sk_msg *msg, int *copied, int flags)
  {
@@ -442,18 +619,24 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
  			cork = true;
  			psock->cork = NULL;
  		}
-		release_sock(sk);
-		origsize = msg->sg.size;
-		ret = tcp_bpf_sendmsg_redir(sk_redir, redir_ingress,
-					    msg, tosend, flags);
-		sent = origsize - msg->sg.size;
+		if (redir_ingress) {
+			ret = tcp_bpf_ingress_backlog(sk, sk_redir, msg, tosend);
+		} else {
+			release_sock(sk);
+
+			origsize = msg->sg.size;
+			ret = tcp_bpf_sendmsg_redir(sk_redir, redir_ingress,
+						    msg, tosend, flags);

nit: we can drop the redir_ingress argument from tcp_bpf_sendmsg_redir at
this point, since it no longer handles ingress? A follow-up patch would
probably be fine.


Indeed, we will do this in a follow-up patch.

+			sent = origsize - msg->sg.size;
+
+			lock_sock(sk);
+			sk_mem_uncharge(sk, sent);
+		}
  		if (eval == __SK_REDIRECT)
  			sock_put(sk_redir);

Thanks.

Thanks for the review!




