Re: [Patch bpf-next v3 4/4] tcp_bpf: improve ingress redirection performance with message corking

John Fastabend <john.fastabend@xxxxxxxxx> · Fri, 30 May 2025 13:07:35 -0700

On 2025-05-19 13:36:28, Cong Wang wrote:
> From: Zijian Zhang <zijianzhang@xxxxxxxxxxxxx>
> 
> The TCP_BPF ingress redirection path currently lacks the message corking
> mechanism found in standard TCP. This causes the sender to wake up the
> receiver for every message, even when messages are small, resulting in
> reduced throughput compared to regular TCP in certain scenarios.
> 
> This change introduces a kernel worker-based intermediate layer to provide
> automatic message corking for TCP_BPF. While this adds a slight latency
> overhead, it significantly improves overall throughput by reducing
> unnecessary wake-ups and reducing the sock lock contention.
> 
> Reviewed-by: Amery Hung <amery.hung@xxxxxxxxxxxxx>
> Co-developed-by: Cong Wang <cong.wang@xxxxxxxxxxxxx>
> Signed-off-by: Cong Wang <cong.wang@xxxxxxxxxxxxx>
> Signed-off-by: Zijian Zhang <zijianzhang@xxxxxxxxxxxxx>
> ---
>  include/linux/skmsg.h |  19 ++++
>  net/core/skmsg.c      | 139 ++++++++++++++++++++++++++++-
>  net/ipv4/tcp_bpf.c    | 197 ++++++++++++++++++++++++++++++++++++++++--
>  3 files changed, 347 insertions(+), 8 deletions(-)

[...]

> +	/* At this point, the data has been handled well. If one of the
> +	 * following conditions is met, we can notify the peer socket in
> +	 * the context of this system call immediately.
> +	 * 1. If the write buffer has been used up;
> +	 * 2. Or, the message size is larger than TCP_BPF_GSO_SIZE;
> +	 * 3. Or, the ingress queue was empty;
> +	 * 4. Or, the tcp socket is set to no_delay.
> +	 * Otherwise, kick off the backlog work so that we can have some
> +	 * time to wait for any incoming messages before sending a
> +	 * notification to the peer socket.
> +	 */

OK this series looks like it should work to me. See one small comment
below. Also from the perf numbers in the cover letter is the latency
difference reduced/removed if the socket is set to no_delay?

> +	nonagle = tcp_sk(sk)->nonagle;
> +	if (!sk_stream_memory_free(sk) ||
> +	    tot_size >= TCP_BPF_GSO_SIZE || ingress_msg_empty ||
> +	    (!(nonagle & TCP_NAGLE_CORK) && (nonagle & TCP_NAGLE_OFF))) {
> +		release_sock(sk);
> +		psock->backlog_work_delayed = false;
> +		sk_psock_backlog_msg(psock);
> +		lock_sock(sk);
> +	} else {
> +		sk_psock_run_backlog_work(psock, false);
> +	}
> +
> +error:
> +	sk_psock_put(sk_redir, psock);
> +	return ret;
> +}
> +
>  static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
>  				struct sk_msg *msg, int *copied, int flags)
>  {
> @@ -442,18 +619,24 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
>  			cork = true;
>  			psock->cork = NULL;
>  		}
> -		release_sock(sk);
>  
> -		origsize = msg->sg.size;
> -		ret = tcp_bpf_sendmsg_redir(sk_redir, redir_ingress,
> -					    msg, tosend, flags);
> -		sent = origsize - msg->sg.size;
> +		if (redir_ingress) {
> +			ret = tcp_bpf_ingress_backlog(sk, sk_redir, msg, tosend);
> +		} else {
> +			release_sock(sk);
> +
> +			origsize = msg->sg.size;
> +			ret = tcp_bpf_sendmsg_redir(sk_redir, redir_ingress,
> +						    msg, tosend, flags);

nit, we can drop redir ingress at this point from tcp_bpf_sendmsg_redir?
It no longer handles ingress? A follow up patch would probably be fine.

> +			sent = origsize - msg->sg.size;
> +
> +			lock_sock(sk);
> +			sk_mem_uncharge(sk, sent);
> +		}
>  
>  		if (eval == __SK_REDIRECT)
>  			sock_put(sk_redir);

Thanks.