Re: [Patch bpf-next v4 4/4] tcp_bpf: improve ingress redirection performance with message corking

On 7/8/25 1:51 AM, Jakub Sitnicki wrote:
On Thu, Jul 03, 2025 at 09:20 PM -07, Cong Wang wrote:
On Thu, Jul 03, 2025 at 01:32:08PM +0200, Jakub Sitnicki wrote:
I'm all for reaping the benefits of batching, but I'm not thrilled about
having a backlog worker on the path. The one we have on the sk_skb path
has been a bottleneck:

It depends on what you compare with. If you compare it with vanilla
TCP_BPF, we did see a 5% latency increase. If you compare it with
regular TCP, it is still much better. Our goal is to make Cilium's
sockops-enable competitive with regular TCP, hence we compare it with
regular TCP.

I hope this makes sense to you. Sorry if this was not clear in our cover
letter.

Latency-wise I think we should be comparing sk_msg send-to-local against
UDS rather than full-stack TCP.

There is quite a bit of guessing on my side as to what you're looking
for because the cover letter doesn't say much about the use case.


Let me add more details about the use case.

Assume user-space code uses TCP to connect to a peer which may be
local or remote. We are trying to use sockmap to transparently
accelerate the TCP connection when both the sender and the receiver are
on the same machine. User-space code does not need to be modified: local
connections are accelerated, remote connections remain the same.
Because of this transparency requirement, UDS is not an option. UDS
requires user-space code changes, and it implies that users know they
are talking to a local peer.

We assumed that since we bypass the Linux network stack, better
throughput, latency and CPU usage would be observed. However, that's not
the case: throughput is worse when the message size is small (<64k).

It's similar to Cilium's "sockops-enable" config, which is deprecated
mostly because of performance. That config uses sockmap to manage
TCP connections between pods on the same machine.

https://github.com/cilium/cilium/blob/v1.11.4/bpf/sockops/bpf_sockops.c

For instance, do you control the sender?  Why not do big writes on the
sender side if raw throughput is what you care about?


As described above, we assume user space uses TCP, and we cannot change
the user space code.

1) There's no backpressure propagation, so you can have a backlog
build-up. One thing to check is what happens if the receiver closes its
window.

Right, I am sure there is still a lot of room for further optimization.
The only question is how much we need for now. How about
optimizing it one step at a time? :)

This is introducing quite a bit of complexity from the start. I'd like to
at least explore whether it can be done in a simpler fashion before
committing to it.

You point at wake-ups as being the throughput killer. As an alternative,
can we wake up the receiver conditionally? That is, only if the receiver
has made progress on the queue since the last notification. This
could also be a form of wakeup moderation.


Wake-ups are indeed one of the throughput killers, and I agree they can
be mitigated by waking up the receiver conditionally, e.g. along the
lines of the sketch below.
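
Something like this, as illustrative pseudocode only (not part of the
patch; ingress_bytes_copied and last_wake_consumed are made-up fields,
not existing members of struct sk_psock):

/* Illustrative only: skip the wakeup if the receiver has not made
 * progress since the last notification, i.e. it already has a pending
 * notification and is presumably still draining the queue.
 * (First-wakeup handling omitted for brevity.)
 */
static void sk_psock_maybe_data_ready(struct sock *rcv_sk,
				      struct sk_psock *psock)
{
	u64 consumed = READ_ONCE(psock->ingress_bytes_copied);

	if (consumed == psock->last_wake_consumed)
		return;	/* no progress since last wakeup, don't bother */

	psock->last_wake_consumed = consumed;
	rcv_sk->sk_data_ready(rcv_sk);
}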

IIRC, the sock lock is another __main__ throughput killer.
In tcp_bpf_sendmsg, in the context of the sender process, we need to
lock_sock(sender) -> release_sock(sender) -> lock_sock(recv)
-> release_sock(recv) -> lock_sock(sender) -> release_sock(sender).

This makes the sender somewhat dependent on the receiver: while the
receiver is working, the sender will be blocked.

   sender                      receiver
tcp_bpf_sendmsg
                           tcp_bpf_recvmsg (working)
tcp_bpf_sendmsg (blocked)
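
Spelled out in pseudocode (purely illustrative, placeholder names, memory
accounting and queuing details omitted), the per-send choreography is:

/* Illustrative pseudocode of the current sendmsg-side redirect: the
 * sender's context has to take and drop both socket locks, so it
 * serializes against whatever the receiver is doing.
 */
static int redirect_ingress_sketch(struct sock *sender_sk,
				   struct sock *receiver_sk)
{
	lock_sock(sender_sk);
	/* copy user data into an sk_msg, charge sender memory */
	release_sock(sender_sk);

	lock_sock(receiver_sk);
	/* queue the sk_msg on the receiver's psock ingress list */
	release_sock(receiver_sk);

	lock_sock(sender_sk);
	/* continue the sendmsg loop with the next chunk */
	release_sock(sender_sk);

	return 0;
}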


We introduce the kworker here mainly to solve the sock lock issue: we
want senders to only need to acquire the sender's sock lock, and
receivers to only need to acquire the receiver's sock lock. Only the
kworker, as a middle man, needs to take both the sender and receiver
locks to transfer the data from the sender to the receiver. As a result,
tcp_bpf_sendmsg and tcp_bpf_recvmsg can be independent of each other.

   sender                      receiver
tcp_bpf_sendmsg
                           tcp_bpf_recvmsg (working)
tcp_bpf_sendmsg
tcp_bpf_sendmsg
...
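
As a rough sketch (again illustrative only, not the actual patch;
cork_work, redir_sk and the cork list are placeholder names), the worker
would be the only place that needs both locks:

/* Illustrative sketch of the backlog/corking worker: senders and
 * receivers each stay on their own sock lock, only the worker bridges
 * the two. Field names are placeholders.
 */
static void sk_psock_cork_work_sketch(struct work_struct *work)
{
	struct sk_psock *psock = container_of(work, struct sk_psock,
					      cork_work);
	struct sock *snd_sk = psock->sk;
	struct sock *rcv_sk = psock->redir_sk;	/* placeholder */

	lock_sock(snd_sk);
	/* splice the corked sk_msgs off the sender onto a local list */
	release_sock(snd_sk);

	lock_sock(rcv_sk);
	/* append the local list to the receiver's ingress queue and
	 * wake the receiver (sk_data_ready)
	 */
	release_sock(rcv_sk);
}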

2) There's a scheduling latency. That's why the performance of splicing
sockets with sockmap (ingress-to-egress) looks bleak [1].

The same holds for regular TCP: we have to wake up the receiver/worker.
But maybe I misunderstand this point?

What I meant is that, in the pessimistic case, to deliver a message we
now have to go through two wakeups:

sender -wakeup-> kworker -wakeup-> receiver

So I have to dig deeper...

Have you considered and/or evaluated any alternative designs? For
instance, what stops us from having an auto-corking / coalescing
strategy on the sender side?

Auto-corking _may_ not be as easy as in TCP, since essentially we have no
protocol here, just a pure socket layer.

You're right. We don't have a flush signal for auto-corking on the
sender side with sk_msg's.

What about what I mentioned above - can we moderate the wakeups based on
the receiver making progress? Does that sound feasible to you?

Thanks,
-jkbs
