On Thu, Jul 03, 2025 at 10:17 AM +08, Zijian Zhang wrote:
> On 7/2/25 8:17 PM, Jakub Sitnicki wrote:
>> On Mon, Jun 30, 2025 at 06:12 PM -07, Cong Wang wrote:
>>> From: Zijian Zhang <zijianzhang@xxxxxxxxxxxxx>
>>>
>>> The TCP_BPF ingress redirection path currently lacks the message corking
>>> mechanism found in standard TCP. This causes the sender to wake up the
>>> receiver for every message, even when messages are small, resulting in
>>> reduced throughput compared to regular TCP in certain scenarios.
>>
>> I'm curious what scenarios you are referring to. Is it send-to-local or
>> ingress-to-local? [1]
>>
>
> Thanks for your attention and detailed review!
> We are referring to "send-to-local" here.
>
>> If the sender is emitting small messages, that's probably intended -
>> that is, they likely want to get the message across as soon as possible,
>> since they must have disabled the Nagle algo (set TCP_NODELAY) to do
>> that.
>> Otherwise, you get small segment merging on the sender side by default.
>> And if MTU is a limiting factor, you should also be getting batching
>> from GRO.
>> What I'm getting at is that I don't quite follow why you don't see
>> sufficient batching before the sockmap redirect today.
>>
>
> IMHO,
>
> In the "send-to-local" case, both the sender and receiver sockets are
> added to the sockmap. Their protocol ops are switched from regular TCP
> to the tcp_bpf variant, so that sendmsg() invokes tcp_bpf_sendmsg()
> instead of tcp_sendmsg(). On this path, the whole process consists of
> building an skmsg and moving it to the receiver socket's ingress queue
> immediately. No sk_buff is generated along the way, so we cannot
> benefit from the TCP stack's optimizations. As a result, small segments
> are not merged by default, which is why I am implementing skmsg
> coalescing here.
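>
> For reference, the setup we are talking about is roughly a sockmap
> holding both sockets plus an sk_msg program that redirects the sender's
> data straight into the peer's ingress queue. A minimal sketch (the map
> layout, the fixed peer index, and the program name are illustrative
> only, not taken from this patchset):
>
>   // SPDX-License-Identifier: GPL-2.0
>   // Minimal sk_msg redirect sketch: data written by the sender is moved
>   // to the ingress queue of the socket stored at PEER_KEY; no sk_buff is
>   // built and the regular TCP transmit path is bypassed.
>   #include <linux/bpf.h>
>   #include <bpf/bpf_helpers.h>
>
>   struct {
>           __uint(type, BPF_MAP_TYPE_SOCKMAP);
>           __uint(max_entries, 2);
>           __type(key, __u32);
>           __type(value, __u64);
>   } sock_map SEC(".maps");
>
>   #define PEER_KEY 1 /* illustrative: slot where the receiver socket sits */
>
>   SEC("sk_msg")
>   int redirect_to_peer(struct sk_msg_md *msg)
>   {
>           /* BPF_F_INGRESS queues the skmsg on the peer's ingress_msg
>            * list instead of sending it out via the peer's transmit path.
>            */
>           return bpf_msg_redirect_map(msg, &sock_map, PEER_KEY,
>                                       BPF_F_INGRESS);
>   }
>
>   char _license[] SEC("license") = "GPL";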
>
>>> This change introduces a kernel worker-based intermediate layer to provide
>>> automatic message corking for TCP_BPF. While this adds a slight latency
>>> overhead, it significantly improves overall throughput by reducing
>>> unnecessary wake-ups and sock lock contention.
>>
>> "Slight" for a +5% increase in latency is an understatement :-)
>> IDK about this being always on for every socket. For send-to-local
>> [1], sk_msg redirs can be viewed as a form of IPC, where latency
>> matters.
>> I do understand that you're trying to optimize for bulk-transfer
>> workloads, but please also consider request-response workloads.
>>
>> [1] https://github.com/jsitnicki/kubecon-2024-sockmap/blob/main/cheatsheet-sockmap-redirect.png
>>
>
> Totally understand that request-response workloads are also very
> important.
>
> Here are my thoughts:
>
> I had an idea before: when the user sets TCP_NODELAY, we could follow
> the original code path. However, I'm concerned about a specific
> scenario: if a user sends part of a message and then sets TCP_NODELAY
> to send another message, it's possible that messages handed to the
> kworker haven't reached the ingress_msg list yet (maybe due to delayed
> kworker scheduling), while the messages sent with TCP_NODELAY have
> already arrived. This could disrupt the ordering. Since there is no TCP
> packet formation or retransmission mechanism on this path, once the
> order is disrupted, it stays that way.
>
> As a result, I propose:
>
> - When the user sets TCP_NODELAY, introduce a wait (I haven't
>   determined the exact location yet; maybe in tcp_bpf_sendmsg) to
>   ensure all messages queued to the kworker have been delivered before
>   proceeding. Then follow the original path for transmission.
>
> - When the user switches TCP_NODELAY back off (returning to the
>   delayed/corked behavior), it's less of an issue. We can simply follow
>   our current code path.
>
> If the 5% degradation is not a blocker for this patchset, and the
> solution looks good to you, we will do it in the next patchset.

I'm all for reaping the benefits of batching, but I'm not thrilled about
having a backlog worker on the path. The one we have on the sk_skb path
has been a bottleneck:

1) There's no backpressure propagation, so you can have a backlog
   build-up. One thing to check is what happens if the receiver closes
   its window.

2) There's a scheduling latency. That's why the performance of splicing
   sockets with sockmap (ingress-to-egress) looks bleak [1].

So I have to dig deeper...

Have you considered and/or evaluated any alternative designs? For
instance, what stops us from having an auto-corking / coalescing
strategy on the sender side?

[1] https://blog.cloudflare.com/sockmap-tcp-splicing-of-the-future/
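Just to make that last question more concrete, here is a hand-wavy
sketch of the kind of check I have in mind on the sender side, before we
queue a new skmsg for the peer. All of the names below
(tcp_bpf_deliver_to_peer, peer_has_unread_data, coalesce_into_tail_msg,
queue_msg_and_wake) are made up for illustration; nothing like this
exists today:

  /* Pseudo-code, not against any tree. Idea: if the peer still has
   * unread data sitting in its ingress queue, append the new bytes to
   * the tail skmsg instead of queueing another skmsg and waking the
   * receiver again. A slow reader then naturally sees fewer, larger
   * messages, without a kworker in between.
   */
  static int tcp_bpf_deliver_to_peer(struct sock *peer, struct sk_msg *msg)
  {
          if (peer_has_unread_data(peer) &&
              coalesce_into_tail_msg(peer, msg))
                  return 0;       /* merged, no extra wakeup needed */

          /* Nothing pending on the peer: queue as today and wake it. */
          return queue_msg_and_wake(peer, msg);
  }

The appeal, at least on paper, is that it adds no extra latency when the
receiver is keeping up, which is the case that matters for
request-response workloads.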