On Thu, Jul 03, 2025 at 10:17 AM +08, Zijian Zhang wrote:
> On 7/2/25 8:17 PM, Jakub Sitnicki wrote:
>> On Mon, Jun 30, 2025 at 06:12 PM -07, Cong Wang wrote:
>>> From: Zijian Zhang <zijianzhang@xxxxxxxxxxxxx>
>>>
>>> The TCP_BPF ingress redirection path currently lacks the message corking
>>> mechanism found in standard TCP. This causes the sender to wake up the
>>> receiver for every message, even when messages are small, resulting in
>>> reduced throughput compared to regular TCP in certain scenarios.
>>
>> I'm curious what scenarios you are referring to. Is it send-to-local or
>> ingress-to-local? [1]
>>
>
> Thanks for your attention and detailed review!
> We are referring to "send-to-local" here.
>
>> If the sender is emitting small messages, that's probably intended -
>> that is, they likely want to get the message across as soon as possible,
>> since they must have disabled the Nagle algo (set TCP_NODELAY) to do
>> that.
>> Otherwise, you get small segment merging on the sender side by default.
>> And if MTU is a limiting factor, you should also be getting batching
>> from GRO.
>> What I'm getting at is that I don't quite follow why you don't see
>> sufficient batching before the sockmap redirect today.
>>
>
> IMHO,
>
> In the "send-to-local" case, both the sender and receiver sockets are
> added to the sockmap. Their protocol ops are switched from regular TCP
> to the tcp_bpf variant, so that sendmsg() invokes tcp_bpf_sendmsg()
> instead of tcp_sendmsg(). On this path, the whole process consists of
> building an skmsg and moving it to the receiver socket's ingress queue
> immediately. No sk_buff is generated along the way, so we cannot
> benefit from the TCP stack's optimizations. As a result, small segments
> are not merged by default, which is why I am implementing skmsg
> coalescing here.
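>
> For reference, the setup we are talking about is roughly a sockmap
> holding both sockets plus an sk_msg program that redirects the sender's
> data straight into the peer's ingress queue. A minimal sketch (the map
> layout, the fixed peer index, and the program name are illustrative
> only, not taken from this patchset):
>
>   // SPDX-License-Identifier: GPL-2.0
>   // Minimal sk_msg redirect sketch: data written by the sender is moved
>   // to the ingress queue of the socket stored at PEER_KEY; no sk_buff is
>   // built and the regular TCP transmit path is bypassed.
>   #include <linux/bpf.h>
>   #include <bpf/bpf_helpers.h>
>
>   struct {
>           __uint(type, BPF_MAP_TYPE_SOCKMAP);
>           __uint(max_entries, 2);
>           __type(key, __u32);
>           __type(value, __u64);
>   } sock_map SEC(".maps");
>
>   #define PEER_KEY 1 /* illustrative: slot where the receiver socket sits */
>
>   SEC("sk_msg")
>   int redirect_to_peer(struct sk_msg_md *msg)
>   {
>           /* BPF_F_INGRESS queues the skmsg on the peer's ingress_msg
>            * list instead of sending it out via the peer's transmit path.
>            */
>           return bpf_msg_redirect_map(msg, &sock_map, PEER_KEY,
>                                       BPF_F_INGRESS);
>   }
>
>   char _license[] SEC("license") = "GPL";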
>
>>> This change introduces a kernel worker-based intermediate layer to provide
>>> automatic message corking for TCP_BPF. While this adds a slight latency
>>> overhead, it significantly improves overall throughput by reducing
>>> unnecessary wake-ups and sock lock contention.
>>
>> "Slight" for a +5% increase in latency is an understatement :-)
>> IDK about this being always on for every socket. For send-to-local
>> [1], sk_msg redirs can be viewed as a form of IPC, where latency
>> matters.
>> I do understand that you're trying to optimize for bulk-transfer
>> workloads, but please also consider request-response workloads.
>>
>> [1] https://github.com/jsitnicki/kubecon-2024-sockmap/blob/main/cheatsheet-sockmap-redirect.png
>>
>
> Totally understand that request-response workloads are also very
> important.
>
> Here are my thoughts:
>
> I had an idea before: when the user sets TCP_NODELAY, we could follow
> the original code path. However, I'm concerned about a specific
> scenario: if a user sends part of a message and then sets TCP_NODELAY
> to send another message, it's possible that messages handed to the
> kworker haven't reached the ingress_msg list yet (maybe due to delayed
> kworker scheduling), while the messages sent with TCP_NODELAY have
> already arrived. This could disrupt the ordering. Since there is no TCP
> packet formation or retransmission mechanism on this path, once the
> order is disrupted, it stays that way.
>
> As a result, I propose:
>
> - When the user sets TCP_NODELAY, introduce a wait (I haven't
>   determined the exact location yet; maybe in tcp_bpf_sendmsg) to
>   ensure all messages queued to the kworker have been delivered before
>   proceeding. Then follow the original path for transmission.
>
> - When the user switches TCP_NODELAY back off (returning to the
>   delayed/corked behavior), it's less of an issue. We can simply follow
>   our current code path.
>
> If the 5% degradation is not a blocker for this patchset, and the
> solution looks good to you, we will do it in the next patchset.

I'm all for reaping the benefits of batching, but I'm not thrilled about
having a backlog worker on the path. The one we have on the sk_skb path
has been a bottleneck:

1) There's no backpressure propagation, so you can have a backlog
   build-up. One thing to check is what happens if the receiver closes
   its window.

2) There's a scheduling latency. That's why the performance of splicing
   sockets with sockmap (ingress-to-egress) looks bleak [1].

So I have to dig deeper...

Have you considered and/or evaluated any alternative designs? For
instance, what stops us from having an auto-corking / coalescing
strategy on the sender side?

[1] https://blog.cloudflare.com/sockmap-tcp-splicing-of-the-future/
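Just to make that last question more concrete, here is a hand-wavy
sketch of the kind of check I have in mind on the sender side, before we
queue a new skmsg for the peer. All of the names below
(tcp_bpf_deliver_to_peer, peer_has_unread_data, coalesce_into_tail_msg,
queue_msg_and_wake) are made up for illustration; nothing like this
exists today:

  /* Pseudo-code, not against any tree. Idea: if the peer still has
   * unread data sitting in its ingress queue, append the new bytes to
   * the tail skmsg instead of queueing another skmsg and waking the
   * receiver again. A slow reader then naturally sees fewer, larger
   * messages, without a kworker in between.
   */
  static int tcp_bpf_deliver_to_peer(struct sock *peer, struct sk_msg *msg)
  {
          if (peer_has_unread_data(peer) &&
              coalesce_into_tail_msg(peer, msg))
                  return 0;       /* merged, no extra wakeup needed */

          /* Nothing pending on the peer: queue as today and wake it. */
          return queue_msg_and_wake(peer, msg);
  }

The appeal, at least on paper, is that it adds no extra latency when the
receiver is keeping up, which is the case that matters for
request-response workloads.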