Issue with delayed segments despite TCP_NODELAY

Dennis Baurichter <dennisba@xxxxxxxxxxxxxxxxxxxxx> · Mon, 26 May 2025 02:44:10 +0200

Hi,

I have a question about why the kernel stops sending further TCP segments 
after the handshake and the first 2 (or 3) payload segments have been sent. 
This seems to happen if the round trip time is "too high" (e.g., over 
9ms or 15ms, depending on system). Remaining segments are (apparently) 
only sent after an ACK has been received, even though TCP_NODELAY is set 
on the socket.

This is happening on a range of different kernels, from Arch Linux's 
6.14.7 (which should be rather close to mainline) down to Ubuntu 22.04's 
5.15.0-134-generic (admittedly somewhat "farther away" from mainline). I 
can test on an actual mainline kernel, too, if that helps.
I will describe our (probably somewhat uncommon) setup below. If you 
need any further information, I'll be happy to provide it.

My colleague and I have the following setup:
- Userland application connects to a server via TCP/IPv4 (complete TCP 
handshake is performed).
- An nftables rule is added to intercept packets of this connection and 
put them into a netfilter queue.
- Userland application writes data into this TCP socket.
  - The data is written in up to 4 chunks, which are intended to end up 
in individual TCP segments.
  - The socket has TCP_NODELAY set.
  - sysctl net.ipv4.tcp_autocorking=0
- The above nftables rule is removed.
- Userland application (a different part of it) retrieves all packets 
from the netfilter queue.
  - Here it may occur that e.g. only 2 out of 4 segments can be retrieved.
  - Reading from the netfilter queue is attempted until 5 timeouts of 
20ms each have occurred. Even much higher timeout values don't change the 
results, so it's not a race condition.
- Userland application performs some modifications on the intercepted 
segments and eventually issues verdict NF_ACCEPT.
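
For illustration, the sending side of this setup can be sketched roughly 
as follows in Python. This is a minimal sketch, not the actual application 
code: the function names open_nodelay_connection and send_chunks are made 
up for this example, and host, port, and chunk contents are left to the 
caller. It only shows TCP_NODELAY being set and each chunk being written 
with its own call.

```python
import socket
from typing import Iterable


def open_nodelay_connection(host: str, port: int) -> socket.socket:
    """Connect via TCP/IPv4 and set TCP_NODELAY (hypothetical helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect((host, port))
    # Disable Nagle's algorithm so small writes are not held back for coalescing.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def send_chunks(sock: socket.socket, chunks: Iterable[bytes]) -> None:
    """Write each chunk with a separate call (hypothetical helper).

    The intent is one TCP segment per chunk, but the kernel makes the
    final segmentation decision.
    """
    for chunk in chunks:
        sock.sendall(chunk)
```

One write per chunk does not by itself guarantee one segment per chunk, 
since TCP is a byte stream; the sketch only mirrors the steps listed above.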

We checked (via strace) that all payload chunks are successfully written 
to the socket, (via nlmon kernel module) that there are no errors in the 
netlink communication, and (via nft monitor) that indeed no further 
segments traverse the netfilter pipeline before the first two payload 
segments are actually sent on the wire.
We went through the entire list of TCP and IPv4 sysctls (testing several 
of them), tried loading and using different congestion control algorithm 
modules, toggled TCP_NODELAY off and on between writes to the socket 
(to trigger an explicit flush), and tried other things, but to no avail.
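
The TCP_NODELAY toggling can be sketched like this (Python, illustrative 
only; send_with_nodelay_toggle is a hypothetical helper name, not the 
actual code). It relies on the behaviour documented in tcp(7) that setting 
TCP_NODELAY forces an explicit flush of pending output.

```python
import socket
from typing import Iterable


def send_with_nodelay_toggle(sock: socket.socket, chunks: Iterable[bytes]) -> None:
    """Write chunks, clearing and re-setting TCP_NODELAY after each write.

    Hypothetical helper: re-setting the option asks the kernel to push
    any pending output after every chunk.
    """
    for chunk in chunks:
        sock.sendall(chunk)
        # Switch the option off, then on again; the "on" step requests a flush.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 0)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
```

The socket is left with TCP_NODELAY set once the loop has finished.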

After modifying our code, we can see that once the first segments have 
been NF_ACCEPT'ed, we can retrieve the remaining segments from the 
netfilter queue.
In Wireshark we see that this seems to be triggered by the incoming ACK 
segment from the server.

Notably, we can intercept all segments at once when testing this on 
localhost or in a LAN. However, on long-distance / 
higher-latency connections, we can only intercept 2 (sometimes 3) segments.

Testing on a LAN connection from an old laptop to a fast PC, we delayed 
packets on the latter one with variants of:
tc qdisc add dev eth0 root netem delay 15ms
We got the following mapping of delay / RTT to number of segments 
intercepted:
below 15ms -> all (up to 4) segments intercepted
15-16ms -> 2-3 segments
16-17ms -> 2 (sometimes 3) segments
over 20ms -> 2 segments (tested 20ms, 200ms, 500ms)
Testing in the other direction, from fast PC to old laptop (which now 
has the qdisc delay), we get similar results, just with lower round trip 
times (15ms becomes more like 8-9ms).

We would very much appreciate it if someone could help us on the 
following questions:
- Why are the remaining segments not sent out immediately, despite 
TCP_NODELAY?
- Is there a way to change this?
- If not, do you have better workarounds than injecting a fake ACK 
pretending to come "from the server" via a raw socket?
  Actually, we haven't tried this yet, but probably will soon.

Regards,
Dennis