Issue with delayed segments despite TCP_NODELAY

Hi,

I have a question on why the kernel stops sending further TCP segments after the handshake and first 2 (or 3) payload segments have been sent. This seems to happen if the round trip time is "too high" (e.g., over 9ms or 15ms, depending on system). Remaining segments are (apparently) only sent after an ACK has been received, even though TCP_NODELAY is set on the socket.

This is happening on a range of different kernels, from Arch Linux' 6.14.7 (which should be rather close to mainline) down to Ubuntu 22.04's 5.15.0-134-generic (admittedly somewhat "farther away" from mainline). I can test on an actual mainline kernel, too, if that helps. I will describe our (probably somewhat uncommon) setup below. If you need any further information, I'll be happy to provide it.

My colleague and I have the following setup:
- Userland application connects to a server via TCP/IPv4 (complete TCP handshake is performed).
- An nftables rule is added to intercept packets of this connection and put them into a netfilter queue.
- Userland application writes data into this TCP socket.
- The data is written in up to 4 chunks, which are intended to end up in individual TCP segments.
  - The socket has TCP_NODELAY set.
  - sysctl net.ipv4.tcp_autocorking=0
- The above nftables rule is removed.
- Userland application (a different part of it) retrieves all packets from the netfilter queue.
  - Here it may occur that e.g. only 2 out of 4 segments can be retrieved.
- Reading from the netfilter queue is attempted until 5 timeouts of 20ms each have occurred. Even much higher timeout values don't change the results, so it's not a race condition.
- Userland application performs some modifications on the intercepted segments and eventually issues verdict NF_ACCEPT.
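
For reference, the sending side of the steps above can be sketched roughly as follows. This is a minimal illustration, not our actual code: the helper names send_chunks and demo, the chunk contents, and the loopback listener standing in for the real server are all assumptions made for the example. Note that TCP is a byte stream, so one write per chunk only makes one segment per chunk likely, not guaranteed.

```python
import socket


def send_chunks(sock, chunks):
    """Disable Nagle's algorithm, then hand each chunk to the kernel separately."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    for chunk in chunks:
        # One send call per chunk; with TCP_NODELAY each write is normally
        # eligible to go out as its own segment.
        sock.sendall(chunk)


def demo():
    """Run send_chunks against a loopback listener (hypothetical stand-in server)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    listener.listen(1)

    client = socket.create_connection(listener.getsockname())
    server, _ = listener.accept()

    chunks = [b"chunk-1", b"chunk-2", b"chunk-3", b"chunk-4"]
    send_chunks(client, chunks)
    nodelay = client.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
    client.close()

    # Read until EOF so that everything the client wrote is collected.
    received = b""
    while True:
        data = server.recv(4096)
        if not data:
            break
        received += data
    server.close()
    listener.close()
    return nodelay, received
```

On loopback all four chunks arrive immediately, which matches what we see on localhost; the problem only shows up once the round trip time grows.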

We checked (via strace) that all payload chunks are successfully written to the socket, (via nlmon kernel module) that there are no errors in the netlink communication, and (via nft monitor) that indeed no further segments traverse the netfilter pipeline before the first two payload segments are actually sent on the wire. We dug through the entire list of TCP and IPv4 sysctls (testing several of them), tried loading and using different congestion algorithm modules, toggling TCP_NODELAY off and on between each write to the socket (to trigger an explicit flush), and other things, but to no avail.
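
One additional check that could complement the ones above is to ask the kernel how much of the written data it still considers unsent. The sketch below is Linux-specific and was not part of our tests: it uses the SIOCOUTQ and SIOCOUTQNSD ioctls (request numbers taken from linux/sockios.h), and the helper name queue_state is made up for the example. Segments that TCP has already handed to the IP layer should count as sent here even while they wait in the netfilter queue, so a non-zero "not sent" value would point at the TCP output path holding data back.

```python
import fcntl
import socket
import struct

# Linux ioctl request numbers from <linux/sockios.h>.
SIOCOUTQ = 0x5411     # bytes written but not yet acknowledged (sent or unsent)
SIOCOUTQNSD = 0x894B  # bytes written but not yet sent at all


def _ioctl_int(sock, request):
    """Issue an ioctl that fills in a single C int and return its value."""
    raw = fcntl.ioctl(sock.fileno(), request, struct.pack("i", 0))
    return struct.unpack("i", raw)[0]


def queue_state(sock):
    """Return (unacked_or_unsent_bytes, unsent_bytes) for a connected TCP socket."""
    # Query SIOCOUTQ first: without new writes both values can only shrink,
    # so the second value never exceeds the first.
    outq = _ioctl_int(sock, SIOCOUTQ)
    notsent = _ioctl_int(sock, SIOCOUTQNSD)
    return outq, notsent
```

Calling queue_state(sock) right after the last write, and again after the first segments have been accepted, would show whether the missing chunks are still sitting in the socket's send queue.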

Modifying our code, we can see that after NF_ACCEPT'ing the first segments, we can retrieve the remaining segments from netfilter queue. In Wireshark we see that this seems to be triggered by the incoming ACK segment from the server.

Notably, we can intercept all segments at once when testing this on localhost or within a LAN. However, on long-distance / higher-latency connections, we can only intercept 2 (sometimes 3) segments.

Testing on a LAN connection from an old laptop to a fast PC, we delayed packets on the latter one with variants of:
tc qdisc add dev eth0 root netem delay 15ms
We got the following mappings of delay / rtt to number of segments intercepted:
below 15ms -> all (up to 4) segments intercepted
15-16ms -> 2-3 segments
16-17ms -> 2 (sometimes 3) segments
over 20ms -> 2 segments (tested 20ms, 200ms, 500ms)
Testing in the other direction, from fast PC to old laptop (which now has the qdisc delay), we get similar results, just with lower round trip times (15ms becomes more like 8-9ms).

We would very much appreciate it if someone could help us with the following questions:
- Why are the remaining segments not sent out immediately, despite TCP_NODELAY?
- Is there a way to change this?
- If not, do you have better workarounds than injecting a fake ACK pretending to come "from the server" via a raw socket?
  Actually, we haven't tried this yet, but probably will soon.
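
To make the workaround idea concrete, here is an untested sketch of how the TCP header of such a bare ACK could be assembled. It only builds and checksums the 20-byte header; actually sending it through a raw socket (which needs root) is left out. The function names, addresses, ports and sequence numbers are hypothetical, and a real injected ACK would probably also have to match whatever options the connection negotiated (e.g. timestamps).

```python
import socket
import struct


def inet_checksum(data: bytes) -> int:
    """Internet checksum: one's-complement sum of 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def pseudo_header(src_ip: str, dst_ip: str, tcp_len: int) -> bytes:
    """IPv4 pseudo-header that is covered by the TCP checksum."""
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!BBH", 0, socket.IPPROTO_TCP, tcp_len))


def build_ack_header(src_ip, dst_ip, src_port, dst_port, seq, ack, window=65535):
    """Build a 20-byte TCP header with only the ACK flag set and no options."""
    data_offset = 5 << 4  # header length of 5 32-bit words, no options
    flags = 0x10          # ACK
    header = struct.pack("!HHIIBBHHH", src_port, dst_port, seq, ack,
                         data_offset, flags, window, 0, 0)
    csum = inet_checksum(pseudo_header(src_ip, dst_ip, len(header)) + header)
    # The checksum field occupies bytes 16..17 of the TCP header.
    return header[:16] + struct.pack("!H", csum) + header[18:]
```

Here seq would be the server's next sequence number and ack the sequence number following the last payload byte we want acknowledged, both of which can be read from the intercepted segments.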

Regards,
Dennis



