On 9/9/25 9:59 AM, Jordan Rife wrote:
MOTIVATION
==========
In Cilium we use SOCK_ADDR hooks (cgroup/connect4, cgroup/sendmsg4, ...)
to do socket-level load balancing, translating service VIPs to real
backend IPs. This is more efficient than per-packet service VIP
translation, but there's a consequence: UDP sockets connected to a stale
backend will keep trying to talk to it once it's gone instead of their
traffic being redirected to an active backend. To bridge this gap, we
forcefully terminate such sockets from the control plane, forcing
applications to recreate these sockets and start talking to an active
backend. In the past, we've used netlink + sock_diag for this purpose,
but have started using BPF socket iterators coupled with
bpf_sock_destroy() in an effort to do most dataplane management in BPF
and improve the efficiency of socket termination. bpf_sock_destroy() was
introduced by Aditi for this very purpose in [1]. More recently, this
kind of forceful socket destruction was extended to cover TCP sockets as
well, so that they more quickly receive a reset when the backend they're
connected to goes away instead of relying on timeouts [2].

When a backend goes away, the process to destroy all sockets connected
to that backend looks roughly like this:

    for each network namespace:
        enter the network namespace
        create a socket iterator
        for each socket in the network namespace:
            run the iterator BPF program:
                if sk was connected to the backend:
                    bpf_sock_destroy(sk)

Clearly, this creates a lot of repeated work, and it became evident in
scale tests involving many sockets or frequent service backend churn
that this approach won't scale well. For a simple illustration, I set up
a scenario with one hundred different workloads, each running in its own
network namespace, and observed the time it took to iterate through all
namespaces and sockets to destroy a handful of connected sockets in
those namespaces.
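For reference, the per-socket iterator program in the loop above might
look roughly like the following. This is a minimal sketch, not code from
the series: it assumes IPv4 UDP, uses the existing bpf_sock_destroy()
kfunc from an "iter/udp" program, and the backend_addr/backend_port
rodata variables are hypothetical placeholders that user space would
fill in before creating the iterator.

```
/* Minimal sketch of the per-socket iterator program described above.
 * backend_addr/backend_port are hypothetical and would be set by user
 * space before the iterator link is created.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char _license[] SEC("license") = "GPL";

const volatile __be32 backend_addr = 0;	/* stale backend IP, network byte order */
const volatile __be16 backend_port = 0;	/* stale backend port, network byte order */

extern int bpf_sock_destroy(struct sock_common *sk) __ksym;

SEC("iter/udp")
int destroy_stale_udp(struct bpf_iter__udp *ctx)
{
	struct udp_sock *udp_sk = ctx->udp_sk;
	struct sock *sk = (struct sock *)udp_sk;

	if (!sk)
		return 0;

	/* Every UDP socket in the current netns is visited; skip the
	 * ones not connected to the stale backend. */
	if (sk->__sk_common.skc_daddr != backend_addr ||
	    sk->__sk_common.skc_dport != backend_port)
		return 0;

	bpf_sock_destroy((struct sock_common *)sk);
	return 0;
}
```

The scaling problem described above comes from the fact that this
program has to run once for every socket in the hash table, whether or
not it matches.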
How many sockets were destroyed?
I repeated this five times, each time increasing the number of sockets
in the system's UDP hash by 10x using a script that creates lots of
connected sockets.

+---------+----------------+
| Sockets | Iteration Time |
+---------+----------------+
| 100     | 6.35ms         |
| 1000    | 4.03ms         |
| 10000   | 20.0ms         |
| 100000  | 103ms          |
| 1000000 | 9.38s          |
+---------+----------------+
Namespaces = 100
[CPU] AMD Ryzen 9 9900X

Iteration takes longer as more sockets are added. All the while, CPU
utilization is high, with `perf top` showing `bpf_iter_udp_batch` at the
top:

    70.58%  [kernel]  [k] bpf_iter_udp_batch

Although this example uses UDP sockets, a similar trend should be
present with TCP sockets and iterators as well. Even low socket counts
and sub-second iteration times can be problematic in clusters with high
churn or where a burst of backend deletions occurs.
For TCP, is it possible to abort the connection in BPF_SOCK_OPS_RTO_CB to stop the retry? RTO is not a per-packet event.
Does it have a lot of connected UDP sockets left to iterate in production?
This can be slightly improved by doing some extra bookkeeping that lets
us skip certain namespaces that we know don't contain sockets connected
to the backend, but in general we're boxed in by three limitations:

1. BPF socket iterators scan through every socket in the system's UDP or
   TCP socket hash tables to find those belonging to the current network
   namespace, since by default all namespaces share the same set of
   global tables. As the number of sockets in a system grows, more time
   will be spent filtering out unrelated sockets. You could use
   udp_child_hash_entries and tcp_child_ehash_entries to give each
I assume the sockets that need to be destroyed could be in different child hashtables (i.e. in different netns) even when child_[e]hash is used?
   namespace its own table and avoid these noisy neighbor effects, but
   managing this automatically for each workload is tricky, uses more
   memory than necessary, and still doesn't avoid unnecessary filtering,
   because...

2. ...it's necessary to visit all sockets in a network namespace to find
   the one(s) you're looking for, since there's no predictable order in
   the system hash tables. Similar to the last point, this creates
   unnecessary work.

3. bpf_sock_destroy() only works from BPF socket iterator contexts
   currently.

OVERVIEW
========
It would be ideal if we could visit only the set of sockets we're
interested in without lots of wasteful filtering. This patch series
seeks to enable this with the following changes:

* Making bpf_sock_destroy() work with BPF_MAP_TYPE_SOCKHASH map
  iterators.

* Enabling control over the bucketing behavior of BPF_MAP_TYPE_SOCKHASH
  to ensure that all sockets sharing the same key prefix are grouped in
  the same bucket.

* Adding a key prefix filter to BPF_MAP_TYPE_SOCKHASH map iterators that
  limits iteration to only the bucket containing keys with the given
  prefix, and therefore, a single bucket.

* A new sockops event, BPF_SOCK_OPS_UDP_CONNECTED_CB, that allows us to
  automatically insert connected UDP sockets into a BPF_MAP_TYPE_SOCKHASH
  in the same way BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB does for
  connect()ed TCP sockets.

This gives us the means to maintain a socket index where we can
efficiently retrieve and destroy the set of sockets sharing some common
property, in our case the backend address, without any additional
iteration or filtering. The basic idea looks like this:

* `map_extra` may be used to specify the number of bytes from the key
  that a BPF_MAP_TYPE_SOCKHASH uses to determine a socket's hash bucket.

  ```
  struct sock_hash_key {
          __u32 bucket_key;
          __u64 cookie;
  } __packed;

  struct {
          __uint(type, BPF_MAP_TYPE_SOCKHASH);
          __uint(max_entries, 16);
          __ulong(map_extra, offsetof(struct sock_hash_key, cookie));
          __type(key, struct sock_hash_key);
          __type(value, __u64);
  } sock_hash SEC(".maps");
  ```

  In this example, all keys sharing the same `bucket_key` would be
  bucketed together. In our case, `bucket_key` would be replaced with a
  backend ID or (destination address, port) tuple.
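Under this proposal, the destroy path could then be reduced to iterating
a single sockhash bucket. The following is a rough sketch, not code from
the series; it assumes the proposed ability to call bpf_sock_destroy()
from a sockhash ("iter/sockmap") iterator, which today is only allowed
from TCP/UDP socket iterators.

```
/* Sketch of the destroy path: a sockhash iterator created with
 * key_prefix set to the stale backend's bucket_key. Assumes the series'
 * proposed support for bpf_sock_destroy() in sockhash iterators.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char _license[] SEC("license") = "GPL";

extern int bpf_sock_destroy(struct sock_common *sk) __ksym;

SEC("iter/sockmap")
int destroy_backend_bucket(struct bpf_iter__sockmap *ctx)
{
	struct sock *sk = ctx->sk;

	if (!sk)
		return 0;

	/* The key_prefix filter (described below) limits iteration to
	 * the stale backend's bucket, so everything visited here can be
	 * destroyed outright. */
	bpf_sock_destroy((struct sock_common *)sk);
	return 0;
}
```

Because iteration is restricted to the stale backend's bucket, the
program needs no per-socket filtering at all.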
Before diving into the discussion of whether it is a good idea to add
another key to a bpf hashmap, it seems that a hashmap does not actually
fit your use case. A different data structure (or at least a different
way of grouping sk) is needed.

Have you considered using bpf_list_head/bpf_rb_root/bpf_arena?
Potentially, the sk could be stored as a __kptr, but I don't think that
is supported yet, aside from considerations about what happens when the
sk is closed, etc. However, it could store the numeric ip/port and then
use the bpf_sk_lookup helper, which can take a netns_id. Iteration could
potentially be done in a sleepable SEC("syscall") program run via
bpf_prog_test_run(), where lock_sock is allowed.

TCP sockops has a state change callback (e.g. for tracking TCP_CLOSE),
but connected UDP does not have one now.
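For illustration only, here is a sketch of the bookkeeping side of this
alternative: storing the numeric address/port plus the socket cookie in
a bpf_list_head instead of grouping sk in a hash map. It follows the
bpf_experimental.h conventions from the BPF selftests; the struct names,
the SEC("syscall") entry points, and the availability of the graph
kfuncs to that program type are assumptions rather than a worked-out
design.

```
/* Sketch only: per-backend bookkeeping in a bpf_list_head, storing
 * numeric ip/port/cookie rather than sk pointers. Uses the
 * bpf_experimental.h conventions from the BPF selftests.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include "bpf_experimental.h"	/* bpf_obj_new, bpf_list_*, __contains */

char _license[] SEC("license") = "GPL";

#ifndef container_of
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#endif

struct backend_entry {
	struct bpf_list_node node;
	__u32 daddr;		/* backend IPv4 address, network byte order */
	__u16 dport;		/* backend port, network byte order */
	__u64 cookie;		/* socket cookie of the connected socket */
};

/* Lock and list head must live in the same datasec (selftests style). */
#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
private(BACKENDS) struct bpf_spin_lock backends_lock;
private(BACKENDS) struct bpf_list_head backends __contains(backend_entry, node);

struct entry_args {
	__u32 daddr;
	__u16 dport;
	__u64 cookie;
};

SEC("syscall")
int record_entry(struct entry_args *args)
{
	struct backend_entry *e;

	e = bpf_obj_new(typeof(*e));
	if (!e)
		return 1;
	e->daddr = args->daddr;
	e->dport = args->dport;
	e->cookie = args->cookie;

	bpf_spin_lock(&backends_lock);
	bpf_list_push_back(&backends, &e->node);
	bpf_spin_unlock(&backends_lock);
	return 0;
}

SEC("syscall")
int pop_entry(struct entry_args *args)
{
	struct backend_entry *e;
	struct bpf_list_node *n;

	bpf_spin_lock(&backends_lock);
	n = bpf_list_pop_front(&backends);
	bpf_spin_unlock(&backends_lock);
	if (!n)
		return 1;

	e = container_of(n, struct backend_entry, node);
	/* Hand the stored tuple back to the caller; the actual socket
	 * teardown is out of scope for this sketch. */
	args->daddr = e->daddr;
	args->dport = e->dport;
	args->cookie = e->cookie;
	bpf_obj_drop(e);
	return 0;
}
```

Socket teardown itself is left out: the drained (ip, port, cookie)
tuples would still need something like bpf_sk_lookup_*() plus a destroy
mechanism reachable from the program doing the iteration, which is part
of the open question above.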
* `key_prefix` may be used to parametrize a BPF_MAP_TYPE_SOCKHASH map
  iterator so that it only visits the bucket matching that key prefix.

  ```
  union bpf_iter_link_info {
          struct {
                  __u32   map_fd;
                  union {
                          /* Parameters for socket hash iterators. */
                          struct {
                                  __aligned_u64   key_prefix;
                                  __u32           key_prefix_len;
                          } sock_hash;
                  };
          } map;
          ...
  };
  ```

* The contents of the BPF_MAP_TYPE_SOCKHASH are automatically managed
  using a sockops program that inserts connected TCP and UDP sockets
  into the map.
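A sketch of what that sockops side might look like, assuming the
sock_hash map definition shown earlier and the proposed
BPF_SOCK_OPS_UDP_CONNECTED_CB event (BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB
already exists for TCP). The bucket_key derivation is a placeholder; a
real deployment would presumably look up a backend ID from the load
balancer's service maps.

```
/* Sketch of the sockops program that keeps the sockhash populated.
 * BPF_SOCK_OPS_UDP_CONNECTED_CB is the event proposed by this series;
 * the bucket_key derivation below is a placeholder.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char _license[] SEC("license") = "GPL";

struct sock_hash_key {
	__u32 bucket_key;	/* backend identifier */
	__u64 cookie;		/* socket cookie, keeps keys unique */
} __attribute__((packed));

struct {
	__uint(type, BPF_MAP_TYPE_SOCKHASH);
	__uint(max_entries, 65536);
	__ulong(map_extra, offsetof(struct sock_hash_key, cookie));
	__type(key, struct sock_hash_key);
	__type(value, __u64);
} sock_hash SEC(".maps");

SEC("sockops")
int track_connected(struct bpf_sock_ops *skops)
{
	struct sock_hash_key key = {};

	switch (skops->op) {
	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:	/* connect()ed TCP */
	case BPF_SOCK_OPS_UDP_CONNECTED_CB:		/* proposed: connect()ed UDP */
		/* Placeholder: derive the bucket key from the destination
		 * address so that all sockets connected to the same
		 * backend share a bucket. */
		key.bucket_key = skops->remote_ip4;
		key.cookie = bpf_get_socket_cookie(skops);
		bpf_sock_hash_update(skops, &sock_hash, &key, BPF_NOEXIST);
		break;
	}
	return 1;
}
```

The program would be attached with BPF_CGROUP_SOCK_OPS to the relevant
cgroup(s), so each connect()ed socket lands in the bucket for its
backend and can later be destroyed via an iterator filtered on that
backend's key prefix.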