MOTIVATION
==========

In Cilium we use SOCK_ADDR hooks (cgroup/connect4, cgroup/sendmsg4, ...)
to do socket-level load balancing, translating service VIPs to real
backend IPs. This is more efficient than per-packet service VIP
translation, but there's a consequence: UDP sockets connected to a stale
backend will keep trying to talk to it once it's gone instead of their
traffic being redirected to an active backend. To bridge this gap, we
forcefully terminate such sockets from the control plane, forcing
applications to recreate these sockets and start talking to an active
backend.

In the past, we've used netlink + sock_diag for this purpose, but have
started using BPF socket iterators coupled with bpf_sock_destroy() in an
effort to do most dataplane management in BPF and improve the efficiency
of socket termination. bpf_sock_destroy() was introduced by Aditi for
this very purpose in [1]. More recently, this kind of forceful socket
destruction was extended to cover TCP sockets as well so that they more
quickly receive a reset when the backend they're connected to goes away
instead of relying on timeouts [2].

When a backend goes away, the process to destroy all sockets connected
to that backend looks roughly like this (a concrete BPF sketch follows
the list of limitations below):

    for each network namespace:
        enter the network namespace
        create a socket iterator
        for each socket in the network namespace:
            run the iterator BPF program:
                if sk was connected to the backend:
                    bpf_sock_destroy(sk)

Clearly, this creates a lot of repeated work, and it became evident in
scale tests involving many sockets or frequent service backend churn
that this approach won't scale well. For a simple illustration, I set up
a scenario with one hundred workloads each running in their own network
namespace and measured the time it took to iterate through all
namespaces and sockets to destroy a handful of connected sockets in
those namespaces. I repeated this five times, each time increasing the
number of sockets in the system's UDP hash by 10x using a script that
creates lots of connected sockets.

+---------+----------------+
| Sockets | Iteration Time |
+---------+----------------+
| 100     | 6.35ms         |
| 1000    | 4.03ms         |
| 10000   | 20.0ms         |
| 100000  | 103ms          |
| 1000000 | 9.38s          |
+---------+----------------+
Namespaces = 100
[CPU] AMD Ryzen 9 9900X

Iteration takes longer as more sockets are added, and all the while CPU
utilization is high, with `perf top` showing `bpf_iter_udp_batch` at the
top:

  70.58% [kernel] [k] bpf_iter_udp_batch

Although this example uses UDP sockets, a similar trend should be
present with TCP sockets and iterators as well. Even low numbers of
sockets and sub-second times can be problematic in clusters with high
churn or where a burst of backend deletions occurs. This can be slightly
improved by doing some extra bookkeeping that lets us skip certain
namespaces that we know don't contain sockets connected to the backend,
but in general we're boxed in by three limitations:

1. BPF socket iterators scan through every socket in the system's UDP
   or TCP socket hash tables to find those belonging to the current
   network namespace, since by default all namespaces share the same
   set of global tables. As the number of sockets in a system grows,
   more time will be spent filtering out unrelated sockets. You could
   use udp_child_hash_entries and tcp_child_ehash_entries to give each
   namespace its own table and avoid these noisy neighbor effects, but
   managing this automatically for each workload is tricky, uses more
   memory than necessary, and still doesn't avoid unnecessary
   filtering, because...

2. ...it's necessary to visit all sockets in a network namespace to
   find the one(s) you're looking for, since there's no predictable
   order in the system hash tables. Similar to the last point, this
   creates unnecessary work.

3. bpf_sock_destroy() only works from BPF socket iterator contexts
   currently.
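For reference, here's a minimal sketch of the kind of per-namespace
iterator program the pseudocode above runs today. `backend_addr` and
`backend_port` are hypothetical globals set by the control plane, and a
real program would also match IPv6 destinations:

```
/* Sketch of today's destroy pass: a UDP socket iterator that destroys
 * every socket connected to one backend. The iterator walks the whole
 * system UDP hash, so most sockets are filtered out below, which is
 * where the wasted work comes from.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

__u32 backend_addr; /* hypothetical; network byte order */
__u16 backend_port; /* hypothetical; network byte order */

extern int bpf_sock_destroy(struct sock_common *sk) __ksym;

SEC("iter/udp")
int destroy_backend_udp(struct bpf_iter__udp *ctx)
{
        struct udp_sock *udp_sk = ctx->udp_sk;
        struct sock *sk;

        if (!udp_sk)
                return 0;
        sk = (struct sock *)udp_sk;

        /* Skip sockets not connected to the backend. */
        if (sk->__sk_common.skc_daddr != backend_addr ||
            sk->__sk_common.skc_dport != backend_port)
                return 0;

        bpf_sock_destroy((struct sock_common *)sk);
        return 0;
}

char _license[] SEC("license") = "GPL";
```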
OVERVIEW
========

It would be ideal if we could visit only the set of sockets we're
interested in without lots of wasteful filtering. This patch series
seeks to enable this with the following changes:

* Making bpf_sock_destroy() work with BPF_MAP_TYPE_SOCKHASH map
  iterators.

* Enabling control over the bucketing behavior of BPF_MAP_TYPE_SOCKHASH
  to ensure that all sockets sharing the same key prefix are grouped in
  the same bucket.

* Adding a key prefix filter to BPF_MAP_TYPE_SOCKHASH map iterators
  that limits iteration to only the bucket containing keys with the
  given prefix, and therefore, a single bucket.

* A new sockops event, BPF_SOCK_OPS_UDP_CONNECTED_CB, that allows us to
  automatically insert connected UDP sockets into a
  BPF_MAP_TYPE_SOCKHASH in the same way
  BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB does for connect()ed TCP sockets.

This gives us the means to maintain a socket index where we can
efficiently retrieve and destroy the set of sockets sharing some common
property, in our case the backend address, without any additional
iteration or filtering. The basic idea looks like this:

* `map_extra` may be used to specify the number of bytes from the key
  that a BPF_MAP_TYPE_SOCKHASH uses to determine a socket's hash
  bucket.

```
struct sock_hash_key {
        __u32 bucket_key;
        __u64 cookie;
} __packed;

struct {
        __uint(type, BPF_MAP_TYPE_SOCKHASH);
        __uint(max_entries, 16);
        __ulong(map_extra, offsetof(struct sock_hash_key, cookie));
        __type(key, struct sock_hash_key);
        __type(value, __u64);
} sock_hash SEC(".maps");
```

  In this example, all keys sharing the same `bucket_key` would be
  bucketed together. In our case, `bucket_key` would be replaced with a
  backend ID or (destination address, port) tuple.

* `key_prefix` may be used to parametrize a BPF_MAP_TYPE_SOCKHASH map
  iterator so that it only visits the bucket matching that key prefix
  (see the userspace sketch after this list).

```
union bpf_iter_link_info {
        struct {
                __u32 map_fd;
                union {
                        /* Parameters for socket hash iterators. */
                        struct {
                                __aligned_u64 key_prefix;
                                __u32 key_prefix_len;
                        } sock_hash;
                };
        } map;
        ...
};
```

* The contents of the BPF_MAP_TYPE_SOCKHASH are automatically managed
  using a sockops program that inserts connected TCP and UDP sockets
  into the map (see the second sketch after this list).
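To make the key prefix filter concrete, here's a minimal userspace
sketch of how a control plane might attach a prefix-filtered iterator
with libbpf. It assumes `key_prefix` points to a user buffer holding
the first `key_prefix_len` bytes of a key (matching the proposed UAPI
above), reuses `struct sock_hash_key` from the map example, and elides
error handling; `destroy_backend_sockets()` is a hypothetical helper:

```
#include <linux/types.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

struct sock_hash_key {
        __u32 bucket_key; /* backend ID; selects the bucket */
        __u64 cookie;     /* socket cookie; unique per entry */
} __attribute__((packed));

/* prog is the iterator program (e.g. one that calls bpf_sock_destroy()
 * on each socket it visits); map is the sock_hash.
 */
static int destroy_backend_sockets(struct bpf_program *prog,
                                   struct bpf_map *map, __u32 backend_id)
{
        struct sock_hash_key prefix = { .bucket_key = backend_id };
        DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
        union bpf_iter_link_info linfo = {};
        struct bpf_link *link;
        char buf[64];
        int iter_fd;

        linfo.map.map_fd = bpf_map__fd(map);
        /* Proposed fields from this series. */
        linfo.map.sock_hash.key_prefix = (__u64)(uintptr_t)&prefix;
        linfo.map.sock_hash.key_prefix_len =
                offsetof(struct sock_hash_key, cookie);
        opts.link_info = &linfo;
        opts.link_info_len = sizeof(linfo);

        link = bpf_program__attach_iter(prog, &opts);
        iter_fd = bpf_iter_create(bpf_link__fd(link));

        /* Only sockets in the matching bucket are visited. */
        while (read(iter_fd, buf, sizeof(buf)) > 0)
                ;

        close(iter_fd);
        bpf_link__destroy(link);
        return 0;
}
```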
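And for the last point, a minimal sketch of the sockops program that
keeps the map up to date, assuming the BPF_SOCK_OPS_UDP_CONNECTED_CB
event proposed in this series and reusing `sock_hash` and
`struct sock_hash_key` from the map example above; `lookup_backend_id()`
is a hypothetical stand-in for a real service lookup:

```
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

static __always_inline __u32 lookup_backend_id(__u32 ip4, __u32 port)
{
        /* Placeholder: a real implementation would consult the load
         * balancer's service/backend maps.
         */
        return ip4 ^ port;
}

SEC("sockops")
int index_connected_sockets(struct bpf_sock_ops *skops)
{
        struct sock_hash_key key = {};

        switch (skops->op) {
        case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB: /* TCP connect() done */
        case BPF_SOCK_OPS_UDP_CONNECTED_CB:      /* new in this series */
                /* All sockets connected to the same backend land in the
                 * same bucket; the socket cookie keeps each key unique.
                 */
                key.bucket_key = lookup_backend_id(skops->remote_ip4,
                                                   skops->remote_port);
                key.cookie = bpf_get_socket_cookie(skops);
                bpf_sock_hash_update(skops, &sock_hash, &key, BPF_NOEXIST);
                break;
        }
        return 1;
}
```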
SERIES STRUCTURE
================

* Part one makes bpf_sock_destroy() usable from BPF_MAP_TYPE_SOCKHASH
  and BPF_MAP_TYPE_SOCKMAP map iterators, culminating in the expansion
  of the existing bpf_sock_destroy() selftests to cover its use from
  these new contexts. I was unsure about whether or not to include the
  changes to BPF_MAP_TYPE_SOCKMAP, since the use case I describe above
  would only make use of BPF_MAP_TYPE_SOCKHASH, but it felt strange to
  make one compatible with bpf_sock_destroy() and not the other. So,
  for now I've included it in this series.

* Part two enables key prefix-based bucketing control and map iterator
  filtering for BPF_MAP_TYPE_SOCKHASH, making it possible to
  efficiently iterate through and destroy a set of sockets whose keys
  share a common prefix.

* Part three introduces the BPF_SOCK_OPS_UDP_CONNECTED_CB sockops
  callback, a new event that fires at the end of connect() for UDP
  sockets. Again, I was on the fence about whether or not to include
  this in the series, since it feels like it could be its own
  standalone change, but ultimately, to make the other changes useful,
  there needs to be a way to automatically manage the contents of the
  BPF_MAP_TYPE_SOCKHASH for connect()ed UDP sockets in the same way
  that BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB works for TCP sockets. I've
  added this to the end of the series along with a set of tests that
  show the desired end-to-end flow from insertion of TCP and UDP
  sockets into the map to iteration and destruction using a key prefix
  filter.

COMPARISON
==========

Using the same experiment as before...

+---------+----------------------+----------------------+
| Sockets | Iteration Time (Old) | Iteration Time (New) |
+---------+----------------------+----------------------+
| 100     | 6.35ms               | 54.2us               |
| 1000    | 4.03ms               | 43.7us               |
| 10000   | 20.0ms               | 44.1us               |
| 100000  | 103ms                | 40.8us               |
| 1000000 | 9.38s                | 47.6us               |
+---------+----------------------+----------------------+
Namespaces = 100
[CPU] AMD Ryzen 9 9900X

...iteration time remained constant regardless of the number of sockets
that exist, and even compared with the best-case time (4.03ms) for
destruction using socket iterators, the map-based approach was ~98%
faster. As a secondary benefit, this also simplifies the control plane
and iterator program logic, as we don't need to worry about iterating
through different namespaces or filtering.

+---------+--------------------+
| Sockets | Memory Utilization |
+---------+--------------------+
| 100     | 16.0MiB            |
| 1000    | 16.0MiB            |
| 10000   | 16.7MiB            |
| 100000  | 22.7MiB            |
| 1000000 | 84.7MiB            |
+---------+--------------------+
key_size = 16B
max_entries = 1 << 20

Looking at memory, a socket hash with `1 << 20` max_entries uses
~16.0MiB empty and ~84.7MiB with 1000000 entries (almost full), which
seems like a decent compute/memory tradeoff.

One possible drawback/concern of this approach is that, if you're not
careful, low key prefix cardinality can lead to large buckets and slow
down connect() calls. I encountered this situation while running these
tests when I accidentally created thousands of UDP sockets connected to
the same (addr, port) tuple and saw connect() calls take longer and
longer, since all sockets were added to the same bucket. In practice,
I'm not sure how much of a concern this is, but it's certainly possible
to slow down connect() for certain endpoint addresses this way. Further
partitioning could be done by adding new dimensions to the keys, e.g. a
network namespace cookie, to better mitigate and isolate these effects
if needed (see the sketch below).
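For example, a partitioned key might look like the following sketch,
which is a hypothetical extension of the earlier map example (the
namespace cookie would be fetched with bpf_get_netns_cookie() at
insertion time):

```
/* Hypothetical key layout that partitions buckets by network namespace
 * in addition to backend, so one noisy namespace can't inflate the
 * bucket (and connect() latency) for everyone else.
 */
struct sock_hash_key {
        __u64 netns_cookie; /* bpf_get_netns_cookie() at insert time */
        __u32 backend_id;
        __u64 cookie;       /* socket cookie; keeps keys unique */
} __packed;

/* Bucket on (netns_cookie, backend_id); iterators then select a single
 * bucket with a key prefix of offsetof(struct sock_hash_key, cookie)
 * bytes.
 */
struct {
        __uint(type, BPF_MAP_TYPE_SOCKHASH);
        __uint(max_entries, 1 << 20);
        __ulong(map_extra, offsetof(struct sock_hash_key, cookie));
        __type(key, struct sock_hash_key);
        __type(value, __u64);
} sock_hash SEC(".maps");
```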
Thanks for any feedback!

Jordan

[1]: https://lore.kernel.org/bpf/20230519225157.760788-1-aditi.ghag@xxxxxxxxxxxxx/
[2]: https://github.com/cilium/cilium/pull/40304

Jordan Rife (14):
  bpf: Use reference counting for struct bpf_shtab_elem
  bpf: Hold socket lock in socket hash iterator
  bpf: Hold socket lock in socket map iterator
  bpf: Mark sk as PTR_TRUSTED in sockmap iterator context
  selftests/bpf: Test bpf_sock_destroy() with sockmap iterators
  bpf: Enable precise bucketing control for socket hashes
  bpf: Support key prefix filtering for socket hash iterators
  selftests/bpf: Fix off by one error in remove_all_established
  selftests/bpf: Test socket hash iterator resume scenarios
  selftests/bpf: Socket map + sockops insert and destroy
  bpf: Introduce BPF_SOCK_OPS_UDP_CONNECTED_CB
  bpf: Allow bpf_sock_(map|hash)_update from BPF_SOCK_OPS_UDP_CONNECTED_CB
  selftests/bpf: Extend insert and destroy tests for UDP sockets
  bpf, doc: Document map_extra and key prefix filtering for socket hash

 Documentation/bpf/bpf_iterators.rst           |  11 +
 Documentation/bpf/map_sockmap.rst             |   6 +
 include/linux/bpf.h                           |   4 +
 include/net/udp.h                             |  43 ++
 include/uapi/linux/bpf.h                      |  10 +
 kernel/bpf/syscall.c                          |   1 +
 net/core/sock_map.c                           | 326 ++++++++++++---
 net/ipv4/udp.c                                |   1 +
 net/ipv6/udp.c                                |   1 +
 tools/include/uapi/linux/bpf.h                |  10 +
 .../selftests/bpf/prog_tests/sock_destroy.c   | 119 +++++-
 .../bpf/prog_tests/sock_iter_batch.c          | 121 +++++-
 .../selftests/bpf/prog_tests/sockmap_basic.c  | 387 ++++++++++++++++++
 .../selftests/bpf/progs/bpf_iter_sockmap.c    |  14 +
 .../selftests/bpf/progs/sock_destroy_prog.c   |  63 +++
 .../selftests/bpf/progs/sock_iter_batch.c     |  31 ++
 .../selftests/bpf/progs/test_sockmap_update.c |  44 ++
 17 files changed, 1102 insertions(+), 90 deletions(-)

--
2.43.0