TCP socket iterators use iter->offset to track progress through a bucket, which is a measure of the number of matching sockets from the current bucket that have been seen or processed by the iterator. On subsequent iterations, if the current bucket has unprocessed items, we skip at least iter->offset matching items in the bucket before adding any remaining items to the next batch. However, iter->offset isn't always an accurate measure of "things already seen" when the underlying bucket changes between reads which can lead to repeated or skipped sockets. Instead, this series remembers the cookies of the sockets we haven't seen yet in the current bucket and resumes from the first cookie in that list that we can find on the next iteration. This is a continuation of the work started in [1]. This series largely replicates the patterns applied to UDP socket iterators, applying them instead to TCP socket iterators. CHANGES ======= v2 -> v3: * Unroll the loop inside bpf_iter_tcp_batch to make the logic easier to follow in patch two ("bpf: tcp: Make sure iter->batch always contains a full bucket snapshot"). This gets rid of the `resizes` variable from v2 and eliminates the extra conditional that checks how many batch resize attempts have occurred so far (Stanislav). Note: This changes the behavior slightly. Before, in the case that the second call to tcp_seek_last_pos (and later bpf_iter_tcp_resume) advances to a new bucket, which may happen if the current bucket is emptied after releasing its lock, the `resizes` "budget" would be reset, the net effect being that we would try a batch resize with GFP_USER at most once per bucket. Now, we try to resize the batch with GFP_USER at most once per call, so it makes it slightly more likely that we hit the GFP_NOWAIT scenario. However, this edge case should be rare in practice anyway, and the new behavior is more or less consistent with the original retry logic, so avoid the loop and prefer code clarity. * Move the call to bpf_iter_tcp_put_batch out of bpf_iter_tcp_realloc_batch and call it directly before invoking bpf_iter_tcp_realloc_batch with GFP_USER inside bpf_iter_tcp_batch. /Don't/ call it before invoking bpf_iter_tcp_realloc_batch the second time while we hold the lock with GFP_NOWAIT. This avoids a conditional inside bpf_iter_tcp_realloc_batch from v2 that only calls bpf_iter_tcp_put_batch if flags != GFP_NOWAIT and is a bit more explicit (Stanislav). * Adjust patch five ("bpf: tcp: Avoid socket skips and repeats during iteration") to fit with the new logic in patch two. v1 -> v2: * In patch five ("bpf: tcp: Avoid socket skips and repeats during iteration"), remove unnecessary bucket bounds checks in bpf_iter_tcp_resume. In either case, if st->bucket is outside the current table's range then bpf_iter_tcp_resume_* calls *_get_first which immediately returns NULL anyway and the logic will fall through. (Martin) * Add a check at the top of bpf_iter_tcp_resume_listening and bpf_iter_tcp_resume_established to see if we're done with the current bucket and advance it immediately instead of wasting time finding the first matching socket in that bucket with (listening|established)_get_first. In v1, we originally discussed adding logic to advance the bucket in bpf_iter_tcp_seq_next and bpf_iter_tcp_seq_stop, but after trying this the logic seemed harder to track. Overall, keeping everything inside bpf_iter_tcp_resume_* seemed a bit clearer. (Martin) * Instead of using a timeout in the last patch ("selftests/bpf: Add tests for bucket resume logic in established sockets") to wait for sockets to leave the ehash table after calling close(), use bpf_sock_destroy to deterministically destroy and remove them. This introduces one more patch ("selftests/bpf: Create iter_tcp_destroy test program") to create the iterator program that destroys a selected socket. Drive this through a destroy() function in the last patch which, just like close(), accepts a socket file descriptor. (Martin) * Introduce one more patch ("selftests/bpf: Allow for iteration over multiple states") to fix a latent bug in iter_tcp_soreuse where the sk->sk_state != TCP_LISTEN check was ignored. Add the "ss" variable to allow test code to configure which socket states to allow. [1]: https://lore.kernel.org/bpf/20250502161528.264630-1-jordan@xxxxxxxx/ Jordan Rife (12): bpf: tcp: Make mem flags configurable through bpf_iter_tcp_realloc_batch bpf: tcp: Make sure iter->batch always contains a full bucket snapshot bpf: tcp: Get rid of st_bucket_done bpf: tcp: Use bpf_tcp_iter_batch_item for bpf_tcp_iter_state batch items bpf: tcp: Avoid socket skips and repeats during iteration selftests/bpf: Add tests for bucket resume logic in listening sockets selftests/bpf: Allow for iteration over multiple ports selftests/bpf: Allow for iteration over multiple states selftests/bpf: Make ehash buckets configurable in socket iterator tests selftests/bpf: Create established sockets in socket iterator tests selftests/bpf: Create iter_tcp_destroy test program selftests/bpf: Add tests for bucket resume logic in established sockets net/ipv4/tcp_ipv4.c | 269 ++++++++--- .../bpf/prog_tests/sock_iter_batch.c | 450 +++++++++++++++++- .../selftests/bpf/progs/sock_iter_batch.c | 37 +- 3 files changed, 672 insertions(+), 84 deletions(-) -- 2.43.0