Re: [PATCH v1 bpf-next 03/10] bpf: tcp: Get rid of st_bucket_done

Martin KaFai Lau <martin.lau@xxxxxxxxx> · Fri, 23 May 2025 15:07:32 -0700

On 5/22/25 1:42 PM, Kuniyuki Iwashima wrote:
From: Jordan Rife <jordan@xxxxxxxx>
Date: Thu, 22 May 2025 11:16:13 -0700
  static void bpf_iter_tcp_put_batch(struct bpf_tcp_iter_state *iter)
  {
-	while (iter->cur_sk < iter->end_sk)
-		sock_gen_put(iter->batch[iter->cur_sk++]);
+	unsigned int cur_sk = iter->cur_sk;
+
+	while (cur_sk < iter->end_sk)
+		sock_gen_put(iter->batch[cur_sk++]);

Why is this chunk included in this patch?

This should be in patch 5, to keep cur_sk for find_cookie.

Without this, iter->cur_sk is mutated when iteration stops, and we lose
our place. When iteration resumes and we call bpf_iter_tcp_batch(), the
iter->cur_sk == iter->end_sk condition will always be true, so we will
skip to the next bucket without seeking to the offset.

Before, we relied on st_bucket_done to tell us whether there were items
left to process in the current bucket; now we need to preserve
iter->cur_sk across iterations to keep the behavior equivalent.

Thanks for the explanation. I was confused by tcp_seek_last_pos() being
called multiple times, and I think we need to preserve/restore st->offset
too in patch 2, and we need this change:

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index ac00015d5e7a..0816f20bfdff 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2791,6 +2791,7 @@ static void *tcp_seek_last_pos(struct seq_file *seq)
  			break;
  		st->bucket = 0;
  		st->state = TCP_SEQ_STATE_ESTABLISHED;
+		offset = 0;

This seems like an existing bug, not necessarily related to this set.

Patch 5 also removes the tcp_seek_last_pos() dependency, so I think this
can be a standalone fix.


  		fallthrough;
  	case TCP_SEQ_STATE_ESTABLISHED:
  		if (st->bucket > hinfo->ehash_mask)

Let's say we are resuming at offset 10 in the last lhash bucket, but 3
sockets have disappeared. We then move on to the ehash part with a
non-zero leftover offset (3), which overwrites st->offset with 3.

If the ehash bucket does not fit into the batch, we need to allocate a
larger batch and retry, but the offset on the retry (3) differs from the
one on the first try (10).