Re: nft_queues.sh failures

Florian Westphal <fw@xxxxxxxxx> · Thu, 22 May 2025 16:35:16 +0200

Paolo Abeni <pabeni@xxxxxxxxxx> wrote:
> On 5/22/25 3:53 PM, Jakub Kicinski wrote:
> > On Thu, 22 May 2025 12:09:01 +0200 Paolo Abeni wrote:
> >> Recently the nipa CI infra went through some tuning, and the mentioned
> >> self-test now often fails.
> >>
> >> As I could not find any applied or pending relevant change, I have a
> >> vague suspect that the timeout applied to the server command now
> >> triggers due to different timing. Could you please have a look?
> > 
> > Oh, I was just staring at:
> > https://lore.kernel.org/all/20250522031835.4395-1-shiming.cheng@xxxxxxxxxxxx/
> > do you think it's not that?

It is, thanks Jakub!

With my updated test case, it does pass, but see for yourself:
# PASS: sctp and nfqueue in forward chain (duration: 118s)
# PASS: sctp and nfqueue in output chain with GSO (duration: 56s)

(the old timeout was 60s, so this would FAIL without the updated test).

plain net-next/main:
# PASS: sctp and nfqueue in forward chain (duration: 42s)
# PASS: sctp and nfqueue in output chain with GSO (duration: 21s)

I haven't debugged this yet, but I'd guess that some packets get
corrupted when nfqueue segments GSO skbs, thus forcing retransmits.

> It's not obvious to me. The failing test case is:
> 
> tcp via loopback and re-queueing
> 
> There should be no S/W segmentation there, as the loopback interface
> exposes TSO.

The nfqueue test also forces software segmentation, even for lo, so that
the userspace listener gets non-aggregated packets. (It's possible to
disable this so that 'large packets' get queued to userspace; this
selftest also covers that case for tcp.)

> @Florian, I'm sorry I should have mentioned explicitly the failing test
> before. Sample failures:
> 
> https://netdev-3.bots.linux.dev/vmksft-nf/results/131921/2-nft-queue-sh/stdout
> https://netdev-3.bots.linux.dev/vmksft-nf/results/131741/2-nft-queue-sh/stdout

Both show sctp failing:

# PASS: tcp via loopback and re-queueing

---> tcp loopback passes

# 2025/05/22 05:11:46 socat[32441] E write(7, 0x55ca6b34e000, 8192): Connection reset by peer
# cmp: EOF on /tmp/tmp.1LVNFztWUK after byte 50208768, in line 1
# FAIL: sctp forward: input and output file differ
#  Input file-rw------- 1 root root 209715200 May 22 05:10 /tmp/tmp.teqIUO7Jfh
# Output file-rw------- 1 root root 50208768 May 22 05:11 /tmp/tmp.1LVNFztWUK
# 2025/05/22 05:12:46 socat[32459] E write(7, 0x561110e23000, 8192): Connection reset by peer
# cmp: EOF on /tmp/tmp.1LVNFztWUK after byte 36528128, in line 1
# FAIL: sctp output: input and output file differ

So it's sctp+nfqueue that's failing.
And it does seem to be related to the pending patch pointed out by
Jakub.

> > I'll hide both that patch and Florian's fix from the queue for now,
> > for a test.
> 
> Fine by me.

I'll resend the update tomorrow, keeping the OLD timeout of 60s. I think
keeping track of the 'transmit time' in the test log archives could be
useful in the future.
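
To illustrate the idea of logging the transmit time next to the result,
here is a rough sketch. It is not the code from the selftest update:
the helper name run_timed, the message format and the use of timeout(1)
around an arbitrary command are assumptions made for illustration only.

```shell
#!/bin/bash
# Hypothetical helper (not taken from nft_queue.sh): run a command under
# a time limit and report how long it took, pass or fail.
run_timed()
{
	local name="$1"		# label printed in the log line
	local limit="$2"	# time limit in seconds, passed to timeout(1)
	shift 2

	local start end duration rc
	start=$(date +%s)
	timeout "$limit" "$@"
	rc=$?
	end=$(date +%s)
	duration=$((end - start))

	if [ "$rc" -eq 0 ]; then
		echo "PASS: $name (duration: ${duration}s)"
	else
		# timeout(1) exits with 124 when the limit was hit
		echo "FAIL: $name (duration: ${duration}s, exit code $rc)"
	fi
	return "$rc"
}

# Example: a trivial command under the 60s limit discussed above.
run_timed "example transfer" 60 true
```

With the duration printed in every result line, a slowdown such as the
42s -> 118s change above shows up in the archived logs even while the
test still passes.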

> I was wondering about this timeout specifically:
> 
> https://elixir.bootlin.com/linux/v6.15-rc7/source/tools/testing/selftests/net/netfilter/nft_queue.sh#L329

5s isn't that short; lo is supposed to be fast. (The userspace prog
asks for GSO packets, so no s/w segmentation should happen, but even
with GSO segmentation I would not expect it to fail.)

I would prefer to keep the 5s for tcp; I don't recall this being a
problem in the past.