On Wed, Sep 3, 2025 at 10:51 PM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote:
>
> On 9/3/25 10:08 AM, Kuniyuki Iwashima wrote:
> > On Wed, Sep 3, 2025 at 9:59 AM Kuniyuki Iwashima <kuniyu@xxxxxxxxxx> wrote:
> >>
> >> On Tue, Sep 2, 2025 at 1:49 PM Kuniyuki Iwashima <kuniyu@xxxxxxxxxx> wrote:
> >>>
> >>> On Tue, Sep 2, 2025 at 1:26 PM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote:
> >>>>
> >>>> On 8/28/25 6:00 PM, Kuniyuki Iwashima wrote:
> >>>>> The test does the following for IPv4/IPv6 x TCP/UDP sockets,
> >>>>> with and without a BPF prog:
> >>>>>
> >>>>>   1. Create socket pairs
> >>>>>   2. Send a bunch of data that requires more than 256 pages
> >>>>>   3. Read memory_allocated from the 3rd column in /proc/net/protocols
> >>>>>   4. Check if unread data is charged to memory_allocated
> >>>>>
> >>>>> If a BPF prog is attached, memory_allocated should not change,
> >>>>> but we allow a small error (up to 10 pages) in case other processes
> >>>>> on the host use some amount of TCP/UDP memory.
> >>>>>
> >>>>> At 2., the test actually sends more than 1024 pages because the sysctl
> >>>>> net.core.mem_pcpu_rsv is 256 by default, which means 256 pages are
> >>>>> buffered per CPU before being reported to sk->sk_prot->memory_allocated:
> >>>>>
> >>>>>   BUF_SINGLE (1024) * NR_SEND (64) * NR_SOCKETS (64) / 4096 = 1024 pages
> >>>>>
> >>>>> When I reduced it to 512 pages, the following assertion for the
> >>>>> non-isolated case got flaky:
> >>>>>
> >>>>>   ASSERT_GT(memory_allocated[1], memory_allocated[0] + 256, ...)
> >>>>>
> >>>>> Another contributor to slowness is the 150ms sleep to make sure one
> >>>>> RCU grace period passes, because the UDP recv queue is destroyed
> >>>>> after that.
> >>>>
> >>>> There is a kern_sync_rcu() in testing_helpers.c.
> >>>
> >>> Nice helper :) Will use it.
> >>>
> >>>>>
> >>>>> # time ./test_progs -t sk_memcg
> >>>>> #370/1 sk_memcg/TCP :OK
> >>>>> #370/2 sk_memcg/UDP :OK
> >>>>> #370/3 sk_memcg/TCPv6 :OK
> >>>>> #370/4 sk_memcg/UDPv6 :OK
> >>>>> #370 sk_memcg:OK
> >>>>> Summary: 1/4 PASSED, 0 SKIPPED, 0 FAILED
> >>>>>
> >>>>> real 0m1.214s
> >>>>> user 0m0.014s
> >>>>> sys 0m0.318s
> >>>>
> >>>> Thanks. It finished much faster in my setup too, compared with the
> >>>> earlier revision. However, it is a bit flaky when I run it in a loop:
> >>>>
> >>>>   check_isolated:FAIL:not isolated unexpected not isolated: actual 861 <= expected 861
> >>>>
> >>>> I can usually hit this at around the 40th iteration.
> >>>
> >>> Oh.. I tested ~10 times manually but will try a tight loop.
> >>
> >> This didn't reproduce on my QEMU with or without --enable-kvm.
> >>
> >> Changing the assert from _GT to _GE would address the very case
> >> above, but I'm not sure if it's enough.
> >
> > I doubled NR_SEND and it was still faster with kern_sync_rcu()
> > than with usleep(), so I'll simply double NR_SEND in v5:
> >
> >   # time ./test_progs -t sk_memcg
> >   ...
> >   Summary: 1/4 PASSED, 0 SKIPPED, 0 FAILED
> >
> >   real 0m0.483s
> >   user 0m0.010s
> >   sys 0m0.191s
> >
> >> Does the bpf CI run tests repeatedly, or is this only a manual
> >> scenario?
>
> I haven't seen bpf CI hit it yet. It is in my manual bash while loop. It
> should not be dismissed so easily; some flaky CI tests were eventually
> reproduced in a loop and then fixed.
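FWIW, this is roughly the kern_sync_rcu() switch I meant for v5. The
helper just wraps the membarrier(2) syscall, which runs synchronize_rcu()
in the kernel, so the test waits for a real grace period instead of
sleeping 150ms. A sketch only; the exact context in the test will differ:

	-	/* Wait for an RCU grace period so the UDP recv queue
	-	 * destruction has finished before reading memory_allocated.
	-	 */
	-	usleep(1000 * 150);
	+	/* kern_sync_rcu() -> membarrier(2) -> synchronize_rcu() */
	+	ASSERT_OK(kern_sync_rcu(), "kern_sync_rcu");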
> I kept the bash loop running this time until I grep-ed a "0" from the
> error output:
>
>   check_isolated:FAIL:not isolated unexpected not isolated: actual 0 <= expected 256
>
> The "long memory_allocated[2]" values read from /proc/net/protocols are
> printed as 0 but are probably actually negative:
>
>   static inline long
>   proto_memory_allocated(const struct proto *prot)
>   {
>   	return max(0L, atomic_long_read(prot->memory_allocated));
>   }
>
> prot->memory_allocated could be negative afaict but is printed as 0 in
> /proc/net/protocols. Even if the machine is network-quiet after test_progs
> starts, "prot->memory_allocated" and "proto->per_cpu_fw_alloc" could be in
> some random state before test_progs starts. When I hit "0", it takes some
> effort to send random traffic to the machine to get the test working
> again. :(
>
> Also, after reading the selftest closer, I am not sure I understand why
> "+ 256". "proto->per_cpu_fw_alloc" can start at -255 or +255.

Actually I didn't expect the random state and assumed the test's local
communication would complete on the same CPU, hence 0~255.

Do you see the flakiness with net.core.mem_pcpu_rsv=0? The per-CPU cache
is just for performance; I think it's not critical for testing, so it's
fine to set it to 0 during the test.

> I don't think changing NR_SEND helps here. It needs a better way. Maybe
> some functions can be traced such that prot->memory_allocated can be read
> directly? If the fentry and fexit of that function see different
> memory_allocated values, then the test could also become more
> straightforward.

Maybe like this? Not yet tested, but we could attach a prog to
sock_init_data() or somewhere else and trigger it with an additional
socket(2) call:

	memory_allocated = sk->sk_prot->memory_allocated;

	nr_cpu = bpf_num_possible_cpus();
	for (i = 0; i < nr_cpu; i++) {
		per_cpu_fw_alloc = bpf_per_cpu_ptr(sk->sk_prot->per_cpu_fw_alloc, i);
		if (per_cpu_fw_alloc)
			memory_allocated += *per_cpu_fw_alloc;
	}

per_cpu_fw_alloc might have been added to sk_prot->memory_allocated during
the loop, so it's still not 100% accurate. Probably we should set
net.core.mem_pcpu_rsv=0 and stress memory_allocated before the actual test
to drain per_cpu_fw_alloc (at least on the testing CPU).
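Fleshing the prog side of that sketch out a bit, it might look like the
following. Completely untested: I'm not sure the verifier accepts
bpf_per_cpu_ptr() on a per-cpu pointer loaded from sk_prot, and the
section choice, MAX_CPUS bound, and global names are all placeholders:

	#include "vmlinux.h"
	#include <bpf/bpf_helpers.h>
	#include <bpf/bpf_tracing.h>

	#define MAX_CPUS	1024

	int nr_cpu;		/* set from userspace before attach */
	long memory_allocated;	/* read back after triggering socket(2) */

	SEC("fentry/sock_init_data")
	int BPF_PROG(get_memory_allocated, struct socket *sock, struct sock *sk)
	{
		/* sk_prot is a macro for __sk_common.skc_prot in the kernel */
		struct proto *prot = sk->__sk_common.skc_prot;
		int *per_cpu_fw_alloc;
		int i;

		/* memory_allocated is an atomic_long_t pointer, so read
		 * the counter rather than the pointer itself.
		 */
		memory_allocated = prot->memory_allocated->counter;

		for (i = 0; i < MAX_CPUS; i++) {
			if (i >= nr_cpu)
				break;

			per_cpu_fw_alloc = bpf_per_cpu_ptr(prot->per_cpu_fw_alloc, i);
			if (per_cpu_fw_alloc)
				memory_allocated += *per_cpu_fw_alloc;
		}

		return 0;
	}

	char _license[] SEC("license") = "GPL";

Userspace would set skel->bss->nr_cpu (e.g. from
libbpf_num_possible_cpus()), attach, call socket(2), and read
memory_allocated back from the skeleton, instead of parsing
/proc/net/protocols.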
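And for the net.core.mem_pcpu_rsv=0 part, the test setup could be as
simple as this before creating the sockets (a sketch; the real test
should save the old value and restore it on teardown):

	static int set_mem_pcpu_rsv(const char *val)
	{
		/* net.core.mem_pcpu_rsv maps to this procfs path */
		FILE *f = fopen("/proc/sys/net/core/mem_pcpu_rsv", "w");

		if (!f)
			return -1;
		fputs(val, f);
		return fclose(f);
	}

With the reserve at 0, charges should be reflected in
sk_prot->memory_allocated much more promptly instead of sitting in
per_cpu_fw_alloc, which would also shrink the slack the assertions need.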