On Thu, Sep 4, 2025 at 12:48 PM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote:
>
> On 9/4/25 9:45 AM, Kuniyuki Iwashima wrote:
> > On Wed, Sep 3, 2025 at 10:51 PM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote:
> >>
> >> On 9/3/25 10:08 AM, Kuniyuki Iwashima wrote:
> >>> On Wed, Sep 3, 2025 at 9:59 AM Kuniyuki Iwashima <kuniyu@xxxxxxxxxx> wrote:
> >>>>
> >>>> On Tue, Sep 2, 2025 at 1:49 PM Kuniyuki Iwashima <kuniyu@xxxxxxxxxx> wrote:
> >>>>>
> >>>>> On Tue, Sep 2, 2025 at 1:26 PM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote:
> >>>>>>
> >>>>>> On 8/28/25 6:00 PM, Kuniyuki Iwashima wrote:
> >>>>>>> The test does the following for IPv4/IPv6 x TCP/UDP sockets
> >>>>>>> with/without a BPF prog.
> >>>>>>>
> >>>>>>> 1. Create socket pairs
> >>>>>>> 2. Send a bunch of data that requires more than 256 pages
> >>>>>>> 3. Read memory_allocated from the 3rd column in /proc/net/protocols
> >>>>>>> 4. Check if unread data is charged to memory_allocated
> >>>>>>>
> >>>>>>> If a BPF prog is attached, memory_allocated should not change,
> >>>>>>> but we allow a small error (up to 10 pages) in case other processes
> >>>>>>> on the host use some amount of TCP/UDP memory.
> >>>>>>>
> >>>>>>> At 2., the test actually sends more than 1024 pages because the sysctl
> >>>>>>> net.core.mem_pcpu_rsv is 256 by default, which means 256 pages are
> >>>>>>> buffered per cpu before being reported to sk->sk_prot->memory_allocated.
> >>>>>>>
> >>>>>>>   BUF_SINGLE (1024) * NR_SEND (64) * NR_SOCKETS (64) / 4096
> >>>>>>>     = 1024 pages
> >>>>>>>
> >>>>>>> When I reduced it to 512 pages, the following assertion for the
> >>>>>>> non-isolated case got flaky.
> >>>>>>>
> >>>>>>>   ASSERT_GT(memory_allocated[1], memory_allocated[0] + 256, ...)
> >>>>>>>
> >>>>>>> Another contributor to slowness is the 150ms sleep to make sure 1 RCU
> >>>>>>> grace period passes, because the UDP recv queue is destroyed after that.
> >>>>>>
> >>>>>> There is a kern_sync_rcu() in testing_helpers.c.
> >>>>>
> >>>>> Nice helper :) Will use it.
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> # time ./test_progs -t sk_memcg
> >>>>>>> #370/1 sk_memcg/TCP :OK
> >>>>>>> #370/2 sk_memcg/UDP :OK
> >>>>>>> #370/3 sk_memcg/TCPv6 :OK
> >>>>>>> #370/4 sk_memcg/UDPv6 :OK
> >>>>>>> #370 sk_memcg:OK
> >>>>>>> Summary: 1/4 PASSED, 0 SKIPPED, 0 FAILED
> >>>>>>>
> >>>>>>> real 0m1.214s
> >>>>>>> user 0m0.014s
> >>>>>>> sys 0m0.318s
> >>>>>>
> >>>>>> Thanks. It finished much faster in my setup as well compared with the
> >>>>>> earlier revision. However, it is a bit flaky when I run it in a loop:
> >>>>>>
> >>>>>> check_isolated:FAIL:not isolated unexpected not isolated: actual 861 <= expected 861
> >>>>>>
> >>>>>> I can usually hit this at around the ~40th iteration.
> >>>>>
> >>>>> Oh.. I tested ~10 times manually but will try in a tight loop.
> >>>>
> >>>> This didn't reproduce on my QEMU with/without --enable-kvm.
> >>>>
> >>>> Changing the assert from _GT to _GE would address the very case
> >>>> above, but I'm not sure if it's enough.
> >>>
> >>> I doubled NR_SEND, and it was still faster with kern_sync_rcu()
> >>> than with usleep(), so I'll simply double NR_SEND in v5.
> >>>
> >>> # time ./test_progs -t sk_memcg
> >>> ...
> >>> Summary: 1/4 PASSED, 0 SKIPPED, 0 FAILED
> >>> real 0m0.483s
> >>> user 0m0.010s
> >>> sys 0m0.191s
> >>>
> >>>
> >>>>
> >>>> Does the bpf CI run tests repeatedly, or is this only a manual
> >>>> scenario?
> >>
> >> I haven't seen bpf CI hit it yet. It is in my manual bash while loop. It should
> >> not be dismissed so easily. Some flaky CI tests were eventually reproduced in a
> >> loop before and fixed.
> >> I kept the bash loop running this time until it grep-ed a
> >> "0" from the error output:
> >>
> >> check_isolated:FAIL:not isolated unexpected not isolated: actual 0 <= expected 256
> >>
> >> The "long memory_allocated[2]" values read from /proc/net/protocols are
> >> printed as 0 but are probably actually negative:
> >>
> >> static inline long
> >> proto_memory_allocated(const struct proto *prot)
> >> {
> >>         return max(0L, atomic_long_read(prot->memory_allocated));
> >> }
> >>
> >> prot->memory_allocated could be negative afaict but is printed as 0 in
> >> /proc/net/protocols. Even if the machine is network-quiet after test_progs
> >> starts, "prot->memory_allocated" and "proto->per_cpu_fw_alloc" could be in
> >> some random state before test_progs starts. When I hit "0", it will take
> >> some effort to send some random traffic to the machine to get the test
> >> working again. :(
> >>
> >> Also, after reading the selftest closer, I am not sure I understand why "+ 256".
> >> The "proto->per_cpu_fw_alloc" can start at -255 or +255.
> >
> > Actually, I didn't expect the random state and assumed the test's
> > local communication would complete on the same CPU, thus 0~255.
> >
> > Do you see the flakiness with net.core.mem_pcpu_rsv=0 ?
> >
> > The per-cpu cache is just for performance, and I think it's not
> > critical for testing, so it's fine to set it to 0 during the test.
> >
> >
> >>
> >> I don't think changing NR_SEND helps here. It needs a better way. Maybe some
> >> functions can be traced such that prot->memory_allocated can be read directly?
> >> If the fentry and fexit of that function have different memory_allocated
> >> values, then the test could also become more straightforward.
> >
> > Maybe like this? It's not yet tested, but we could attach a prog to
> > sock_init_data() or somewhere else and trigger it by an additional socket(2).
> >
> >         memory_allocated = sk->sk_prot->memory_allocated;
> >         nr_cpu = bpf_num_possible_cpus();
> >
> >         for (i = 0; i < nr_cpu; i++) {
> >                 per_cpu_fw_alloc =
> >                         bpf_per_cpu_ptr(sk->sk_prot->per_cpu_fw_alloc, i);
>
> I suspect passing per_cpu_fw_alloc to bpf_per_cpu_ptr won't work for now. sk
> is trusted if it is a "tp_btf", but I don't think the verifier recognizes
> sk->sk_prot as a trusted ptr. I haven't tested it though. If the above does
> not work, try to directly use the global percpu tcp_memory_per_cpu_fw_alloc.
> Take a look at how "bpf_prog_active" is used in test_ksyms_btf.c.

Thanks for the pointer! I was looking for a way to read global vars.

> >                 if (per_cpu_fw_alloc)
> >                         memory_allocated += *per_cpu_fw_alloc;
>
> Yeah. I think figuring out the true memory_allocated value and using it as
> the before/after value should be good enough. Then there is no need to worry
> about the initial states. I wonder why proto_memory_allocated() does not do
> that for /proc/net/protocols, but I guess it may not be accurate with a lot
> of cores.

Probably, and per_cpu_fw_alloc is expected to flip quickly.

> >         }
> >
> > per_cpu_fw_alloc might have been added to sk_prot->memory_allocated
> > during the loop, so it's still not 100% accurate.
> >
> > Probably we should set net.core.mem_pcpu_rsv=0 and stress
> > memory_allocated before the actual test to drain per_cpu_fw_alloc
> > (at least on the testing CPU).
>
> I think the best is if a suitable kernel func can be traced, or to figure
> out the true memory_allocated value. At least figuring out the true
> memory_allocated seems doable. If nothing of the above works out,
> mem_pcpu_rsv=0 and pre-stress/pre-flush should help by getting
> per_cpu_fw_alloc and memory_allocated into some certain state before using
> them in the before/after result.
>
> [ Before re-spinning, need to conclude/resolve the on-going discussion in v5 first ]

I'll wait for Shakeel's response for v5.
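Following Martin's pointer, an untested sketch of the ksym approach: declare the global per-cpu variable with __ksym (the way test_ksyms_btf.c declares bpf_prog_active) and sum it into the atomic counter from a traced function. The attach point, the MAX_CPUS bound, and the global-variable wiring here are assumptions based on this thread, not verified code.

```c
/* SPDX-License-Identifier: GPL-2.0 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* Globals resolved against kernel BTF, like bpf_prog_active in
 * test_ksyms_btf.c. Names taken from net/ipv4/tcp.c.
 */
extern atomic_long_t tcp_memory_allocated __ksym;
extern int tcp_memory_per_cpu_fw_alloc __ksym;

/* Arbitrary bound; bpf_per_cpu_ptr() returns NULL for nonexistent
 * cpus, so over-iterating is harmless.
 */
#define MAX_CPUS 256

long true_memory_allocated;	/* read back from userspace via the skeleton */

SEC("fentry/sock_init_data")	/* hypothetical trigger, per the thread */
int BPF_PROG(read_true_memory_allocated)
{
	long sum = tcp_memory_allocated.counter;
	int i;

	/* Fold the per-cpu forward-alloc cache back into the global
	 * counter to approximate the true memory_allocated value.
	 */
	for (i = 0; i < MAX_CPUS; i++) {
		int *fw = bpf_per_cpu_ptr(&tcp_memory_per_cpu_fw_alloc, i);

		if (fw)
			sum += *fw;
	}

	true_memory_allocated = sum;
	return 0;
}

char _license[] SEC("license") = "GPL";
```

As noted above, the per-cpu values can still move while the loop runs, so the result is a snapshot, not an exact figure; it should still be far more stable than the clamped /proc/net/protocols column.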