On 28/8/25 08:42, Alexei Starovoitov wrote:
> On Tue, Aug 26, 2025 at 7:58 PM Leon Hwang <leon.hwang@xxxxxxxxx> wrote:
>>
>> On 27/8/25 10:23, Alexei Starovoitov wrote:
>>> On Tue, Aug 26, 2025 at 7:13 PM Leon Hwang <leon.hwang@xxxxxxxxx> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I’ve encountered a reproducible deadlock while developing the funcgraph
>>>> feature for bpfsnoop [0].
>>>
>>> debug it pls.
>>
>> It’s quite difficult for me. I’ve tried debugging it but didn’t succeed.
>>
>>> Sounds like you're implying that the root cause is in bpf,
>>> but why do you think so?
>>>
>>> You're attaching to things that shouldn't be attached to.
>>> Like rcu_lockdep_current_cpu_online()
>>> so effectively you're recursing in that lockdep code.
>>> See big lock there. It will dead lock for sure.
>>
>> If a function that acquires a lock can be traced by a tracing program,
>> bpfsnoop’s funcgraph will attempt to trace it as well. In such cases, a
>> deadlock is highly likely to occur.
>>
>> With bpfsnoop I try my best to avoid such deadlock issues. But what
>> about other bpf tracing tools? If they don’t handle this properly, the
>> kernel is very likely to crash.
>
> bpf infra is trying hard not to crash it, but debug kernel is a different
> category. rcu_read_lock_held() doesn't exist in production kernels.
> You can propose adding "notrace" for it, but in general that doesn't scale.
> Same with rcu_lockdep_current_cpu_online().
> It probably deserves "notrace" too.

Indeed, it doesn't scale.

When I run

  ./bpfsnoop -k "htab_*_elem" --output-fgraph --fgraph-debug \
      --fgraph-exclude 'rcu_read_lock_*held,rcu_lockdep_current_cpu_online,*raw_spin_*lock*,kvfree,show_stack,put_task_stack'

the kernel doesn’t panic, but the OS eventually stalls and becomes
unresponsive to key presses.

It seems preferable to avoid running BPF programs continuously in such
cases.

Thanks,
Leon