Re: [BUG] Deadlock triggered by bpfsnoop funcgraph feature

Leon Hwang <leon.hwang@xxxxxxxxx> · Thu, 28 Aug 2025 10:40:47 +0800

On 28/8/25 08:42, Alexei Starovoitov wrote:
> On Tue, Aug 26, 2025 at 7:58 PM Leon Hwang <leon.hwang@xxxxxxxxx> wrote:
>>
>>
>>
>> On 27/8/25 10:23, Alexei Starovoitov wrote:
>>> On Tue, Aug 26, 2025 at 7:13 PM Leon Hwang <leon.hwang@xxxxxxxxx> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I’ve encountered a reproducible deadlock while developing the funcgraph
>>>> feature for bpfsnoop [0].
>>>
>>> debug it pls.
>>
>> It’s quite difficult for me. I’ve tried debugging it but didn’t succeed.
>>
>>> Sounds like you're implying that the root cause is in bpf,
>>> but why do you think so?
>>>
>>> You're attaching to things that shouldn't be attached to.
>>> Like rcu_lockdep_current_cpu_online()
>>> so effectively you're recursing in that lockdep code.
>>> See big lock there. It will dead lock for sure.
>>
>> If a function that acquires a lock can be traced by a tracing program,
>> bpfsnoop’s funcgraph will attempt to trace it as well. In such cases, a
>> deadlock is highly likely to occur.
>>
>> With bpfsnoop I try my best to avoid such deadlock issues. But what
>> about other bpf tracing tools? If they don’t handle this properly, the
>> kernel is very likely to crash.
> 
> bpf infra is trying hard not to crash it, but debug kernel is a different
> category. rcu_read_lock_held() doesn't exist in production kernels.
> You can propose adding "notrace" for it, but in general that doesn't scale.
> Same with rcu_lockdep_current_cpu_online().
> It probably deserves "notrace" too.

Indeed, it doesn't scale.

When I run

  ./bpfsnoop -k "htab_*_elem" --output-fgraph --fgraph-debug \
      --fgraph-exclude 'rcu_read_lock_*held,rcu_lockdep_current_cpu_online,*raw_spin_*lock*,kvfree,show_stack,put_task_stack'

the kernel doesn’t panic, but the OS eventually stalls and becomes
unresponsive to key presses.
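
To make the coverage of that exclude list concrete, here is a minimal
sketch that checks a few kernel symbol names against the same patterns.
It assumes the patterns behave like shell-style globs, which may not
match bpfsnoop's actual matching rules; is_excluded() and the sample
names are only for illustration.

```python
from fnmatch import fnmatchcase

# Pattern list copied from the --fgraph-exclude argument above.
EXCLUDE = (
    "rcu_read_lock_*held,rcu_lockdep_current_cpu_online,"
    "*raw_spin_*lock*,kvfree,show_stack,put_task_stack"
).split(",")


def is_excluded(func, patterns=EXCLUDE):
    """Hypothetical helper: True if func matches any glob pattern.

    Assumes shell-style, case-sensitive globbing; bpfsnoop's real
    matching rules may differ.
    """
    return any(fnmatchcase(func, pattern) for pattern in patterns)


if __name__ == "__main__":
    # Sample kernel symbols, chosen only for illustration.
    samples = (
        "rcu_read_lock_held",
        "rcu_read_lock_bh_held",
        "_raw_spin_lock_irqsave",
        "_raw_spin_unlock",
        "queued_spin_lock_slowpath",
        "htab_map_update_elem",
    )
    for name in samples:
        verdict = "excluded" if is_excluded(name) else "traced"
        print(f"{name:28} {verdict}")
```

Under that assumption, a lock helper such as queued_spin_lock_slowpath
is not covered by any of the patterns, which is one way to see why a
hand-maintained exclude list doesn't scale.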

It seems preferable to avoid running BPF programs continuously in such
cases.

Thanks,
Leon