Re: [PATCH bpf] bpf/helpers: Skip memcg accounting in __bpf_async_init()

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Fri, 5 Sep 2025 08:18:25 -0700

On Thu, Sep 4, 2025 at 11:20 PM Peilin Ye <yepeilin@xxxxxxxxxx> wrote:
>
> Calling bpf_map_kmalloc_node() from __bpf_async_init() can cause various
> locking issues; see the following stack trace (edited for style) as one
> example:
>
> ...
>  [10.011566]  do_raw_spin_lock.cold
>  [10.011570]  try_to_wake_up             (5) double-acquiring the same
>  [10.011575]  kick_pool                      rq_lock, causing a hardlockup
>  [10.011579]  __queue_work
>  [10.011582]  queue_work_on
>  [10.011585]  kernfs_notify
>  [10.011589]  cgroup_file_notify
>  [10.011593]  try_charge_memcg           (4) memcg accounting raises an
>  [10.011597]  obj_cgroup_charge_pages        MEMCG_MAX event
>  [10.011599]  obj_cgroup_charge_account
>  [10.011600]  __memcg_slab_post_alloc_hook
>  [10.011603]  __kmalloc_node_noprof
> ...
>  [10.011611]  bpf_map_kmalloc_node
>  [10.011612]  __bpf_async_init
>  [10.011615]  bpf_timer_init             (3) BPF calls bpf_timer_init()
>  [10.011617]  bpf_prog_xxxxxxxxxxxxxxxx_fcg_runnable
>  [10.011619]  bpf__sched_ext_ops_runnable
>  [10.011620]  enqueue_task_scx           (2) BPF runs with rq_lock held
>  [10.011622]  enqueue_task
>  [10.011626]  ttwu_do_activate
>  [10.011629]  sched_ttwu_pending         (1) grabs rq_lock
> ...
>
> The above was reproduced on bpf-next (b338cf849ec8) by modifying
> ./tools/sched_ext/scx_flatcg.bpf.c to call bpf_timer_init() during
> ops.runnable(), and hacking [1] the memcg accounting code a bit to make
> a bpf_timer_init() call much more likely to raise a MEMCG_MAX event.
>
> We have also run into other similar variants both internally (without
> applying the [1] hack) and on bpf-next, including:
>
>  * run_timer_softirq() -> cgroup_file_notify()
>    (grabs cgroup_file_kn_lock) -> try_to_wake_up() ->
>    BPF calls bpf_timer_init() -> bpf_map_kmalloc_node() ->
>    try_charge_memcg() raises MEMCG_MAX ->
>    cgroup_file_notify() (tries to grab cgroup_file_kn_lock again)
>
>  * __queue_work() (grabs worker_pool::lock) -> try_to_wake_up() ->
>    BPF calls bpf_timer_init() -> bpf_map_kmalloc_node() ->
>    try_charge_memcg() raises MEMCG_MAX -> cgroup_file_notify() ->
>    __queue_work() (tries to grab the same worker_pool::lock)
>  ...
>
> As pointed out by Kumar, we can use bpf_mem_alloc() and friends for
> bpf_hrtimer and bpf_work, to skip memcg accounting.

This is a short-term workaround that we shouldn't take.
Long term, bpf_mem_alloc() will use kmalloc_nolock(), and memcg
accounting has already been made to work from any context. The one
exception is the memcg_memory_event() path, which wasn't converted.

Shakeel,

Any suggestions on how memcg_memory_event()->cgroup_file_notify()
can be fixed?
Can we just trylock and skip the event?
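
To make the "trylock and skip" idea concrete, here is a minimal userspace
sketch using pthreads. It is not kernel code and not a proposed patch:
notify_lock stands in for cgroup_file_kn_lock, and notify_event_trylock()
and notify_event_nested() are hypothetical names invented for the
illustration. The assumption is that dropping a notification is acceptable
when the lock is already held.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for cgroup_file_kn_lock: a plain, non-recursive lock. */
static pthread_mutex_t notify_lock = PTHREAD_MUTEX_INITIALIZER;

static unsigned long delivered;	/* only updated with notify_lock held */
static atomic_ulong skipped;	/* updated without the lock */

/*
 * Hypothetical analog of cgroup_file_notify(): never waits for the lock.
 * If the lock is busy (possibly held by our own caller), count the event
 * as skipped and return instead of deadlocking.
 */
static bool notify_event_trylock(void)
{
	if (pthread_mutex_trylock(&notify_lock) != 0) {
		atomic_fetch_add(&skipped, 1);
		return false;
	}

	delivered++;		/* the real notification work would go here */
	pthread_mutex_unlock(&notify_lock);
	return true;
}

/*
 * Models the re-entrant case from the report: the lock is already held
 * when a nested path tries to raise another event. With a blocking lock
 * this would self-deadlock; with trylock the nested event is dropped.
 */
static bool notify_event_nested(void)
{
	bool inner;

	pthread_mutex_lock(&notify_lock);
	inner = notify_event_trylock();
	pthread_mutex_unlock(&notify_lock);

	return inner;
}
```

Calling notify_event_trylock() with the lock free delivers the event, while
notify_event_nested() returns false and bumps the skipped counter. The
cost of this approach is that a watcher of the event file may miss a
notification, so whether that is tolerable is part of the question.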