On Thu, Sep 4, 2025 at 11:20 PM Peilin Ye <yepeilin@xxxxxxxxxx> wrote:
>
> Calling bpf_map_kmalloc_node() from __bpf_async_init() can cause various
> locking issues; see the following stack trace (edited for style) as one
> example:
>
> ...
>  [10.011566]  do_raw_spin_lock.cold
>  [10.011570]  try_to_wake_up             (5) double-acquiring the same
>  [10.011575]  kick_pool                      rq_lock, causing a hardlockup
>  [10.011579]  __queue_work
>  [10.011582]  queue_work_on
>  [10.011585]  kernfs_notify
>  [10.011589]  cgroup_file_notify
>  [10.011593]  try_charge_memcg           (4) memcg accounting raises an
>  [10.011597]  obj_cgroup_charge_pages        MEMCG_MAX event
>  [10.011599]  obj_cgroup_charge_account
>  [10.011600]  __memcg_slab_post_alloc_hook
>  [10.011603]  __kmalloc_node_noprof
> ...
>  [10.011611]  bpf_map_kmalloc_node
>  [10.011612]  __bpf_async_init
>  [10.011615]  bpf_timer_init             (3) BPF calls bpf_timer_init()
>  [10.011617]  bpf_prog_xxxxxxxxxxxxxxxx_fcg_runnable
>  [10.011619]  bpf__sched_ext_ops_runnable
>  [10.011620]  enqueue_task_scx           (2) BPF runs with rq_lock held
>  [10.011622]  enqueue_task
>  [10.011626]  ttwu_do_activate
>  [10.011629]  sched_ttwu_pending         (1) grabs rq_lock
> ...
>
> The above was reproduced on bpf-next (b338cf849ec8) by modifying
> ./tools/sched_ext/scx_flatcg.bpf.c to call bpf_timer_init() during
> ops.runnable(), and hacking [1] the memcg accounting code a bit to make
> it (much more likely to) raise an MEMCG_MAX event from a
> bpf_timer_init() call.
>
> We have also run into other similar variants both internally (without
> applying the [1] hack) and on bpf-next, including:
>
>  * run_timer_softirq() -> cgroup_file_notify()
>    (grabs cgroup_file_kn_lock) -> try_to_wake_up() ->
>    BPF calls bpf_timer_init() -> bpf_map_kmalloc_node() ->
>    try_charge_memcg() raises MEMCG_MAX ->
>    cgroup_file_notify() (tries to grab cgroup_file_kn_lock again)
>
>  * __queue_work() (grabs worker_pool::lock) -> try_to_wake_up() ->
>    BPF calls bpf_timer_init() -> bpf_map_kmalloc_node() ->
>    try_charge_memcg() raises MEMCG_MAX -> m() ->
>    __queue_work() (tries to grab the same worker_pool::lock)
>  ...
>
> As pointed out by Kumar, we can use bpf_mem_alloc() and friends for
> bpf_hrtimer and bpf_work, to skip memcg accounting.

This is a short-term workaround that we shouldn't take.
Long term, bpf_mem_alloc() will use kmalloc_nolock() with memcg
accounting, which was already made to work from any context,
except that the memcg_memory_event() path wasn't converted.

Shakeel,

Any suggestions on how memcg_memory_event()->cgroup_file_notify()
can be fixed?
Can we just trylock and skip the event?
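
To make the "trylock and skip" question concrete, here is a minimal
userspace sketch of the pattern. It is not kernel code and not a proposed
patch: notify_lock only stands in for cgroup_file_kn_lock, and every name
in it (notify_lock, file_notify_try(), stats) is hypothetical. The nested
call models a notification that is raised again while the lock is already
held, as in the variants quoted above. Whether dropping such an event is
acceptable for memcg is exactly the open question.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for cgroup_file_kn_lock (assumption, not kernel code). */
static atomic_flag notify_lock = ATOMIC_FLAG_INIT;

/* Counters so the effect of skipping is observable. */
static struct {
	unsigned int delivered;
	unsigned int skipped;
} stats;

/* Try to take the lock once; never spins or blocks. */
static bool notify_trylock(void)
{
	return !atomic_flag_test_and_set_explicit(&notify_lock,
						  memory_order_acquire);
}

static void notify_unlock(void)
{
	atomic_flag_clear_explicit(&notify_lock, memory_order_release);
}

/*
 * Deliver one event, or skip it if the lock is already held.
 * 'nested' > 0 simulates the event being raised again from inside
 * the locked region (e.g. wakeup -> BPF -> allocation -> memcg event).
 * Returns true if this call delivered its event, false if it skipped.
 */
static bool file_notify_try(int nested)
{
	if (!notify_trylock()) {
		/* A blocking lock would deadlock here; drop the event instead. */
		stats.skipped++;
		return false;
	}

	stats.delivered++;
	if (nested > 0)
		file_notify_try(nested - 1);	/* re-entry while lock is held */

	notify_unlock();
	return true;
}
```

With this shape, the outer call always makes progress, and the re-entrant
call is counted as skipped rather than deadlocking on the lock it would
otherwise try to take a second time.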