On Fri, Sep 05, 2025 at 08:18:25AM -0700, Alexei Starovoitov wrote:
> On Thu, Sep 4, 2025 at 11:20 PM Peilin Ye <yepeilin@xxxxxxxxxx> wrote:
> >
> > Calling bpf_map_kmalloc_node() from __bpf_async_init() can cause various
> > locking issues; see the following stack trace (edited for style) as one
> > example:
> >
> > ...
> > [10.011566]   do_raw_spin_lock.cold
> > [10.011570]   try_to_wake_up              (5) double-acquiring the same
> > [10.011575]   kick_pool                       rq_lock, causing a hardlockup
> > [10.011579]   __queue_work
> > [10.011582]   queue_work_on
> > [10.011585]   kernfs_notify
> > [10.011589]   cgroup_file_notify
> > [10.011593]   try_charge_memcg            (4) memcg accounting raises an
> > [10.011597]   obj_cgroup_charge_pages         MEMCG_MAX event
> > [10.011599]   obj_cgroup_charge_account
> > [10.011600]   __memcg_slab_post_alloc_hook
> > [10.011603]   __kmalloc_node_noprof
> > ...
> > [10.011611]   bpf_map_kmalloc_node
> > [10.011612]   __bpf_async_init
> > [10.011615]   bpf_timer_init              (3) BPF calls bpf_timer_init()
> > [10.011617]   bpf_prog_xxxxxxxxxxxxxxxx_fcg_runnable
> > [10.011619]   bpf__sched_ext_ops_runnable
> > [10.011620]   enqueue_task_scx            (2) BPF runs with rq_lock held
> > [10.011622]   enqueue_task
> > [10.011626]   ttwu_do_activate
> > [10.011629]   sched_ttwu_pending          (1) grabs rq_lock
> > ...
> >
> > The above was reproduced on bpf-next (b338cf849ec8) by modifying
> > ./tools/sched_ext/scx_flatcg.bpf.c to call bpf_timer_init() during
> > ops.runnable(), and hacking [1] the memcg accounting code a bit to make
> > it (much more likely to) raise an MEMCG_MAX event from a
> > bpf_timer_init() call.
> >
> > We have also run into other similar variants, both internally (without
> > applying the [1] hack) and on bpf-next, including:
> >
> >  * run_timer_softirq() -> cgroup_file_notify()
> >    (grabs cgroup_file_kn_lock) -> try_to_wake_up() ->
> >    BPF calls bpf_timer_init() -> bpf_map_kmalloc_node() ->
> >    try_charge_memcg() raises MEMCG_MAX ->
> >    cgroup_file_notify() (tries to grab cgroup_file_kn_lock again)
> >
> >  * __queue_work() (grabs worker_pool::lock) -> try_to_wake_up() ->
> >    BPF calls bpf_timer_init() -> bpf_map_kmalloc_node() ->
> >    try_charge_memcg() raises MEMCG_MAX -> m() ->
> >    __queue_work() (tries to grab the same worker_pool::lock)
> > ...
> >
> > As pointed out by Kumar, we can use bpf_mem_alloc() and friends for
> > bpf_hrtimer and bpf_work, to skip memcg accounting.
>
> This is a short-term workaround that we shouldn't take.
> Long term, bpf_mem_alloc() will use kmalloc_nolock() and
> memcg accounting, which was already made to work from any context,
> except that the path of memcg_memory_event() wasn't converted.
>
> Shakeel,
>
> Any suggestions how memcg_memory_event()->cgroup_file_notify()
> can be fixed?
> Can we just trylock and skip the event?

Will !gfpflags_allow_spinning(gfp_mask) be able to detect such call
chains?  If yes, then we can change memcg_memory_event() to skip calls
to cgroup_file_notify() if spinning is not allowed.