On Fri, Sep 05, 2025 at 08:18:25AM -0700, Alexei Starovoitov wrote:
> On Thu, Sep 4, 2025 at 11:20 PM Peilin Ye <yepeilin@xxxxxxxxxx> wrote:
> >
> > Calling bpf_map_kmalloc_node() from __bpf_async_init() can cause various
> > locking issues; see the following stack trace (edited for style) as one
> > example:
> >
> > ...
> > [10.011566]   do_raw_spin_lock.cold
> > [10.011570]   try_to_wake_up              (5) double-acquiring the same
> > [10.011575]   kick_pool                       rq_lock, causing a hardlockup
> > [10.011579]   __queue_work
> > [10.011582]   queue_work_on
> > [10.011585]   kernfs_notify
> > [10.011589]   cgroup_file_notify
> > [10.011593]   try_charge_memcg            (4) memcg accounting raises an
> > [10.011597]   obj_cgroup_charge_pages         MEMCG_MAX event
> > [10.011599]   obj_cgroup_charge_account
> > [10.011600]   __memcg_slab_post_alloc_hook
> > [10.011603]   __kmalloc_node_noprof
> > ...
> > [10.011611]   bpf_map_kmalloc_node
> > [10.011612]   __bpf_async_init
> > [10.011615]   bpf_timer_init              (3) BPF calls bpf_timer_init()
> > [10.011617]   bpf_prog_xxxxxxxxxxxxxxxx_fcg_runnable
> > [10.011619]   bpf__sched_ext_ops_runnable
> > [10.011620]   enqueue_task_scx            (2) BPF runs with rq_lock held
> > [10.011622]   enqueue_task
> > [10.011626]   ttwu_do_activate
> > [10.011629]   sched_ttwu_pending          (1) grabs rq_lock
> > ...
> >
> > The above was reproduced on bpf-next (b338cf849ec8) by modifying
> > ./tools/sched_ext/scx_flatcg.bpf.c to call bpf_timer_init() during
> > ops.runnable(), and hacking [1] the memcg accounting code a bit to make
> > it (much more likely to) raise an MEMCG_MAX event from a
> > bpf_timer_init() call.
> >
> > We have also run into other similar variants, both internally (without
> > applying the [1] hack) and on bpf-next, including:
> >
> >  * run_timer_softirq() -> cgroup_file_notify()
> >    (grabs cgroup_file_kn_lock) -> try_to_wake_up() ->
> >    BPF calls bpf_timer_init() -> bpf_map_kmalloc_node() ->
> >    try_charge_memcg() raises MEMCG_MAX ->
> >    cgroup_file_notify() (tries to grab cgroup_file_kn_lock again)
> >
> >  * __queue_work() (grabs worker_pool::lock) -> try_to_wake_up() ->
> >    BPF calls bpf_timer_init() -> bpf_map_kmalloc_node() ->
> >    try_charge_memcg() raises MEMCG_MAX -> m() ->
> >    __queue_work() (tries to grab the same worker_pool::lock)
> > ...
> >
> > As pointed out by Kumar, we can use bpf_mem_alloc() and friends for
> > bpf_hrtimer and bpf_work, to skip memcg accounting.
>
> This is a short-term workaround that we shouldn't take.
> Long term, bpf_mem_alloc() will use kmalloc_nolock() and
> memcg accounting, which was already made to work from any context,
> except that the path of memcg_memory_event() wasn't converted.
>
> Shakeel,
>
> Any suggestions how memcg_memory_event()->cgroup_file_notify()
> can be fixed?
> Can we just trylock and skip the event?

Will !gfpflags_allow_spinning(gfp_mask) be able to detect such call
chains?  If yes, then we can change memcg_memory_event() to skip calls
to cgroup_file_notify() if spinning is not allowed.