Re: [PATCH bpf] bpf/helpers: Skip memcg accounting in __bpf_async_init()

On Fri, Sep 05, 2025 at 10:31:07AM -0700, Shakeel Butt wrote:
> On Fri, Sep 05, 2025 at 08:18:25AM -0700, Alexei Starovoitov wrote:
> > On Thu, Sep 4, 2025 at 11:20 PM Peilin Ye <yepeilin@xxxxxxxxxx> wrote:
> > >
> > > Calling bpf_map_kmalloc_node() from __bpf_async_init() can cause various
> > > locking issues; see the following stack trace (edited for style) as one
> > > example:
> > >
> > > ...
> > >  [10.011566]  do_raw_spin_lock.cold
> > >  [10.011570]  try_to_wake_up             (5) double-acquiring the same
> > >  [10.011575]  kick_pool                      rq_lock, causing a hardlockup
> > >  [10.011579]  __queue_work
> > >  [10.011582]  queue_work_on
> > >  [10.011585]  kernfs_notify
> > >  [10.011589]  cgroup_file_notify
> > >  [10.011593]  try_charge_memcg           (4) memcg accounting raises an
> > >  [10.011597]  obj_cgroup_charge_pages        MEMCG_MAX event
> > >  [10.011599]  obj_cgroup_charge_account
> > >  [10.011600]  __memcg_slab_post_alloc_hook
> > >  [10.011603]  __kmalloc_node_noprof
> > > ...
> > >  [10.011611]  bpf_map_kmalloc_node
> > >  [10.011612]  __bpf_async_init
> > >  [10.011615]  bpf_timer_init             (3) BPF calls bpf_timer_init()
> > >  [10.011617]  bpf_prog_xxxxxxxxxxxxxxxx_fcg_runnable
> > >  [10.011619]  bpf__sched_ext_ops_runnable
> > >  [10.011620]  enqueue_task_scx           (2) BPF runs with rq_lock held
> > >  [10.011622]  enqueue_task
> > >  [10.011626]  ttwu_do_activate
> > >  [10.011629]  sched_ttwu_pending         (1) grabs rq_lock
> > > ...
> > >
> > > The above was reproduced on bpf-next (b338cf849ec8) by modifying
> > > ./tools/sched_ext/scx_flatcg.bpf.c to call bpf_timer_init() during
> > > ops.runnable(), and hacking [1] the memcg accounting code a bit to make
> > > it (much more likely to) raise an MEMCG_MAX event from a
> > > bpf_timer_init() call.
> > >
> > > We have also run into other similar variants both internally (without
> > > applying the [1] hack) and on bpf-next, including:
> > >
> > >  * run_timer_softirq() -> cgroup_file_notify()
> > >    (grabs cgroup_file_kn_lock) -> try_to_wake_up() ->
> > >    BPF calls bpf_timer_init() -> bpf_map_kmalloc_node() ->
> > >    try_charge_memcg() raises MEMCG_MAX ->
> > >    cgroup_file_notify() (tries to grab cgroup_file_kn_lock again)
> > >
> > >  * __queue_work() (grabs worker_pool::lock) -> try_to_wake_up() ->
> > >    BPF calls bpf_timer_init() -> bpf_map_kmalloc_node() ->
> > >    try_charge_memcg() raises MEMCG_MAX -> cgroup_file_notify() ->
> > >    __queue_work() (tries to grab the same worker_pool::lock)
> > >  ...
> > >
> > > As pointed out by Kumar, we can use bpf_mem_alloc() and friends for
> > > bpf_hrtimer and bpf_work, to skip memcg accounting.
> > 
> > This is a short term workaround that we shouldn't take.
> > Long term bpf_mem_alloc() will use kmalloc_nolock() and
> > memcg accounting that was already made to work from any context
> > except that the path of memcg_memory_event() wasn't converted.
> > 
> > Shakeel,
> > 
> > Any suggestions how memcg_memory_event()->cgroup_file_notify()
> > can be fixed?
> > Can we just trylock and skip the event?
> 
> Will !gfpflags_allow_spinning(gfp_mask) be able to detect such call
> chains? If yes, then we can change memcg_memory_event() to skip calls to
> cgroup_file_notify() if spinning is not allowed.

Along with using __GFP_HIGH instead of GFP_ATOMIC in __bpf_async_init(),
we need the following patch:


