On Tue, Sep 9, 2025 at 2:00 AM Peilin Ye <yepeilin@xxxxxxxxxx> wrote:
>
> On Mon, Sep 08, 2025 at 03:51:00PM -0700, Alexei Starovoitov wrote:
> > On Mon, Sep 8, 2025 at 3:42 PM Peilin Ye <yepeilin@xxxxxxxxxx> wrote:
> > > Just in case - actually there was a patch that does this:
> > >
> > > [2] https://lore.kernel.org/bpf/20250905061919.439648-1-yepeilin@xxxxxxxxxx/
> > >
> > > It was then superseded by the patches you linked [1] above however,
> > > since per discussion in [2], "use bpf_mem_alloc() to skip memcg
> > > accounting because it can trigger hardlockups" is a workaround instead
> > > of a proper fix.
> > >
> > > I wonder if this new issue on PREEMPT_RT would justify [2] over [1]?
> > > IIUC, until kmalloc_nolock() becomes available:
> > >
> > > [1] (plus Leon's patch here) means no bpf_timer on PREEMPT_RT, but we
> > > still have memcg accounting for non-PREEMPT_RT; [2] means no memcg
> > > accounting.
> >
> > I didn't comment on the above statement earlier, because
> > I thought you meant "no memcg accounting _inline_",
> > but reading above it sounds that you think that bpf_mem_alloc()
> > doesn't do memcg accounting at all ?
> > That's incorrect. bpf_mem_alloc() always uses memcg accounting
>
> Ah, I see - kernel/bpf/memalloc.c:alloc_bulk() via irq_work. Thanks for
> the correction!
>
> > , but the usage is nuanced. bpf_global_ma is counted towards root memcg,
> > since it's created during boot. While hash map powered by bpf_mem_alloc
> > is using memcg of the user that created that map.
>
> - - -
>
> IIUC, this "sleeping function called from invalid context" message on
> PREEMPT_RT is because ___slab_alloc() does local_lock_irqsave(), with
> IRQ disabled by __bpf_async_init():
>
>         __bpf_spin_lock_irqsave(&async->lock);
>         t = async->timer;
>         if (t) {
>                 ret = -EBUSY;
>                 goto out;
>         }
>
>         /* allocate hrtimer via map_kmalloc to use memcg accounting */
>         cb = bpf_map_kmalloc_node(map, size, __GFP_HIGH, map->numa_node);
>
> For my understanding, is/how is kmalloc_nolock() going to resolve this?
> Patch [3] changes ___slab_alloc() to:
>
>         /* must check again c->slab in case we got preempted and it changed */
> -       local_lock_irqsave(&s->cpu_slab->lock, flags);
> +       local_lock_cpu_slab(s, &flags);
>
> But for PREEMPT_RT, local_lock_cpu_slab() still does
> local_lock_irqsave(), and the comment says that we can't call it with
> IRQ disabled:
>
> + * On PREEMPT_RT an invocation is not possible from IRQ-off or preempt
> + * disabled context. The lock will always be acquired and if needed it
> + * block and sleep until the lock is available.

Of course. Not saying that s/kmalloc/kmalloc_nolock/ is a silver bullet.
bpf_timer needs other work. __bpf_spin_lock_irqsave() needs to go, etc.
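
To make the failure mode concrete, the path Peilin quotes above can be
condensed like this (an illustrative sketch of the call chain, not
verbatim kernel code; all names are the ones quoted in the thread):

	__bpf_async_init()
	  __bpf_spin_lock_irqsave(&async->lock);    /* hardirqs now off */
	  bpf_map_kmalloc_node(map, size, __GFP_HIGH, map->numa_node)
	    ...
	      ___slab_alloc()
	        local_lock_irqsave(&s->cpu_slab->lock, flags);
	        /* !PREEMPT_RT: only disables IRQs on this CPU - fine.
	         * PREEMPT_RT: this local_lock is a sleeping lock, and
	         * acquiring it with hardirqs already disabled is what
	         * produces the "sleeping function called from invalid
	         * context" splat.
	         */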
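
The two local_lock_irqsave() flavors differ roughly as follows
(paraphrased from include/linux/local_lock_internal.h; simplified, and
details vary by kernel version):

	/* !PREEMPT_RT: local_lock_t is (modulo lockdep annotations)
	 * empty, and local_lock_irqsave() boils down to
	 * local_irq_save(), which is always legal in atomic context.
	 */

	/* PREEMPT_RT: local_lock_t wraps a spinlock_t, which is an
	 * rt_mutex-based sleeping lock; hardirqs stay enabled and the
	 * acquirer may block - hence the "not possible from IRQ-off or
	 * preempt disabled context" rule quoted above.
	 */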
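
And going back to the memcg point: the deferred accounting in
kernel/bpf/memalloc.c works roughly like this (a from-memory sketch of
alloc_bulk(); exact helper names and gfp flags may differ across
kernel versions):

	/* irq_work callback refilling a per-CPU free list */
	alloc_bulk(c, cnt, node)
	  memcg = get_memcg(c);      /* objcg captured when the
	                              * bpf_mem_alloc was created:
	                              * boot time (root memcg) for
	                              * bpf_global_ma, the map creator's
	                              * memcg for hashtab caches */
	  old_memcg = set_active_memcg(memcg);
	  for (i = 0; i < cnt; i++)
	          obj = __alloc(c, node, gfp);  /* gfp includes
	                                         * __GFP_ACCOUNT, so the
	                                         * objects are charged to
	                                         * the active memcg */
	  set_active_memcg(old_memcg);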