On Wed, Jul 16, 2025 at 3:58 AM Vlastimil Babka <vbabka@xxxxxxx> wrote:
>
> On 7/16/25 04:29, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@xxxxxxxxxx>
> >
> > kmalloc_nolock() relies on ability of local_lock to detect the situation
>
>                                        ^ local_trylock_t perhaps?
>
> > when it's locked.
> > In !PREEMPT_RT local_lock_is_locked() is true only when NMI happened in
> > irq saved region that protects _that specific_ per-cpu kmem_cache_cpu.
>
> It can be also true when you call it from the bpf hook, no?

Correct. Technically one can disasm ___slab_alloc(), find some instruction
after pushfq;cli, add a kprobe there, attach a bpf prog to that kprobe, and
call kmalloc_nolock() (eventually, when the bpf infra switches to
kmalloc_nolock).
I wouldn't call it malicious, but if somebody does that they are looking
for trouble. Even syzbot doesn't do such things.

> > In that case retry the operation in a different kmalloc bucket.
> > The second attempt will likely succeed, since this cpu locked
> > a different kmem_cache_cpu.
> >
> > Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
> > per-cpu rt_spin_lock is locked by current task. In this case re-entrance
> > into the same kmalloc bucket is unsafe, and kmalloc_nolock() tries
> > a different bucket that most likely is not locked by the current
> > task. Though it may be locked by a different task it's safe to
> > rt_spin_lock() on it.
> >
> > Similar to alloc_pages_nolock(), kmalloc_nolock() returns NULL
> > immediately if called from hard irq or NMI in PREEMPT_RT.
> >
> > kfree_nolock() defers freeing to irq_work when local_lock_is_locked()
> > and in_nmi() or in PREEMPT_RT.
> >
> > SLUB_TINY config doesn't use local_lock_is_locked() and relies on
> > spin_trylock_irqsave(&n->list_lock) to allocate, while kfree_nolock()
> > always defers to irq_work.
> >
> > Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx>
>
> Haven't seen an obvious bug now but will ponder it some more. Meanwhile some
> nits and maybe one bit more serious concern.
>
> > +static inline void local_lock_cpu_slab(struct kmem_cache *s, unsigned long *flags)
> > +{
> > +        /*
> > +         * ___slab_alloc()'s caller is supposed to check if kmem_cache::kmem_cache_cpu::lock
> > +         * can be acquired without a deadlock before invoking the function.
> > +         *
> > +         * On PREEMPT_RT an invocation is not possible from IRQ-off or preempt
> > +         * disabled context. The lock will always be acquired and if needed it
> > +         * will block and sleep until the lock is available.
> > +         *
> > +         * On !PREEMPT_RT allocations from any context but NMI are safe. The lock
> > +         * is always acquired with disabled interrupts, meaning it is always
> > +         * possible to acquire it.
> > +         * In NMI context it is needed to check if the lock is acquired. If it is not,
>
> This also could mention the bpf instrumentation context?

Ok.

> > +         * it is safe to acquire it. The trylock semantic is used to tell lockdep
> > +         * that we don't spin. The BUG_ON() will not trigger if it is safe to acquire
> > +         * the lock.
> > +         *
> > +         */
> > +        if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> > +                BUG_ON(!local_trylock_irqsave(&s->cpu_slab->lock, *flags));
>
> Linus might still spot the BUG_ON() and complain, lockdep_assert() would be
> safer maybe :)
> Or just use local_lock_irqsave() with !CONFIG_LOCKDEP as well.

Fair enough. Let's save one branch in the critical path.

> Nit: maybe could be a #define to avoid the unusual need for "&flags" instead
> of "flags" when calling.

When "bool allow_spin" was there in Sebastian's version it definitely
looked cleaner as a proper function, but now the
if (!IS_ENABLED(CONFIG_PREEMPT_RT)) can become #ifdef CONFIG_PREEMPT_RT
and the comment will look normal (without ugly backslashes).
So yeah, I'll convert it to a macro.
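Roughly along these lines (untested sketch; the big comment stays above the
#if, and callers pass plain "flags" instead of "&flags"):

#if defined(CONFIG_PREEMPT_RT) || !defined(CONFIG_LOCKDEP)
#define local_lock_cpu_slab(s, flags) \
        local_lock_irqsave(&(s)->cpu_slab->lock, flags)
#else
/* trylock only to tell lockdep that we don't spin; it must always succeed */
#define local_lock_cpu_slab(s, flags) \
        lockdep_assert(local_trylock_irqsave(&(s)->cpu_slab->lock, flags))
#endif

#define local_unlock_cpu_slab(s, flags) \
        local_unlock_irqrestore(&(s)->cpu_slab->lock, flags)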
> > +/**
> > + * kmalloc_nolock - Allocate an object of given size from any context.
> > + * @size: size to allocate
> > + * @gfp_flags: GFP flags. Only __GFP_ACCOUNT, __GFP_ZERO allowed.
> > + * @node: node number of the target node.
> > + *
> > + * Return: pointer to the new object or NULL in case of error.
> > + * NULL does not mean EBUSY or EAGAIN. It means ENOMEM.
> > + * There is no reason to call it again and expect !NULL.
> > + */
> > +void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
> > +{
> > +        gfp_t alloc_gfp = __GFP_NOWARN | __GFP_NOMEMALLOC | gfp_flags;
> > +        struct kmem_cache *s;
> > +        bool can_retry = true;
> > +        void *ret = ERR_PTR(-EBUSY);
> > +
> > +        VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO));
> > +
> > +        if (unlikely(!size))
> > +                return ZERO_SIZE_PTR;
> > +
> > +        if (!USE_LOCKLESS_FAST_PATH() && (in_nmi() || in_hardirq()))
>
> Nit: maybe just use explicit PREEMPT_RT checks when the code isn't about
> lockless fastpaths?

True. I wasn't sure what's better.
do_slab_free() does
    if (USE_LOCKLESS_FAST_PATH())
but it's really meant to be PREEMPT_RT, since the 'else' part doesn't make
sense otherwise. It's doing local_lock() without _irqsave(), which is
inconsistent with everything else and looks broken when one doesn't have
knowledge of local_lock_internal.h.
This patch fixes that part:
-               local_lock(&s->cpu_slab->lock);
+               local_lock_cpu_slab(s, &flags);
Here, in this hunk, if (IS_ENABLED(CONFIG_PREEMPT_RT)) might look better
indeed.

> > +                /* kmalloc_nolock() in PREEMPT_RT is not supported from irq */
> > +                return NULL;
> > +retry:
> > +        if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
> > +                return NULL;
> > +        s = kmalloc_slab(size, NULL, alloc_gfp, _RET_IP_);
> > +
> > +        if (!(s->flags & __CMPXCHG_DOUBLE) && !kmem_cache_debug(s))
> > +                /*
> > +                 * kmalloc_nolock() is not supported on architectures that
> > +                 * don't implement cmpxchg16b, but debug caches don't use
> > +                 * per-cpu slab and per-cpu partial slabs. They rely on
> > +                 * kmem_cache_node->list_lock, so kmalloc_nolock() can
> > +                 * attempt to allocate from debug caches by
> > +                 * spin_trylock_irqsave(&n->list_lock, ...)
> > +                 */
> > +                return NULL;
> > +
> > +        /*
> > +         * Do not call slab_alloc_node(), since trylock mode isn't
> > +         * compatible with slab_pre_alloc_hook/should_failslab and
> > +         * kfence_alloc. Hence call __slab_alloc_node() (at most twice)
> > +         * and slab_post_alloc_hook() directly.
> > +         *
> > +         * In !PREEMPT_RT ___slab_alloc() manipulates (freelist,tid) pair
> > +         * in irq saved region. It assumes that the same cpu will not
> > +         * __update_cpu_freelist_fast() into the same (freelist,tid) pair.
> > +         * Therefore use in_nmi() to check whether particular bucket is in
> > +         * irq protected section.
> > +         *
> > +         * If in_nmi() && local_lock_is_locked(s->cpu_slab) then it means that
> > +         * this cpu was interrupted somewhere inside ___slab_alloc() after
> > +         * it did local_lock_irqsave(&s->cpu_slab->lock, flags).
> > +         * In this case fast path with __update_cpu_freelist_fast() is not safe.
> > +         */
> > +#ifndef CONFIG_SLUB_TINY
> > +        if (!in_nmi() || !local_lock_is_locked(&s->cpu_slab->lock))
> > +#endif
> > +                ret = __slab_alloc_node(s, alloc_gfp, node, _RET_IP_, size);
>
> Nit: use IS_DEFINED(CONFIG_SLUB_TINY) to make this look better?

ok.
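For context, the tail of the function (trimmed above) then retries at most
once in the next size bucket, roughly:

        if (PTR_ERR(ret) == -EBUSY) {
                if (can_retry) {
                        /* bump the size to pick the next kmalloc bucket */
                        size = s->object_size + 1;
                        can_retry = false;
                        goto retry;
                }
                ret = NULL;
        }
        /* slab_post_alloc_hook() etc., then */
        return ret;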
> > +static void defer_deactivate_slab(struct slab *slab)
> > +{
>
> Nit: for more consistency this could take the freelist argument and assign
> it here, and not in the caller.

ok.

> > +        struct defer_free *df = this_cpu_ptr(&defer_free_objects);
> > +
> > +        if (llist_add(&slab->llnode, &df->slabs))
> > +                irq_work_queue(&df->work);
> > +}
> > +
> > +void defer_free_barrier(void)
> > +{
> > +        int cpu;
> > +
> > +        for_each_possible_cpu(cpu)
> > +                irq_work_sync(&per_cpu_ptr(&defer_free_objects, cpu)->work);
> > +}
> > +
> >  #ifndef CONFIG_SLUB_TINY
> >  /*
> >   * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
> > @@ -4575,6 +4857,8 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
> >                                           struct slab *slab, void *head, void *tail,
> >                                           int cnt, unsigned long addr)
> >  {
> > +        /* cnt == 0 signals that it's called from kfree_nolock() */
> > +        bool allow_spin = cnt;
> >          struct kmem_cache_cpu *c;
> >          unsigned long tid;
> >          void **freelist;
> > @@ -4593,10 +4877,30 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
> >          barrier();
> >
> >          if (unlikely(slab != c->slab)) {
> > -                __slab_free(s, slab, head, tail, cnt, addr);
> > +                if (unlikely(!allow_spin)) {
> > +                        /*
> > +                         * __slab_free() can locklessly cmpxchg16 into a slab,
> > +                         * but then it might need to take spin_lock or local_lock
> > +                         * in put_cpu_partial() for further processing.
> > +                         * Avoid the complexity and simply add to a deferred list.
> > +                         */
> > +                        defer_free(s, head);
> > +                } else {
> > +                        __slab_free(s, slab, head, tail, cnt, addr);
> > +                }
> >                  return;
> >          }
> >
> > +        if (unlikely(!allow_spin)) {
> > +                if ((in_nmi() || !USE_LOCKLESS_FAST_PATH()) &&
>
> Same nit about USE_LOCKLESS_FAST_PATH

Here I have to disagree, unless we fix the couple of lines below as well.

> > +                    local_lock_is_locked(&s->cpu_slab->lock)) {
> > +                        defer_free(s, head);
> > +                        return;
> > +                }
> > +                cnt = 1; /* restore cnt. kfree_nolock() frees one object at a time */
> > +                kasan_slab_free(s, head, false, false, /* skip quarantine */true);
> > +        }
> > +
> >          if (USE_LOCKLESS_FAST_PATH()) {
> >                  freelist = READ_ONCE(c->freelist);
> >
> > @@ -4607,8 +4911,10 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
> >                          goto redo;
> >                  }
> >          } else {
> > +                __maybe_unused long flags = 0;
> > +
> >                  /* Update the free list under the local lock */
> > -                local_lock(&s->cpu_slab->lock);
> > +                local_lock_cpu_slab(s, &flags);
> >                  c = this_cpu_ptr(s->cpu_slab);
> >                  if (unlikely(slab != c->slab)) {
> >                          local_unlock(&s->cpu_slab->lock);
> > @@ -4621,7 +4927,7 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
> >                  c->freelist = head;
> >                  c->tid = next_tid(tid);
> >
> > -                local_unlock(&s->cpu_slab->lock);
> > +                local_unlock_cpu_slab(s, &flags);
> >          }
> >          stat_add(s, FREE_FASTPATH, cnt);
> >  }
> > @@ -4844,6 +5150,62 @@ void kfree(const void *object)
> >  }
> >  EXPORT_SYMBOL(kfree);
> >
> > +/*
> > + * Can be called while holding raw_spinlock_t or from IRQ and NMI,
> > + * but only for objects allocated by kmalloc_nolock(),
> > + * since some debug checks (like kmemleak and kfence) were
> > + * skipped on allocation. large_kmalloc is not supported either.
> > + */
> > +void kfree_nolock(const void *object)
> > +{
> > +        struct folio *folio;
> > +        struct slab *slab;
> > +        struct kmem_cache *s;
> > +        void *x = (void *)object;
> > +
> > +        if (unlikely(ZERO_OR_NULL_PTR(object)))
> > +                return;
> > +
> > +        folio = virt_to_folio(object);
> > +        if (unlikely(!folio_test_slab(folio))) {
> > +                WARN(1, "Buggy usage of kfree_nolock");
> > +                return;
> > +        }
> > +
> > +        slab = folio_slab(folio);
> > +        s = slab->slab_cache;
> > +
> > +        memcg_slab_free_hook(s, slab, &x, 1);
> > +        alloc_tagging_slab_free_hook(s, slab, &x, 1);
> > +        /*
> > +         * Unlike slab_free() do NOT call the following:
> > +         * kmemleak_free_recursive(x, s->flags);
> > +         * debug_check_no_locks_freed(x, s->object_size);
> > +         * debug_check_no_obj_freed(x, s->object_size);
> > +         * __kcsan_check_access(x, s->object_size, ..);
> > +         * kfence_free(x);
> > +         * since they take spinlocks.
> > +         */
>
> So here's the bigger concern. What if someone allocates with regular
> kmalloc() so that the debugging stuff is performed as usual, and then tries
> to use kfree_nolock() where we skip it? You might not be planning such usage,
> but later someone can realize that only their freeing context is limited,
> find out kfree_nolock() exists and try to use it?
>
> Can we document this strongly enough? Or even enforce it somehow? Or when
> any of these kinds of debugs above are enabled, we play it safe and use
> defer_free()?

Let's break it down one by one.

1. kmemleak_free_recursive() will miss an object that was recorded during
a normal kmalloc(), and that's indeed problematic.

2. debug_check_no_locks_freed() and debug_check_no_obj_freed() are somewhat
harmless. We miss checks, but it's not breaking the corresponding features.

3. __kcsan_check_access() doesn't take locks, but its call stack is so deep
and looks to be recursive that I doubt it's safe from any context.

4. kfence_free() looks like an existing quirk. I'm not sure why it's there
in the slab free path :)
The kfence comment says:

 * KFENCE objects live in a separate page range and are not to be intermixed
 * with regular heap objects (e.g. KFENCE objects must never be added to the
 * allocator freelists). Failing to do so may and will result in heap
 * corruptions, therefore is_kfence_address() must be used to check whether
 * an object requires specific handling.

so it should always be a nop for slab. I removed the call for peace of mind.

So imo only 1 is dodgy. We can add:

        if (!(flags & SLAB_NOLEAKTRACE) && kmemleak_free_enabled)
                defer_free(..);

but it's ugly too.
My preference is to add a comment saying that only objects allocated by
kmalloc_nolock() should be freed by kfree_nolock().
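Something along these lines, exact wording to be polished:

/**
 * kfree_nolock - free an object that was allocated by kmalloc_nolock()
 * @object: pointer returned by kmalloc_nolock()
 *
 * Objects allocated by regular kmalloc()/kmem_cache_alloc() must NOT be
 * freed here, since the debug hooks (kmemleak in particular) that ran at
 * allocation time are skipped on this path.
 */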