Re: [PATCH v4 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Tue, 5 Aug 2025 19:40:31 -0700

On Tue, Jul 22, 2025 at 8:52 AM Harry Yoo <harry.yoo@xxxxxxxxxx> wrote:
>

Sorry for the delay. PTO plus merge window.

> On Thu, Jul 17, 2025 at 07:16:46PM -0700, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@xxxxxxxxxx>
> >
> > kmalloc_nolock() relies on ability of local_trylock_t to detect
> > the situation when per-cpu kmem_cache is locked.
>
> I think kmalloc_nolock() should be kmalloc_node_nolock() because
> it has `node` parameter?
>
> # Don't specify NUMA node       # Specify NUMA node
> kmalloc(size, gfp)              kmalloc_nolock(size, gfp)
> kmalloc_node(size, gfp, node)   kmalloc_node_nolock(size, gfp, node)
>
> ...just like kmalloc() and kmalloc_node()?

I think this is a wrong pattern to follow.
All this "_node" suffix/rename was done long ago to avoid massive
search-and-replace when NUMA was introduced. Now NUMA is mandatory.
The user of API should think what numa_id to use and if none
they should explicitly say NUMA_NO_NODE.
Hiding behind macros is not a good api.
I hate the "_node_align" suffixes too. It's a silly convention.
Nothing in the kernel follows such an outdated naming scheme.
mm folks should follow what the rest of the kernel does
instead of following a pattern from 20 years ago.

> > In !PREEMPT_RT local_(try)lock_irqsave(&s->cpu_slab->lock, flags)
> > disables IRQs and marks s->cpu_slab->lock as acquired.
> > local_lock_is_locked(&s->cpu_slab->lock) returns true when
> > slab is in the middle of manipulating per-cpu cache
> > of that specific kmem_cache.
> >
> > kmalloc_nolock() can be called from any context and can re-enter
> > into ___slab_alloc():
> >   kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> NMI -> bpf ->
> >     kmalloc_nolock() -> ___slab_alloc(cache_B)
> > or
> >   kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> tracepoint/kprobe -> bpf ->
> >     kmalloc_nolock() -> ___slab_alloc(cache_B)
>
> > Hence the caller of ___slab_alloc() checks if &s->cpu_slab->lock
> > can be acquired without a deadlock before invoking the function.
> > If that specific per-cpu kmem_cache is busy the kmalloc_nolock()
> > retries in a different kmalloc bucket. The second attempt will
> > likely succeed, since this cpu locked different kmem_cache.
> >
> > Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
> > per-cpu rt_spin_lock is locked by current _task_. In this case
> > re-entrance into the same kmalloc bucket is unsafe, and
> > kmalloc_nolock() tries a different bucket that is most likely is
> > not locked by the current task. Though it may be locked by a
> > different task it's safe to rt_spin_lock() and sleep on it.
> >
> > Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL
> > immediately if called from hard irq or NMI in PREEMPT_RT.
>
> A question; I was confused for a while thinking "If it can't be called
> from NMI and hard irq on PREEMPT_RT, why it can't just spin?"

It's not safe due to PI issues in RT.
Steven and Sebastian explained it earlier:
https://lore.kernel.org/bpf/20241213124411.105d0f33@xxxxxxxxxxxxxxxxxx/

I don't think I can copy paste the multi page explanation in
commit log or into comments.
So "not safe in NMI or hard irq on RT" is the summary.
Happy to add a few words, but don't know what exactly to say.
If Steven/Sebastian can provide a paragraph I can add it.

> And I guess it's because even in process context, when kmalloc_nolock()
> is called by bpf, it can be called by the task that is holding the local lock
> and thus spinning is not allowed. Is that correct?
>
> > kfree_nolock() defers freeing to irq_work when local_lock_is_locked()
> > and (in_nmi() or in PREEMPT_RT).
> >
> > SLUB_TINY config doesn't use local_lock_is_locked() and relies on
> > spin_trylock_irqsave(&n->list_lock) to allocate,
> > while kfree_nolock() always defers to irq_work.
> >
> > Note, kfree_nolock() must be called _only_ for objects allocated
> > with kmalloc_nolock(). Debug checks (like kmemleak and kfence)
> > were skipped on allocation, hence obj = kmalloc(); kfree_nolock(obj);
> > will miss kmemleak/kfence book keeping and will cause false positives.
> > large_kmalloc is not supported by either kmalloc_nolock()
> > or kfree_nolock().
> >
> > Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx>
> > ---
> >  include/linux/kasan.h |  13 +-
> >  include/linux/slab.h  |   4 +
> >  mm/Kconfig            |   1 +
> >  mm/kasan/common.c     |   5 +-
> >  mm/slab.h             |   6 +
> >  mm/slab_common.c      |   3 +
> >  mm/slub.c             | 466 +++++++++++++++++++++++++++++++++++++-----
> >  7 files changed, 445 insertions(+), 53 deletions(-)
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 54444bce218e..7de6da4ee46d 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -1982,6 +1983,7 @@ static inline void init_slab_obj_exts(struct slab *slab)
> >  int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> >                       gfp_t gfp, bool new_slab)
> >  {
> > +     bool allow_spin = gfpflags_allow_spinning(gfp);
> >       unsigned int objects = objs_per_slab(s, slab);
> >       unsigned long new_exts;
> >       unsigned long old_exts;
> > @@ -1990,8 +1992,14 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> >       gfp &= ~OBJCGS_CLEAR_MASK;
> >       /* Prevent recursive extension vector allocation */
> >       gfp |= __GFP_NO_OBJ_EXT;
> > -     vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
> > -                        slab_nid(slab));
> > +     if (unlikely(!allow_spin)) {
> > +             size_t sz = objects * sizeof(struct slabobj_ext);
> > +
> > +             vec = kmalloc_nolock(sz, __GFP_ZERO, slab_nid(slab));
>
> In free_slab_obj_exts(), how do you know slabobj_ext is allocated via
> kmalloc_nolock() or kcalloc_node()?

Technically kmalloc_nolock()->kfree() isn't as bad as
kmalloc()->kfree_nolock(), since kmemleak/kfence can ignore
debug free-ing action without matching alloc side,
but you're right it's better to avoid it.

> I was going to say "add a new flag to enum objext_flags",
> but all lower 3 bits of slab->obj_exts pointer are already in use? oh...
>
> Maybe need a magic trick to add one more flag,
> like always align the size with 16?
>
> In practice that should not lead to increase in memory consumption
> anyway because most of the kmalloc-* sizes are already at least
> 16 bytes aligned.

Yes. That's an option, but I think we can do better.
OBJEXTS_ALLOC_FAIL doesn't need to consume the bit.
Here are two patches that fix this issue:

Subject: [PATCH 1/2] slab: Reuse first bit for OBJEXTS_ALLOC_FAIL

Since the combination of valid upper bits in slab->obj_exts with
OBJEXTS_ALLOC_FAIL bit can never happen,
use OBJEXTS_ALLOC_FAIL == (1ull << 0) as a magic sentinel
instead of (1ull << 2) to free up bit 2.

Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx>
---
 include/linux/memcontrol.h | 4 +++-
 mm/slub.c                  | 2 +-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 785173aa0739..daa78665f850 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -341,17 +341,19 @@ enum page_memcg_data_flags {
        __NR_MEMCG_DATA_FLAGS  = (1UL << 2),
 };

+#define __OBJEXTS_ALLOC_FAIL   MEMCG_DATA_OBJEXTS
 #define __FIRST_OBJEXT_FLAG    __NR_MEMCG_DATA_FLAGS

 #else /* CONFIG_MEMCG */

+#define __OBJEXTS_ALLOC_FAIL   (1UL << 0)
 #define __FIRST_OBJEXT_FLAG    (1UL << 0)

 #endif /* CONFIG_MEMCG */

 enum objext_flags {
        /* slabobj_ext vector failed to allocate */
-       OBJEXTS_ALLOC_FAIL = __FIRST_OBJEXT_FLAG,
+       OBJEXTS_ALLOC_FAIL = __OBJEXTS_ALLOC_FAIL,
        /* the next bit after the last actual flag */
        __NR_OBJEXTS_FLAGS  = (__FIRST_OBJEXT_FLAG << 1),
 };
diff --git a/mm/slub.c b/mm/slub.c
index bd4bf2613e7a..16e53bfb310e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1950,7 +1950,7 @@ static inline void
handle_failed_objexts_alloc(unsigned long obj_exts,
         * objects with no tag reference. Mark all references in this
         * vector as empty to avoid warnings later on.
         */
-       if (obj_exts & OBJEXTS_ALLOC_FAIL) {
+       if (obj_exts == OBJEXTS_ALLOC_FAIL) {
                unsigned int i;

                for (i = 0; i < objects; i++)
--
2.47.3

Subject: [PATCH 2/2] slab: Use kfree_nolock() to free obj_exts

Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx>
---
 include/linux/memcontrol.h | 1 +
 mm/slub.c                  | 7 ++++++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index daa78665f850..2e6c33fdd9c5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -354,6 +354,7 @@ enum page_memcg_data_flags {
 enum objext_flags {
     /* slabobj_ext vector failed to allocate */
     OBJEXTS_ALLOC_FAIL = __OBJEXTS_ALLOC_FAIL,
+    OBJEXTS_NOSPIN_ALLOC = __FIRST_OBJEXT_FLAG,
     /* the next bit after the last actual flag */
     __NR_OBJEXTS_FLAGS  = (__FIRST_OBJEXT_FLAG << 1),
 };
diff --git a/mm/slub.c b/mm/slub.c
index 16e53bfb310e..417d647f1f02 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2009,6 +2009,8 @@ int alloc_slab_obj_exts(struct slab *slab,
struct kmem_cache *s,
     }

     new_exts = (unsigned long)vec;
+    if (unlikely(!allow_spin))
+        new_exts |= OBJEXTS_NOSPIN_ALLOC;
 #ifdef CONFIG_MEMCG
     new_exts |= MEMCG_DATA_OBJEXTS;
 #endif
@@ -2056,7 +2058,10 @@ static inline void free_slab_obj_exts(struct slab *slab)
      * the extension for obj_exts is expected to be NULL.
      */
     mark_objexts_empty(obj_exts);
-    kfree(obj_exts);
+    if (unlikely(READ_ONCE(slab->obj_exts) & OBJEXTS_NOSPIN_ALLOC))
+        kfree_nolock(obj_exts);
+    else
+        kfree(obj_exts);
     slab->obj_exts = 0;
 }

--
2.47.3


> > +     } else {
> > +             vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
> > +                                slab_nid(slab));
> > +     }
> >       if (!vec) {
> >               /* Mark vectors which failed to allocate */
> >               if (new_slab)
> > +static void defer_deactivate_slab(struct slab *slab, void *flush_freelist);
> > +
> >  /*
> >   * Called only for kmem_cache_debug() caches to allocate from a freshly
> >   * allocated slab. Allocate a single object instead of whole freelist
> >   * and put the slab to the partial (or full) list.
> >   */
> > -static void *alloc_single_from_new_slab(struct kmem_cache *s,
> > -                                     struct slab *slab, int orig_size)
> > +static void *alloc_single_from_new_slab(struct kmem_cache *s, struct slab *slab,
> > +                                     int orig_size, gfp_t gfpflags)
> >  {
> > +     bool allow_spin = gfpflags_allow_spinning(gfpflags);
> >       int nid = slab_nid(slab);
> >       struct kmem_cache_node *n = get_node(s, nid);
> >       unsigned long flags;
> >       void *object;
> >
> > +     if (!allow_spin && !spin_trylock_irqsave(&n->list_lock, flags)) {
>
> I think alloc_debug_processing() doesn't have to be called under
> n->list_lock here because it is a new slab?
>
> That means the code can be something like:
>
> /* allocate one object from slab */
> object = slab->freelist;
> slab->freelist = get_freepointer(s, object);
> slab->inuse = 1;
>
> /* Leak slab if debug checks fails */
> if (!alloc_debug_processing())
>         return NULL;
>
> /* add slab to per-node partial list */
> if (allow_spin) {
>         spin_lock_irqsave();
> } else if (!spin_trylock_irqsave()) {
>         slab->frozen = 1;
>         defer_deactivate_slab();
> }

No. That doesn't work. I implemented it this way
before reverting back to spin_trylock_irqsave() in the beginning.
The problem is alloc_debug_processing() will likely succeed
and undoing it is pretty complex.
So it's better to "!allow_spin && !spin_trylock_irqsave()"
before doing expensive and hard to undo alloc_debug_processing().

>
> > +             /* Unlucky, discard newly allocated slab */
> > +             slab->frozen = 1;
> > +             defer_deactivate_slab(slab, NULL);
> > +             return NULL;
> > +     }
> >
> >       object = slab->freelist;
> >       slab->freelist = get_freepointer(s, object);
> >       slab->inuse = 1;
> >
> > -     if (!alloc_debug_processing(s, slab, object, orig_size))
> > +     if (!alloc_debug_processing(s, slab, object, orig_size)) {
> >               /*
> >                * It's not really expected that this would fail on a
> >                * freshly allocated slab, but a concurrent memory
> >                * corruption in theory could cause that.
> > +              * Leak memory of allocated slab.
> >                */
> > +             if (!allow_spin)
> > +                     spin_unlock_irqrestore(&n->list_lock, flags);
> >               return NULL;
> > +     }
> >
> > -     spin_lock_irqsave(&n->list_lock, flags);
> > +     if (allow_spin)
> > +             spin_lock_irqsave(&n->list_lock, flags);
> >
> >       if (slab->inuse == slab->objects)
> >               add_full(s, n, slab);
> > @@ -3164,6 +3201,44 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
> >       }
> >  }
> >
> > +/*
> > + * ___slab_alloc()'s caller is supposed to check if kmem_cache::kmem_cache_cpu::lock
> > + * can be acquired without a deadlock before invoking the function.
> > + *
> > + * Without LOCKDEP we trust the code to be correct. kmalloc_nolock() is
> > + * using local_lock_is_locked() properly before calling local_lock_cpu_slab(),
> > + * and kmalloc() is not used in an unsupported context.
> > + *
> > + * With LOCKDEP, on PREEMPT_RT lockdep does its checking in local_lock_irqsave().
> > + * On !PREEMPT_RT we use trylock to avoid false positives in NMI, but
> > + * lockdep_assert() will catch a bug in case:
> > + * #1
> > + * kmalloc() -> ___slab_alloc() -> irqsave -> NMI -> bpf -> kmalloc_nolock()
> > + * or
> > + * #2
> > + * kmalloc() -> ___slab_alloc() -> irqsave -> tracepoint/kprobe -> bpf -> kmalloc_nolock()
> > + *
> > + * On PREEMPT_RT an invocation is not possible from IRQ-off or preempt
> > + * disabled context. The lock will always be acquired and if needed it
> > + * block and sleep until the lock is available.
> > + * #1 is possible in !PREEMP_RT only.
>
> s/PREEMP_RT/PREEMPT_RT/

ok.

> > + * #2 is possible in both with a twist that irqsave is replaced with rt_spinlock:
> > + * kmalloc() -> ___slab_alloc() -> rt_spin_lock(kmem_cache_A) ->
> > + *    tracepoint/kprobe -> bpf -> kmalloc_nolock() -> rt_spin_lock(kmem_cache_B)
> > + *
> > + * local_lock_is_locked() prevents the case kmem_cache_A == kmem_cache_B
> > + */
> > +#if defined(CONFIG_PREEMPT_RT) || !defined(CONFIG_LOCKDEP)
> > +#define local_lock_cpu_slab(s, flags)        \
> > +     local_lock_irqsave(&(s)->cpu_slab->lock, flags)
> > +#else
> > +#define local_lock_cpu_slab(s, flags)        \
> > +     lockdep_assert(local_trylock_irqsave(&(s)->cpu_slab->lock, flags))
> > +#endif
> > +
> > +#define local_unlock_cpu_slab(s, flags)      \
> > +     local_unlock_irqrestore(&(s)->cpu_slab->lock, flags)
> > +
> >  #ifdef CONFIG_SLUB_CPU_PARTIAL
> >  static void __put_partials(struct kmem_cache *s, struct slab *partial_slab)
> >  {
> > @@ -3732,9 +3808,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> >       if (unlikely(!node_match(slab, node))) {
> >               /*
> >                * same as above but node_match() being false already
> > -              * implies node != NUMA_NO_NODE
> > +              * implies node != NUMA_NO_NODE.
> > +              * Reentrant slub cannot take locks necessary to
> > +              * deactivate_slab, hence ignore node preference.
>
> Now that we have defer_deactivate_slab(), we need to either update the
> code or comment?
>
> 1. Deactivate slabs when node / pfmemalloc mismatches
> or 2. Update comments to explain why it's still undesirable

Well, defer_deactivate_slab() is a heavy hammer.
In !SLUB_TINY it pretty much never happens.

This bit:

retry_load_slab:

        local_lock_cpu_slab(s, flags);
        if (unlikely(c->slab)) {

is very rare. I couldn't trigger it at all in my stress test.

But in this hunk the node mismatch is not rare, so ignoring node preference
for kmalloc_nolock() is a much better trade off.
I'll add a comment that defer_deactivate_slab() is undesired here.

> > +              * kmalloc_nolock() doesn't allow __GFP_THISNODE.
> >                */
> > -             if (!node_isset(node, slab_nodes)) {
> > +             if (!node_isset(node, slab_nodes) ||
> > +                 !allow_spin) {
> >                       node = NUMA_NO_NODE;
> >               } else {
> >                       stat(s, ALLOC_NODE_MISMATCH);
>
> > @@ -4572,6 +4769,98 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
> >       discard_slab(s, slab);
> >  }
> >
> > +/*
> > + * In PREEMPT_RT irq_work runs in per-cpu kthread, so it's safe
> > + * to take sleeping spin_locks from __slab_free() and deactivate_slab().
> > + * In !PREEMPT_RT irq_work will run after local_unlock_irqrestore().
> > + */
> > +static void free_deferred_objects(struct irq_work *work)
> > +{
> > +     struct defer_free *df = container_of(work, struct defer_free, work);
> > +     struct llist_head *objs = &df->objects;
> > +     struct llist_head *slabs = &df->slabs;
> > +     struct llist_node *llnode, *pos, *t;
> > +
> > +     if (llist_empty(objs) && llist_empty(slabs))
> > +             return;
> > +
> > +     llnode = llist_del_all(objs);
> > +     llist_for_each_safe(pos, t, llnode) {
> > +             struct kmem_cache *s;
> > +             struct slab *slab;
> > +             void *x = pos;
> > +
> > +             slab = virt_to_slab(x);
> > +             s = slab->slab_cache;
> > +
> > +             /*
> > +              * We used freepointer in 'x' to link 'x' into df->objects.
> > +              * Clear it to NULL to avoid false positive detection
> > +              * of "Freepointer corruption".
> > +              */
> > +             *(void **)x = NULL;
> > +
> > +             /* Point 'x' back to the beginning of allocated object */
> > +             x -= s->offset;
> > +             /*
> > +              * memcg, kasan_slab_pre are already done for 'x'.
> > +              * The only thing left is kasan_poison.
> > +              */
> > +             kasan_slab_free(s, x, false, false, true);
> > +             __slab_free(s, slab, x, x, 1, _THIS_IP_);
> > +     }
> > +
> > +     llnode = llist_del_all(slabs);
> > +     llist_for_each_safe(pos, t, llnode) {
> > +             struct slab *slab = container_of(pos, struct slab, llnode);
> > +
> > +#ifdef CONFIG_SLUB_TINY
> > +             discard_slab(slab->slab_cache, slab);
>
> ...and with my comment on alloc_single_from_new_slab(),
> The slab may not be empty anymore?

Exactly.
That's another problem with your suggestion in alloc_single_from_new_slab().
That's why I did it as:
if (!allow_spin && !spin_trylock_irqsave(...)

and I still believe it's the right call.
The slab is empty here, so discard_slab() is appropriate.

>
> > +#else
> > +             deactivate_slab(slab->slab_cache, slab, slab->flush_freelist);
> > +#endif
> > +     }
> > +}
> > @@ -4610,10 +4901,30 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
> >       barrier();
> >
> >       if (unlikely(slab != c->slab)) {
> > -             __slab_free(s, slab, head, tail, cnt, addr);
> > +             if (unlikely(!allow_spin)) {
> > +                     /*
> > +                      * __slab_free() can locklessly cmpxchg16 into a slab,
> > +                      * but then it might need to take spin_lock or local_lock
> > +                      * in put_cpu_partial() for further processing.
> > +                      * Avoid the complexity and simply add to a deferred list.
> > +                      */
> > +                     defer_free(s, head);
> > +             } else {
> > +                     __slab_free(s, slab, head, tail, cnt, addr);
> > +             }
> >               return;
> >       }
> >
> > +     if (unlikely(!allow_spin)) {
> > +             if ((in_nmi() || !USE_LOCKLESS_FAST_PATH()) &&
> > +                 local_lock_is_locked(&s->cpu_slab->lock)) {
> > +                     defer_free(s, head);
> > +                     return;
> > +             }
> > +             cnt = 1; /* restore cnt. kfree_nolock() frees one object at a time */
> > +             kasan_slab_free(s, head, false, false, /* skip quarantine */true);
> > +     }
>
> I'm not sure what prevents below from happening
>
> 1. slab == c->slab && !allow_spin -> call kasan_slab_free()
> 2. preempted by something and resume
> 3. after acquiring local_lock, slab != c->slab, release local_lock, goto redo
> 4. !allow_spin, so defer_free() will call kasan_slab_free() again later

yes. it's possible and it's ok.
kasan_slab_free(, no_quarantine == true)
is poison_slab_object() only and one can do it many times.

> Perhaps kasan_slab_free() should be called before do_slab_free()
> just like normal free path and do not call kasan_slab_free() in deferred
> freeing (then you may need to disable KASAN while accessing the deferred
> list)?

I tried that too and didn't like it.
After kasan_slab_free() the object cannot be put on the llist easily.
One needs to do a kasan_reset_tag() dance which uglifies the code a lot.
Double kasan poison in a rare case is fine. There is no harm.