Re: [PATCH v4 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().

On Wed, Aug 13, 2025 at 12:11:42AM +0900, Harry Yoo wrote:
> On Tue, Aug 05, 2025 at 07:40:31PM -0700, Alexei Starovoitov wrote:
> > On Tue, Jul 22, 2025 at 8:52 AM Harry Yoo <harry.yoo@xxxxxxxxxx> wrote:
> > >
> > 
> > Sorry for the delay. PTO plus merge window.
> 
> No problem! Hope you enjoyed PTO :)
> Sorry for the delay as well..
> 
> > > On Thu, Jul 17, 2025 at 07:16:46PM -0700, Alexei Starovoitov wrote:
> > > > From: Alexei Starovoitov <ast@xxxxxxxxxx>
> > > >
> > > > kmalloc_nolock() relies on the ability of local_trylock_t to detect
> > > > the situation when a per-cpu kmem_cache is locked.
> > >
> > > I think kmalloc_nolock() should be kmalloc_node_nolock() because
> > > it has `node` parameter?
> > >
> > > # Don't specify NUMA node       # Specify NUMA node
> > > kmalloc(size, gfp)              kmalloc_nolock(size, gfp)
> > > kmalloc_node(size, gfp, node)   kmalloc_node_nolock(size, gfp, node)
> > >
> > > ...just like kmalloc() and kmalloc_node()?
> > 
> > I think this is the wrong pattern to follow.
> > All this "_node" suffix/renaming was done long ago to avoid a massive
> > search-and-replace when NUMA was introduced. Now NUMA is mandatory.
> > The user of the API should think about which NUMA node to use, and if
> > there is none, they should explicitly say NUMA_NO_NODE.
> 
> You're probably right, no strong opinion from me.
> 
> > Hiding behind macros is not a good api.
> > I hate the "_node_align" suffixes too. It's a silly convention.
> > Nothing in the kernel follows such an outdated naming scheme.
> > mm folks should follow what the rest of the kernel does
> > instead of following a pattern from 20 years ago.
> 
> That's a new scheme from a very recent patch series that has not landed
> in mainline yet:
> https://lore.kernel.org/linux-mm/20250806124034.1724515-1-vitaly.wool@xxxxxxxxxxx
> 
> > > > In !PREEMPT_RT local_(try)lock_irqsave(&s->cpu_slab->lock, flags)
> > > > disables IRQs and marks s->cpu_slab->lock as acquired.
> > > > local_lock_is_locked(&s->cpu_slab->lock) returns true when
> > > > slab is in the middle of manipulating per-cpu cache
> > > > of that specific kmem_cache.
> > > >
> > > > kmalloc_nolock() can be called from any context and can re-enter
> > > > into ___slab_alloc():
> > > >   kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> NMI -> bpf ->
> > > >     kmalloc_nolock() -> ___slab_alloc(cache_B)
> > > > or
> > > >   kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> tracepoint/kprobe -> bpf ->
> > > >     kmalloc_nolock() -> ___slab_alloc(cache_B)
> > >
> > > > Hence the caller of ___slab_alloc() checks if &s->cpu_slab->lock
> > > > can be acquired without a deadlock before invoking the function.
> > > > If that specific per-cpu kmem_cache is busy, kmalloc_nolock()
> > > > retries in a different kmalloc bucket. The second attempt will
> > > > likely succeed, since this cpu locked a different kmem_cache.
> > > >
> > > > Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
> > > > the per-cpu rt_spin_lock is locked by the current _task_. In this case
> > > > re-entrance into the same kmalloc bucket is unsafe, and
> > > > kmalloc_nolock() tries a different bucket that is most likely
> > > > not locked by the current task. Though it may be locked by a
> > > > different task, it's safe to rt_spin_lock() and sleep on it.
> > > >
> > > > Similar to alloc_pages_nolock(), kmalloc_nolock() returns NULL
> > > > immediately if called from hard IRQ or NMI context in PREEMPT_RT.
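
(For other readers following along: my mental model of the alloc side is
roughly the sketch below. This is not the patch code; pick_bucket() and
alloc_from_bucket() are made-up placeholders for the real size-class
selection and the ___slab_alloc() path.)

    static void *kmalloc_nolock_sketch(size_t size, gfp_t gfp, int node)
    {
            struct kmem_cache *s = pick_bucket(size);   /* usual size class */

            if (local_lock_is_locked(&s->cpu_slab->lock)) {
                    /*
                     * !RT: this cpu is already inside ___slab_alloc() for s.
                     * RT:  the current task holds s's per-cpu rt_spin_lock.
                     * Either way, retry with a different size class.
                     */
                    s = pick_bucket(size * 2);
            }
            /* never spins on locks; may return NULL instead */
            return alloc_from_bucket(s, gfp, node);
    }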
> > >
> > > A question; I was confused for a while, thinking "If it can't be called
> > > from NMI and hard irq on PREEMPT_RT, why can't it just spin?"
> > 
> > It's not safe due to PI issues in RT.
> > Steven and Sebastian explained it earlier:
> > https://lore.kernel.org/bpf/20241213124411.105d0f33@xxxxxxxxxxxxxxxxxx/
> 
> Uh, I was totally missing that point. Thanks for pointing it out!
> 
> > I don't think I can copy paste the multi page explanation in
> > commit log or into comments.
> > So "not safe in NMI or hard irq on RT" is the summary.
> > Happy to add a few words, but don't know what exactly to say.
> > If Steven/Sebastian can provide a paragraph I can add it.
> > 
> > > And I guess it's because even in process context, when kmalloc_nolock()
> > > is called by bpf, it can be called from a task that is already holding the
> > > local lock, and thus spinning is not allowed. Is that correct?
> > >
> > > > kfree_nolock() defers freeing to irq_work when local_lock_is_locked()
> > > > and (in_nmi() or in PREEMPT_RT).
> > > >
> > > > SLUB_TINY config doesn't use local_lock_is_locked() and relies on
> > > > spin_trylock_irqsave(&n->list_lock) to allocate,
> > > > while kfree_nolock() always defers to irq_work.
> > > >
> > > > Note, kfree_nolock() must be called _only_ for objects allocated
> > > > with kmalloc_nolock(). Debug checks (like kmemleak and kfence)
> > > > were skipped on allocation, hence obj = kmalloc(); kfree_nolock(obj);
> > > > will miss kmemleak/kfence bookkeeping and will cause false positives.
> > > > large_kmalloc is not supported by either kmalloc_nolock()
> > > > or kfree_nolock().
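
[Side note for other readers: the pairing rule above boils down to roughly
the following. This is only an illustration, not code from the patch; the
middle case is discussed further down in this mail.]

    /* OK: alloc and free both skip kmemleak/kfence bookkeeping */
    obj = kmalloc_nolock(size, __GFP_ZERO, NUMA_NO_NODE);
    if (obj)
            kfree_nolock(obj);

    /* Tolerated but better avoided: kmemleak/kfence can ignore a
     * free without a matching alloc. */
    obj = kmalloc_nolock(size, __GFP_ZERO, NUMA_NO_NODE);
    if (obj)
            kfree(obj);

    /* NOT OK: kmalloc() registered obj with kmemleak/kfence and
     * kfree_nolock() skips that bookkeeping -> false positives */
    obj = kmalloc(size, GFP_KERNEL);
    kfree_nolock(obj);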
> > > >
> > > > Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx>
> > > > ---
> > > >  include/linux/kasan.h |  13 +-
> > > >  include/linux/slab.h  |   4 +
> > > >  mm/Kconfig            |   1 +
> > > >  mm/kasan/common.c     |   5 +-
> > > >  mm/slab.h             |   6 +
> > > >  mm/slab_common.c      |   3 +
> > > >  mm/slub.c             | 466 +++++++++++++++++++++++++++++++++++++-----
> > > >  7 files changed, 445 insertions(+), 53 deletions(-)
> > > >
> > > > diff --git a/mm/slub.c b/mm/slub.c
> > > > index 54444bce218e..7de6da4ee46d 100644
> > > > --- a/mm/slub.c
> > > > +++ b/mm/slub.c
> > > > @@ -1982,6 +1983,7 @@ static inline void init_slab_obj_exts(struct slab *slab)
> > > >  int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> > > >                       gfp_t gfp, bool new_slab)
> > > >  {
> > > > +     bool allow_spin = gfpflags_allow_spinning(gfp);
> > > >       unsigned int objects = objs_per_slab(s, slab);
> > > >       unsigned long new_exts;
> > > >       unsigned long old_exts;
> > > > @@ -1990,8 +1992,14 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> > > >       gfp &= ~OBJCGS_CLEAR_MASK;
> > > >       /* Prevent recursive extension vector allocation */
> > > >       gfp |= __GFP_NO_OBJ_EXT;
> > > > -     vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
> > > > -                        slab_nid(slab));
> > > > +     if (unlikely(!allow_spin)) {
> > > > +             size_t sz = objects * sizeof(struct slabobj_ext);
> > > > +
> > > > +             vec = kmalloc_nolock(sz, __GFP_ZERO, slab_nid(slab));
> > >
> > > In free_slab_obj_exts(), how do you know slabobj_ext is allocated via
> > > kmalloc_nolock() or kcalloc_node()?
> > 
> > Technically kmalloc_nolock()->kfree() isn't as bad as
> > kmalloc()->kfree_nolock(), since kmemleak/kfence can ignore
> > a debug freeing action without a matching alloc side,
> > but you're right, it's better to avoid it.
> > 
> > > I was going to say "add a new flag to enum objext_flags",
> > > but all lower 3 bits of the slab->obj_exts pointer are already in use? oh...
> > >
> > > Maybe we need a trick to add one more flag,
> > > like always aligning the size to 16?
> > >
> > > In practice that should not lead to an increase in memory consumption
> > > anyway because most of the kmalloc-* sizes are already at least
> > > 16-byte aligned.
> > 
> > Yes. That's an option, but I think we can do better.
> > OBJEXTS_ALLOC_FAIL doesn't need to consume the bit.
> > Here are two patches that fix this issue:
> > 
> > Subject: [PATCH 1/2] slab: Reuse first bit for OBJEXTS_ALLOC_FAIL
> > 
> > Since the combination of valid upper bits in slab->obj_exts with
> > OBJEXTS_ALLOC_FAIL bit can never happen,
> > use OBJEXTS_ALLOC_FAIL == (1ull << 0) as a magic sentinel
> > instead of (1ull << 2) to free up bit 2.
> > 
> > Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx>
> > ---
> 
> This will work, but it would be helpful to add a comment clarifying that
> when bit 0 is set with valid upper bits, it indicates
> MEMCG_DATA_OBJEXTS, but when the upper bits are all zero, it indicates
> OBJEXTS_ALLOC_FAIL.
> 
> When someone looks at the code without checking the history, it might not
> be obvious at first glance.
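
To be concrete, maybe something like this next to the definition (wording is
just a suggestion):

    /*
     * slab->obj_exts == OBJEXTS_ALLOC_FAIL (all upper bits zero) is a
     * sentinel meaning the slabobj_ext vector failed to allocate.
     * When valid upper (pointer) bits are set, bit 0 instead means
     * MEMCG_DATA_OBJEXTS; the two cases cannot be confused because a
     * valid vector pointer is never NULL.
     */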
> 
> >  include/linux/memcontrol.h | 4 +++-
> >  mm/slub.c                  | 2 +-
> >  2 files changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 785173aa0739..daa78665f850 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -341,17 +341,19 @@ enum page_memcg_data_flags {
> >         __NR_MEMCG_DATA_FLAGS  = (1UL << 2),
> >  };
> > 
> > +#define __OBJEXTS_ALLOC_FAIL   MEMCG_DATA_OBJEXTS
> >  #define __FIRST_OBJEXT_FLAG    __NR_MEMCG_DATA_FLAGS
> > 
> >  #else /* CONFIG_MEMCG */
> > 
> > +#define __OBJEXTS_ALLOC_FAIL   (1UL << 0)
> >  #define __FIRST_OBJEXT_FLAG    (1UL << 0)
> > 
> >  #endif /* CONFIG_MEMCG */
> > 
> >  enum objext_flags {
> >         /* slabobj_ext vector failed to allocate */
> > -       OBJEXTS_ALLOC_FAIL = __FIRST_OBJEXT_FLAG,
> > +       OBJEXTS_ALLOC_FAIL = __OBJEXTS_ALLOC_FAIL,
> >         /* the next bit after the last actual flag */
> >         __NR_OBJEXTS_FLAGS  = (__FIRST_OBJEXT_FLAG << 1),
> >  };
> > diff --git a/mm/slub.c b/mm/slub.c
> > index bd4bf2613e7a..16e53bfb310e 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -1950,7 +1950,7 @@ static inline void handle_failed_objexts_alloc(unsigned long obj_exts,
> >          * objects with no tag reference. Mark all references in this
> >          * vector as empty to avoid warnings later on.
> >          */
> > -       if (obj_exts & OBJEXTS_ALLOC_FAIL) {
> > +       if (obj_exts == OBJEXTS_ALLOC_FAIL) {
> >                 unsigned int i;
> > 
> >                 for (i = 0; i < objects; i++)
> > --
> > 2.47.3
> > 
> > Subject: [PATCH 2/2] slab: Use kfree_nolock() to free obj_exts
> > 
> > Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx>
> > ---
> >  include/linux/memcontrol.h | 1 +
> >  mm/slub.c                  | 7 ++++++-
> >  2 files changed, 7 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index daa78665f850..2e6c33fdd9c5 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -354,6 +354,7 @@ enum page_memcg_data_flags {
> >  enum objext_flags {
> >      /* slabobj_ext vector failed to allocate */
> >      OBJEXTS_ALLOC_FAIL = __OBJEXTS_ALLOC_FAIL,
> 
> /* slabobj_ext vector allocated with kmalloc_nolock() */ ?
> 
> > +    OBJEXTS_NOSPIN_ALLOC = __FIRST_OBJEXT_FLAG,
> >      /* the next bit after the last actual flag */
> >      __NR_OBJEXTS_FLAGS  = (__FIRST_OBJEXT_FLAG << 1),
> >  };
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 16e53bfb310e..417d647f1f02 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -2009,6 +2009,8 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> >      }
> > 
> >      new_exts = (unsigned long)vec;
> > +    if (unlikely(!allow_spin))
> > +        new_exts |= OBJEXTS_NOSPIN_ALLOC;
> >  #ifdef CONFIG_MEMCG
> >      new_exts |= MEMCG_DATA_OBJEXTS;
> >  #endif
> > @@ -2056,7 +2058,10 @@ static inline void free_slab_obj_exts(struct slab *slab)
> >       * the extension for obj_exts is expected to be NULL.
> >       */
> >      mark_objexts_empty(obj_exts);
> > -    kfree(obj_exts);
> > +    if (unlikely(READ_ONCE(slab->obj_exts) & OBJEXTS_NOSPIN_ALLOC))
> > +        kfree_nolock(obj_exts);
> > +    else
> > +        kfree(obj_exts);
> >      slab->obj_exts = 0;
> >  }
> > 
> > --
> > 2.47.3
> 
> Otherwise looks fine to me.
> 
> > > > +     } else {
> > > > +             vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
> > > > +                                slab_nid(slab));
> > > > +     }
> > > >       if (!vec) {
> > > >               /* Mark vectors which failed to allocate */
> > > >               if (new_slab)
> > > > +static void defer_deactivate_slab(struct slab *slab, void *flush_freelist);
> > > > +
> > > >  /*
> > > >   * Called only for kmem_cache_debug() caches to allocate from a freshly
> > > >   * allocated slab. Allocate a single object instead of whole freelist
> > > >   * and put the slab to the partial (or full) list.
> > > >   */
> > > > -static void *alloc_single_from_new_slab(struct kmem_cache *s,
> > > > -                                     struct slab *slab, int orig_size)
> > > > +static void *alloc_single_from_new_slab(struct kmem_cache *s, struct slab *slab,
> > > > +                                     int orig_size, gfp_t gfpflags)
> > > >  {
> > > > +     bool allow_spin = gfpflags_allow_spinning(gfpflags);
> > > >       int nid = slab_nid(slab);
> > > >       struct kmem_cache_node *n = get_node(s, nid);
> > > >       unsigned long flags;
> > > >       void *object;
> > > >
> > > > +     if (!allow_spin && !spin_trylock_irqsave(&n->list_lock, flags)) {
> > >
> > > I think alloc_debug_processing() doesn't have to be called under
> > > n->list_lock here because it is a new slab?
> > >
> > > That means the code can be something like:
> > >
> > > /* allocate one object from slab */
> > > object = slab->freelist;
> > > slab->freelist = get_freepointer(s, object);
> > > slab->inuse = 1;
> > >
> > > /* Leak slab if debug checks fails */
> > > if (!alloc_debug_processing())
> > >         return NULL;
> > >
> > > /* add slab to per-node partial list */
> > > if (allow_spin) {
> > >         spin_lock_irqsave();
> > > } else if (!spin_trylock_irqsave()) {
> > >         slab->frozen = 1;
> > >         defer_deactivate_slab();
> > > }
> > 
> > No. That doesn't work. I implemented it that way at first,
> > before reverting to spin_trylock_irqsave().
> > The problem is alloc_debug_processing() will likely succeed
> > and undoing it is pretty complex.
> > So it's better to do "!allow_spin && !spin_trylock_irqsave()"
> > before the expensive and hard-to-undo alloc_debug_processing().
> 
> Gotcha, that makes sense!
> 
> > > > +             /* Unlucky, discard newly allocated slab */
> > > > +             slab->frozen = 1;
> > > > +             defer_deactivate_slab(slab, NULL);
> > > > +             return NULL;
> > > > +     }
> > > >
> > > >       object = slab->freelist;
> > > >       slab->freelist = get_freepointer(s, object);
> > > >       slab->inuse = 1;
> > > >
> > > > -     if (!alloc_debug_processing(s, slab, object, orig_size))
> > > > +     if (!alloc_debug_processing(s, slab, object, orig_size)) {
> > > >               /*
> > > >                * It's not really expected that this would fail on a
> > > >                * freshly allocated slab, but a concurrent memory
> > > >                * corruption in theory could cause that.
> > > > +              * Leak memory of allocated slab.
> > > >                */
> > > > +             if (!allow_spin)
> > > > +                     spin_unlock_irqrestore(&n->list_lock, flags);
> > > >               return NULL;
> > > > +     }
> > > >
> > > > -     spin_lock_irqsave(&n->list_lock, flags);
> > > > +     if (allow_spin)
> > > > +             spin_lock_irqsave(&n->list_lock, flags);
> > > >
> > > >       if (slab->inuse == slab->objects)
> > > >               add_full(s, n, slab);
> > > > + * #2 is possible in both with a twist that irqsave is replaced with rt_spinlock:
> > > > + * kmalloc() -> ___slab_alloc() -> rt_spin_lock(kmem_cache_A) ->
> > > > + *    tracepoint/kprobe -> bpf -> kmalloc_nolock() -> rt_spin_lock(kmem_cache_B)
> > > > + *
> > > > + * local_lock_is_locked() prevents the case kmem_cache_A == kmem_cache_B
> > > > + */
> > > > +#if defined(CONFIG_PREEMPT_RT) || !defined(CONFIG_LOCKDEP)
> > > > +#define local_lock_cpu_slab(s, flags)        \
> > > > +     local_lock_irqsave(&(s)->cpu_slab->lock, flags)
> > > > +#else
> > > > +#define local_lock_cpu_slab(s, flags)        \
> > > > +     lockdep_assert(local_trylock_irqsave(&(s)->cpu_slab->lock, flags))
> > > > +#endif
> > > > +
> > > > +#define local_unlock_cpu_slab(s, flags)      \
> > > > +     local_unlock_irqrestore(&(s)->cpu_slab->lock, flags)
> > > > +
> > > >  #ifdef CONFIG_SLUB_CPU_PARTIAL
> > > >  static void __put_partials(struct kmem_cache *s, struct slab *partial_slab)
> > > >  {
> > > > @@ -3732,9 +3808,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> > > >       if (unlikely(!node_match(slab, node))) {
> > > >               /*
> > > >                * same as above but node_match() being false already
> > > > -              * implies node != NUMA_NO_NODE
> > > > +              * implies node != NUMA_NO_NODE.
> > > > +              * Reentrant slub cannot take locks necessary to
> > > > +              * deactivate_slab, hence ignore node preference.
> > >
> > > Now that we have defer_deactivate_slab(), we need to either update the
> > > code or the comment:
> > >
> > > 1. Deactivate slabs when node / pfmemalloc mismatches
> > > or 2. Update the comment to explain why it's still undesirable
> > 
> > Well, defer_deactivate_slab() is a heavy hammer.
> > In !SLUB_TINY it pretty much never happens.
> > 
> > This bit:
> > 
> > retry_load_slab:
> > 
> >         local_lock_cpu_slab(s, flags);
> >         if (unlikely(c->slab)) {
> > 
> > is very rare. I couldn't trigger it at all in my stress test.
> > 
> > But in this hunk the node mismatch is not rare, so ignoring node preference
> > for kmalloc_nolock() is a much better trade off.

But users would have requested that specific node instead of
NUMA_NO_NODE because (at least) they think it's worth it
(e.g., allocating kernel data structures tied to a specific node).

I don't understand why kmalloc()/kmem_cache_alloc() try harder
(by deactivating the cpu slab) to respect the node parameter,
but kmalloc_nolock() does not.
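
To illustrate the concern with a made-up caller (not from the patch):

    /* hypothetical: allocate a per-node structure on a !SLUB_TINY kernel */
    struct foo *f = kmalloc_nolock(sizeof(*f), __GFP_ZERO, cpu_to_node(cpu));

    /*
     * Even when cpu_to_node(cpu) has plenty of free slabs, the object can
     * come from whatever node the current cpu slab belongs to, because with
     * !allow_spin ___slab_alloc() falls back to NUMA_NO_NODE instead of
     * deactivating the cpu slab (see the hunk quoted below).
     */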

> > I'll add a comment that defer_deactivate_slab() is undesired here.
> 
> Wait, does that mean kmalloc_nolock() has a `node` parameter that is always
> ignored?

Ah, it's not _always_ ignored. At least it tries to grab/allocate slabs
from the specified node in the slowpath. But still, users can't reliably
get objects from the specified node, even if there's no memory shortage
on that node?

> Why not use NUMA_NO_NODE then instead of adding a special case
> for !allow_spin?

> > > > +              * kmalloc_nolock() doesn't allow __GFP_THISNODE.
> > > >                */
> > > > -             if (!node_isset(node, slab_nodes)) {
> > > > +             if (!node_isset(node, slab_nodes) ||
> > > > +                 !allow_spin) {
> > > >                       node = NUMA_NO_NODE;
> > > >               } else {
> > > >                       stat(s, ALLOC_NODE_MISMATCH);
> > >
> > > > @@ -4572,6 +4769,98 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
> > > >       discard_slab(s, slab);
> > > >  }
> > > >
> > > > +/*
> > > > + * In PREEMPT_RT irq_work runs in per-cpu kthread, so it's safe
> > > > + * to take sleeping spin_locks from __slab_free() and deactivate_slab().
> > > > + * In !PREEMPT_RT irq_work will run after local_unlock_irqrestore().
> > > > + */
> > > > +static void free_deferred_objects(struct irq_work *work)
> > > > +{
> > > > +     struct defer_free *df = container_of(work, struct defer_free, work);
> > > > +     struct llist_head *objs = &df->objects;
> > > > +     struct llist_head *slabs = &df->slabs;
> > > > +     struct llist_node *llnode, *pos, *t;
> > > > +
> > > > +     if (llist_empty(objs) && llist_empty(slabs))
> > > > +             return;
> > > > +
> > > > +     llnode = llist_del_all(objs);
> > > > +     llist_for_each_safe(pos, t, llnode) {
> > > > +             struct kmem_cache *s;
> > > > +             struct slab *slab;
> > > > +             void *x = pos;
> > > > +
> > > > +             slab = virt_to_slab(x);
> > > > +             s = slab->slab_cache;
> > > > +
> > > > +             /*
> > > > +              * We used freepointer in 'x' to link 'x' into df->objects.
> > > > +              * Clear it to NULL to avoid false positive detection
> > > > +              * of "Freepointer corruption".
> > > > +              */
> > > > +             *(void **)x = NULL;
> > > > +
> > > > +             /* Point 'x' back to the beginning of allocated object */
> > > > +             x -= s->offset;
> > > > +             /*
> > > > +              * memcg, kasan_slab_pre are already done for 'x'.
> > > > +              * The only thing left is kasan_poison.
> > > > +              */
> > > > +             kasan_slab_free(s, x, false, false, true);
> > > > +             __slab_free(s, slab, x, x, 1, _THIS_IP_);
> > > > +     }
> > > > +
> > > > +     llnode = llist_del_all(slabs);
> > > > +     llist_for_each_safe(pos, t, llnode) {
> > > > +             struct slab *slab = container_of(pos, struct slab, llnode);
> > > > +
> > > > +#ifdef CONFIG_SLUB_TINY
> > > > +             discard_slab(slab->slab_cache, slab);
> > >
> > > ...and with my comment on alloc_single_from_new_slab(),
> > > the slab may not be empty anymore?
> > 
> > Exactly.
> > That's another problem with your suggestion in alloc_single_from_new_slab().
> > That's why I did it as:
> > if (!allow_spin && !spin_trylock_irqsave(...)
> > 
> > and I still believe it's the right call.
> 
> Yeah, I think it's fine.
> 
> > The slab is empty here, so discard_slab() is appropriate.
> >
> > > > +#else
> > > > +             deactivate_slab(slab->slab_cache, slab, slab->flush_freelist);
> > > > +#endif
> > > > +     }
> > > > +}
> > > > @@ -4610,10 +4901,30 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
> > > >       barrier();
> > > >
> > > >       if (unlikely(slab != c->slab)) {
> > > > -             __slab_free(s, slab, head, tail, cnt, addr);
> > > > +             if (unlikely(!allow_spin)) {
> > > > +                     /*
> > > > +                      * __slab_free() can locklessly cmpxchg16 into a slab,
> > > > +                      * but then it might need to take spin_lock or local_lock
> > > > +                      * in put_cpu_partial() for further processing.
> > > > +                      * Avoid the complexity and simply add to a deferred list.
> > > > +                      */
> > > > +                     defer_free(s, head);
> > > > +             } else {
> > > > +                     __slab_free(s, slab, head, tail, cnt, addr);
> > > > +             }
> > > >               return;
> > > >       }
> > > >
> > > > +     if (unlikely(!allow_spin)) {
> > > > +             if ((in_nmi() || !USE_LOCKLESS_FAST_PATH()) &&
> > > > +                 local_lock_is_locked(&s->cpu_slab->lock)) {
> > > > +                     defer_free(s, head);
> > > > +                     return;
> > > > +             }
> > > > +             cnt = 1; /* restore cnt. kfree_nolock() frees one object at a time */
> > > > +             kasan_slab_free(s, head, false, false, /* skip quarantine */true);
> > > > +     }
> > >
> > > I'm not sure what prevents the following from happening:
> > >
> > > 1. slab == c->slab && !allow_spin -> call kasan_slab_free()
> > > 2. preempted by something and resume
> > > 3. after acquiring local_lock, slab != c->slab, release local_lock, goto redo
> > > 4. !allow_spin, so defer_free() will call kasan_slab_free() again later
> > 
> > Yes, it's possible and it's OK.
> > kasan_slab_free() with no_quarantine == true
> > is only poison_slab_object() and one can do it many times.
> >
> > > Perhaps kasan_slab_free() should be called before do_slab_free(),
> > > just like in the normal free path, and not called again in deferred
> > > freeing (then you may need to disable KASAN while accessing the deferred
> > > list)?
> > 
> > I tried that too and didn't like it.
> > After kasan_slab_free() the object cannot be put on the llist easily.
> > One needs to do a kasan_reset_tag() dance which uglifies the code a lot.
> > Double kasan poison in a rare case is fine. There is no harm.
> 
> Okay, but we're now depending on kasan_slab_free() being safe to be
> called multiple times on the same object. That would be better
> documented in kasan_slab_free().
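
Concretely, maybe something like this above kasan_slab_free() (wording is
just a suggestion):

    /*
     * May be called more than once on the same object when the caller
     * cannot tell whether a deferred free will poison it again, e.g.
     * kfree_nolock().  With quarantine skipped this only (re)poisons
     * the object, which is safe to repeat.
     */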
> 
> -- 
> Cheers,
> Harry / Hyeonggon

-- 
Cheers,
Harry / Hyeonggon



