Re: [PATCH v4 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().

On Thu, Jul 17, 2025 at 07:16:46PM -0700, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@xxxxxxxxxx>
>
> kmalloc_nolock() relies on ability of local_trylock_t to detect
> the situation when per-cpu kmem_cache is locked.

I think kmalloc_nolock() should be named kmalloc_node_nolock(), because
it takes a `node` parameter?

# Don't specify NUMA node	# Specify NUMA node
kmalloc(size, gfp)		kmalloc_nolock(size, gfp)
kmalloc_node(size, gfp, node)	kmalloc_node_nolock(size, gfp, node)

...just like kmalloc() and kmalloc_node()?
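
To make the parallel concrete, here is a naming sketch only (not what the
patch currently exposes); the gfp-only variant would be a trivial wrapper,
mirroring how kmalloc() relates to kmalloc_node():

void *kmalloc_node_nolock(size_t size, gfp_t gfp_flags, int node);

static __always_inline void *kmalloc_nolock(size_t size, gfp_t gfp_flags)
{
	/* no node preference, like kmalloc() vs kmalloc_node() */
	return kmalloc_node_nolock(size, gfp_flags, NUMA_NO_NODE);
}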

> In !PREEMPT_RT local_(try)lock_irqsave(&s->cpu_slab->lock, flags)
> disables IRQs and marks s->cpu_slab->lock as acquired.
> local_lock_is_locked(&s->cpu_slab->lock) returns true when
> slab is in the middle of manipulating per-cpu cache
> of that specific kmem_cache.
> 
> kmalloc_nolock() can be called from any context and can re-enter
> into ___slab_alloc():
>   kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> NMI -> bpf ->
>     kmalloc_nolock() -> ___slab_alloc(cache_B)
> or
>   kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> tracepoint/kprobe -> bpf ->
>     kmalloc_nolock() -> ___slab_alloc(cache_B)

> Hence the caller of ___slab_alloc() checks if &s->cpu_slab->lock
> can be acquired without a deadlock before invoking the function.
> If that specific per-cpu kmem_cache is busy the kmalloc_nolock()
> retries in a different kmalloc bucket. The second attempt will
> likely succeed, since this cpu locked different kmem_cache.
> 
> Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
> per-cpu rt_spin_lock is locked by current _task_. In this case
> re-entrance into the same kmalloc bucket is unsafe, and
> kmalloc_nolock() tries a different bucket that is most likely
> not locked by the current task. Though it may be locked by a
> different task it's safe to rt_spin_lock() and sleep on it.
>
> Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL
> immediately if called from hard irq or NMI in PREEMPT_RT.

A question; I was confused for a while, thinking "if it can't be called
from NMI or hard irq on PREEMPT_RT, why can't it just spin?"

And I guess it's because even in process context, when kmalloc_nolock()
is called from bpf, the calling task may already be holding the local lock,
so spinning is not allowed. Is that correct?
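
To spell out my understanding, the process-context case that forbids
spinning would be:

kmalloc() -> ___slab_alloc(cache_A) -> rt_spin_lock(cache_A) ->
  tracepoint/kprobe -> bpf -> kmalloc_nolock() -> ___slab_alloc(cache_A)

where spinning on cache_A's cpu_slab->lock could never succeed because
the owner is the current task itself, hence the local_lock_is_locked()
check and the retry in a different bucket.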

> kfree_nolock() defers freeing to irq_work when local_lock_is_locked()
> and (in_nmi() or in PREEMPT_RT).
> 
> SLUB_TINY config doesn't use local_lock_is_locked() and relies on
> spin_trylock_irqsave(&n->list_lock) to allocate,
> while kfree_nolock() always defers to irq_work.
> 
> Note, kfree_nolock() must be called _only_ for objects allocated
> with kmalloc_nolock(). Debug checks (like kmemleak and kfence)
> were skipped on allocation, hence obj = kmalloc(); kfree_nolock(obj);
> will miss kmemleak/kfence book keeping and will cause false positives.
> large_kmalloc is not supported by either kmalloc_nolock()
> or kfree_nolock().
> 
> Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx>
> ---
>  include/linux/kasan.h |  13 +-
>  include/linux/slab.h  |   4 +
>  mm/Kconfig            |   1 +
>  mm/kasan/common.c     |   5 +-
>  mm/slab.h             |   6 +
>  mm/slab_common.c      |   3 +
>  mm/slub.c             | 466 +++++++++++++++++++++++++++++++++++++-----
>  7 files changed, 445 insertions(+), 53 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 54444bce218e..7de6da4ee46d 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1982,6 +1983,7 @@ static inline void init_slab_obj_exts(struct slab *slab)
>  int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
>  		        gfp_t gfp, bool new_slab)
>  {
> +	bool allow_spin = gfpflags_allow_spinning(gfp);
>  	unsigned int objects = objs_per_slab(s, slab);
>  	unsigned long new_exts;
>  	unsigned long old_exts;
> @@ -1990,8 +1992,14 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
>  	gfp &= ~OBJCGS_CLEAR_MASK;
>  	/* Prevent recursive extension vector allocation */
>  	gfp |= __GFP_NO_OBJ_EXT;
> -	vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
> -			   slab_nid(slab));
> +	if (unlikely(!allow_spin)) {
> +		size_t sz = objects * sizeof(struct slabobj_ext);
> +
> +		vec = kmalloc_nolock(sz, __GFP_ZERO, slab_nid(slab));

In free_slab_obj_exts(), how do you know whether the slabobj_ext vector
was allocated via kmalloc_nolock() or kcalloc_node()?

I was going to say "add a new flag to enum objext_flags",
but all three lower bits of the slab->obj_exts pointer are already in use? oh...

Maybe we need a trick to free up one more flag bit,
like always aligning the allocation to 16 bytes?

In practice that should not lead to an increase in memory consumption
anyway, because most of the kmalloc-* sizes are already at least
16 bytes aligned.
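
For example, something along these lines (names made up, and assuming the
vector can be forced to 16-byte alignment so that bit 3 of slab->obj_exts
becomes available):

#define OBJEXTS_NOSPIN_ALLOC	(1UL << 3)	/* hypothetical new flag */

	/* and in free_slab_obj_exts(), roughly: */
	if (slab->obj_exts & OBJEXTS_NOSPIN_ALLOC)
		kfree_nolock(obj_exts);
	else
		kfree(obj_exts);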

> +	} else {
> +		vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
> +				   slab_nid(slab));
> +	}
>  	if (!vec) {
>  		/* Mark vectors which failed to allocate */
>  		if (new_slab)
> +static void defer_deactivate_slab(struct slab *slab, void *flush_freelist);
> +
>  /*
>   * Called only for kmem_cache_debug() caches to allocate from a freshly
>   * allocated slab. Allocate a single object instead of whole freelist
>   * and put the slab to the partial (or full) list.
>   */
> -static void *alloc_single_from_new_slab(struct kmem_cache *s,
> -					struct slab *slab, int orig_size)
> +static void *alloc_single_from_new_slab(struct kmem_cache *s, struct slab *slab,
> +					int orig_size, gfp_t gfpflags)
>  {
> +	bool allow_spin = gfpflags_allow_spinning(gfpflags);
>  	int nid = slab_nid(slab);
>  	struct kmem_cache_node *n = get_node(s, nid);
>  	unsigned long flags;
>  	void *object;
>  
> +	if (!allow_spin && !spin_trylock_irqsave(&n->list_lock, flags)) {

I think alloc_debug_processing() doesn't have to be called under
n->list_lock here, because it is a new slab (not yet on any per-node
list, so nothing else can access it)?

That means the code can be something like:

/* allocate one object from the new slab */
object = slab->freelist;
slab->freelist = get_freepointer(s, object);
slab->inuse = 1;

/* Leak the slab if debug checks fail */
if (!alloc_debug_processing(s, slab, object, orig_size))
	return NULL;

/* add the slab to the per-node partial (or full) list */
if (allow_spin) {
	spin_lock_irqsave(&n->list_lock, flags);
} else if (!spin_trylock_irqsave(&n->list_lock, flags)) {
	slab->frozen = 1;
	defer_deactivate_slab(slab, NULL);
}

> +		/* Unlucky, discard newly allocated slab */
> +		slab->frozen = 1;
> +		defer_deactivate_slab(slab, NULL);
> +		return NULL;
> +	}
>  
>  	object = slab->freelist;
>  	slab->freelist = get_freepointer(s, object);
>  	slab->inuse = 1;
>  
> -	if (!alloc_debug_processing(s, slab, object, orig_size))
> +	if (!alloc_debug_processing(s, slab, object, orig_size)) {
>  		/*
>  		 * It's not really expected that this would fail on a
>  		 * freshly allocated slab, but a concurrent memory
>  		 * corruption in theory could cause that.
> +		 * Leak memory of allocated slab.
>  		 */
> +		if (!allow_spin)
> +			spin_unlock_irqrestore(&n->list_lock, flags);
>  		return NULL;
> +	}
>  
> -	spin_lock_irqsave(&n->list_lock, flags);
> +	if (allow_spin)
> +		spin_lock_irqsave(&n->list_lock, flags);
>  
>  	if (slab->inuse == slab->objects)
>  		add_full(s, n, slab);
> @@ -3164,6 +3201,44 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
>  	}
>  }
>  
> +/*
> + * ___slab_alloc()'s caller is supposed to check if kmem_cache::kmem_cache_cpu::lock
> + * can be acquired without a deadlock before invoking the function.
> + *
> + * Without LOCKDEP we trust the code to be correct. kmalloc_nolock() is
> + * using local_lock_is_locked() properly before calling local_lock_cpu_slab(),
> + * and kmalloc() is not used in an unsupported context.
> + *
> + * With LOCKDEP, on PREEMPT_RT lockdep does its checking in local_lock_irqsave().
> + * On !PREEMPT_RT we use trylock to avoid false positives in NMI, but
> + * lockdep_assert() will catch a bug in case:
> + * #1
> + * kmalloc() -> ___slab_alloc() -> irqsave -> NMI -> bpf -> kmalloc_nolock()
> + * or
> + * #2
> + * kmalloc() -> ___slab_alloc() -> irqsave -> tracepoint/kprobe -> bpf -> kmalloc_nolock()
> + *
> + * On PREEMPT_RT an invocation is not possible from IRQ-off or preempt
> + * disabled context. The lock will always be acquired and if needed it
> + * block and sleep until the lock is available.
> + * #1 is possible in !PREEMP_RT only.

s/PREEMP_RT/PREEMPT_RT/

> + * #2 is possible in both with a twist that irqsave is replaced with rt_spinlock:
> + * kmalloc() -> ___slab_alloc() -> rt_spin_lock(kmem_cache_A) ->
> + *    tracepoint/kprobe -> bpf -> kmalloc_nolock() -> rt_spin_lock(kmem_cache_B)
> + *
> + * local_lock_is_locked() prevents the case kmem_cache_A == kmem_cache_B
> + */
> +#if defined(CONFIG_PREEMPT_RT) || !defined(CONFIG_LOCKDEP)
> +#define local_lock_cpu_slab(s, flags)	\
> +	local_lock_irqsave(&(s)->cpu_slab->lock, flags)
> +#else
> +#define local_lock_cpu_slab(s, flags)	\
> +	lockdep_assert(local_trylock_irqsave(&(s)->cpu_slab->lock, flags))
> +#endif
> +
> +#define local_unlock_cpu_slab(s, flags)	\
> +	local_unlock_irqrestore(&(s)->cpu_slab->lock, flags)
> +
>  #ifdef CONFIG_SLUB_CPU_PARTIAL
>  static void __put_partials(struct kmem_cache *s, struct slab *partial_slab)
>  {
> @@ -3732,9 +3808,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  	if (unlikely(!node_match(slab, node))) {
>  		/*
>  		 * same as above but node_match() being false already
> -		 * implies node != NUMA_NO_NODE
> +		 * implies node != NUMA_NO_NODE.
> +		 * Reentrant slub cannot take locks necessary to
> +		 * deactivate_slab, hence ignore node preference.

Now that we have defer_deactivate_slab(), we need to either update the
code or the comment?

1. Deactivate the slab (deferred) when the node / pfmemalloc constraint
   doesn't match, or
2. Update the comment to explain why deactivation is still undesirable
   here.

> +		 * kmalloc_nolock() doesn't allow __GFP_THISNODE.
>  		 */
> -		if (!node_isset(node, slab_nodes)) {
> +		if (!node_isset(node, slab_nodes) ||
> +		    !allow_spin) {
>  			node = NUMA_NO_NODE;
>  		} else {
>  			stat(s, ALLOC_NODE_MISMATCH);

> @@ -4572,6 +4769,98 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>  	discard_slab(s, slab);
>  }
>  
> +/*
> + * In PREEMPT_RT irq_work runs in per-cpu kthread, so it's safe
> + * to take sleeping spin_locks from __slab_free() and deactivate_slab().
> + * In !PREEMPT_RT irq_work will run after local_unlock_irqrestore().
> + */
> +static void free_deferred_objects(struct irq_work *work)
> +{
> +	struct defer_free *df = container_of(work, struct defer_free, work);
> +	struct llist_head *objs = &df->objects;
> +	struct llist_head *slabs = &df->slabs;
> +	struct llist_node *llnode, *pos, *t;
> +
> +	if (llist_empty(objs) && llist_empty(slabs))
> +		return;
> +
> +	llnode = llist_del_all(objs);
> +	llist_for_each_safe(pos, t, llnode) {
> +		struct kmem_cache *s;
> +		struct slab *slab;
> +		void *x = pos;
> +
> +		slab = virt_to_slab(x);
> +		s = slab->slab_cache;
> +
> +		/*
> +		 * We used freepointer in 'x' to link 'x' into df->objects.
> +		 * Clear it to NULL to avoid false positive detection
> +		 * of "Freepointer corruption".
> +		 */
> +		*(void **)x = NULL;
> +
> +		/* Point 'x' back to the beginning of allocated object */
> +		x -= s->offset;
> +		/*
> +		 * memcg, kasan_slab_pre are already done for 'x'.
> +		 * The only thing left is kasan_poison.
> +		 */
> +		kasan_slab_free(s, x, false, false, true);
> +		__slab_free(s, slab, x, x, 1, _THIS_IP_);
> +	}
> +
> +	llnode = llist_del_all(slabs);
> +	llist_for_each_safe(pos, t, llnode) {
> +		struct slab *slab = container_of(pos, struct slab, llnode);
> +
> +#ifdef CONFIG_SLUB_TINY
> +		discard_slab(slab->slab_cache, slab);

...and with my comment on alloc_single_from_new_slab() applied,
the slab may not be empty anymore?

> +#else
> +		deactivate_slab(slab->slab_cache, slab, slab->flush_freelist);
> +#endif
> +	}
> +}
> @@ -4610,10 +4901,30 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
>  	barrier();
>  
>  	if (unlikely(slab != c->slab)) {
> -		__slab_free(s, slab, head, tail, cnt, addr);
> +		if (unlikely(!allow_spin)) {
> +			/*
> +			 * __slab_free() can locklessly cmpxchg16 into a slab,
> +			 * but then it might need to take spin_lock or local_lock
> +			 * in put_cpu_partial() for further processing.
> +			 * Avoid the complexity and simply add to a deferred list.
> +			 */
> +			defer_free(s, head);
> +		} else {
> +			__slab_free(s, slab, head, tail, cnt, addr);
> +		}
>  		return;
>  	}
>  
> +	if (unlikely(!allow_spin)) {
> +		if ((in_nmi() || !USE_LOCKLESS_FAST_PATH()) &&
> +		    local_lock_is_locked(&s->cpu_slab->lock)) {
> +			defer_free(s, head);
> +			return;
> +		}
> +		cnt = 1; /* restore cnt. kfree_nolock() frees one object at a time */
> +		kasan_slab_free(s, head, false, false, /* skip quarantine */true);
> +	}

I'm not sure what prevents the following from happening:

1. slab == c->slab && !allow_spin -> call kasan_slab_free()
2. preempted by something and resumed
3. after acquiring the local_lock, slab != c->slab, release the local_lock, goto redo
4. !allow_spin, so defer_free() will call kasan_slab_free() again later

Perhaps kasan_slab_free() should be called before do_slab_free(),
just like in the normal free path, and not called again in the deferred
freeing path (then you may need to disable KASAN while accessing the
deferred list)?
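
i.e. roughly something like this (only a sketch of the ordering I mean,
not of the patch as posted; how !allow_spin is signalled to
do_slab_free() is left out):

	/* kfree_nolock(): poison exactly once, up front */
	kasan_slab_free(s, x, false, false, /* skip quarantine */ true);
	do_slab_free(s, slab, x, x, 1, _RET_IP_);

	/*
	 * ... and free_deferred_objects() / defer_free() would not call
	 * kasan_slab_free() again; linking the object into df->objects
	 * would then need kasan_disable_current()/kasan_enable_current()
	 * around it, since the freepointer area is already poisoned.
	 */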

-- 
Cheers,
Harry / Hyeonggon



