On 7/16/25 04:29, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@xxxxxxxxxx>
> 
> kmalloc_nolock() relies on ability of local_lock to detect the situation
                                        ^ local_trylock_t perhaps?
> when it's locked.
> In !PREEMPT_RT local_lock_is_locked() is true only when NMI happened in
> irq saved region that protects _that specific_ per-cpu kmem_cache_cpu.

It can also be true when you call it from the bpf hook, no?

> In that case retry the operation in a different kmalloc bucket.
> The second attempt will likely succeed, since this cpu locked
> different kmem_cache_cpu.
> 
> Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
> per-cpu rt_spin_lock is locked by current task. In this case re-entrance
> into the same kmalloc bucket is unsafe, and kmalloc_nolock() tries
> a different bucket that is most likely is not locked by the current
> task. Though it may be locked by a different task it's safe to
> rt_spin_lock() on it.
> 
> Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL
> immediately if called from hard irq or NMI in PREEMPT_RT.
> 
> kfree_nolock() defers freeing to irq_work when local_lock_is_locked()
> and in_nmi() or in PREEMPT_RT.
> 
> SLUB_TINY config doesn't use local_lock_is_locked() and relies on
> spin_trylock_irqsave(&n->list_lock) to allocate while kfree_nolock()
> always defers to irq_work.
> 
> Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx>

Haven't seen an obvious bug so far, but will ponder it some more.
Meanwhile some nits and maybe one bit more serious concern.

> +static inline void local_lock_cpu_slab(struct kmem_cache *s, unsigned long *flags)
> +{
> +	/*
> +	 * ___slab_alloc()'s caller is supposed to check if kmem_cache::kmem_cache_cpu::lock
> +	 * can be acquired without a deadlock before invoking the function.
> +	 *
> +	 * On PREEMPT_RT an invocation is not possible from IRQ-off or preempt
> +	 * disabled context. The lock will always be acquired and if needed it
> +	 * block and sleep until the lock is available.
> +	 *
> +	 * On !PREEMPT_RT allocations from any context but NMI are safe. The lock
> +	 * is always acquired with disabled interrupts meaning it is always
> +	 * possible to it.
> +	 * In NMI context it is needed to check if the lock is acquired. If it is not,

This also could mention the bpf instrumentation context?

> +	 * it is safe to acquire it. The trylock semantic is used to tell lockdep
> +	 * that we don't spin. The BUG_ON() will not trigger if it is safe to acquire
> +	 * the lock.
> +	 *
> +	 */
> +	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> +		BUG_ON(!local_trylock_irqsave(&s->cpu_slab->lock, *flags));

Linus might still spot the BUG_ON() and complain, lockdep_assert() would be
safer maybe :) Or just use local_lock_irqsave() with !CONFIG_LOCKDEP as well.

Nit: maybe this could be a #define to avoid the unusual need for "&flags"
instead of "flags" when calling.
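Completely untested, but roughly what I have in mind - a macro so callers
can pass plain "flags". I'm assuming the PREEMPT_RT branch in the snipped
part of the hunk is a plain local_lock_irqsave(); whether the !RT side then
stays a BUG_ON() or becomes something lockdep-based is a separate question:

#define local_lock_cpu_slab(s, flags)					\
	do {								\
		if (IS_ENABLED(CONFIG_PREEMPT_RT))			\
			/* RT side assumed from the snipped part of the hunk */ \
			local_lock_irqsave(&(s)->cpu_slab->lock, flags); \
		else							\
			BUG_ON(!local_trylock_irqsave(&(s)->cpu_slab->lock, flags)); \
	} while (0)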
> +/**
> + * kmalloc_nolock - Allocate an object of given size from any context.
> + * @size: size to allocate
> + * @gfp_flags: GFP flags. Only __GFP_ACCOUNT, __GFP_ZERO allowed.
> + * @node: node number of the target node.
> + *
> + * Return: pointer to the new object or NULL in case of error.
> + * NULL does not mean EBUSY or EAGAIN. It means ENOMEM.
> + * There is no reason to call it again and expect !NULL.
> + */
> +void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
> +{
> +	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_NOMEMALLOC | gfp_flags;
> +	struct kmem_cache *s;
> +	bool can_retry = true;
> +	void *ret = ERR_PTR(-EBUSY);
> +
> +	VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO));
> +
> +	if (unlikely(!size))
> +		return ZERO_SIZE_PTR;
> +
> +	if (!USE_LOCKLESS_FAST_PATH() && (in_nmi() || in_hardirq()))

Nit: maybe just do explicit PREEMPT_RT checks when the code isn't about
lockless fastpaths?

> +		/* kmalloc_nolock() in PREEMPT_RT is not supported from irq */
> +		return NULL;
> +retry:
> +	if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
> +		return NULL;
> +	s = kmalloc_slab(size, NULL, alloc_gfp, _RET_IP_);
> +
> +	if (!(s->flags & __CMPXCHG_DOUBLE) && !kmem_cache_debug(s))
> +		/*
> +		 * kmalloc_nolock() is not supported on architectures that
> +		 * don't implement cmpxchg16b, but debug caches don't use
> +		 * per-cpu slab and per-cpu partial slabs. They rely on
> +		 * kmem_cache_node->list_lock, so kmalloc_nolock() can
> +		 * attempt to allocate from debug caches by
> +		 * spin_trylock_irqsave(&n->list_lock, ...)
> +		 */
> +		return NULL;
> +
> +	/*
> +	 * Do not call slab_alloc_node(), since trylock mode isn't
> +	 * compatible with slab_pre_alloc_hook/should_failslab and
> +	 * kfence_alloc. Hence call __slab_alloc_node() (at most twice)
> +	 * and slab_post_alloc_hook() directly.
> +	 *
> +	 * In !PREEMPT_RT ___slab_alloc() manipulates (freelist,tid) pair
> +	 * in irq saved region. It assumes that the same cpu will not
> +	 * __update_cpu_freelist_fast() into the same (freelist,tid) pair.
> +	 * Therefore use in_nmi() to check whether particular bucket is in
> +	 * irq protected section.
> +	 *
> +	 * If in_nmi() && local_lock_is_locked(s->cpu_slab) then it means that
> +	 * this cpu was interrupted somewhere inside ___slab_alloc() after
> +	 * it did local_lock_irqsave(&s->cpu_slab->lock, flags).
> +	 * In this case fast path with __update_cpu_freelist_fast() is not safe.
> +	 */
> +#ifndef CONFIG_SLUB_TINY
> +	if (!in_nmi() || !local_lock_is_locked(&s->cpu_slab->lock))
> +#endif
> +		ret = __slab_alloc_node(s, alloc_gfp, node, _RET_IP_, size);

Nit: use IS_ENABLED(CONFIG_SLUB_TINY) to make this look better?

> +static void defer_deactivate_slab(struct slab *slab)
> +{
> +	struct defer_free *df = this_cpu_ptr(&defer_free_objects);
> +
> +	if (llist_add(&slab->llnode, &df->slabs))
> +		irq_work_queue(&df->work);
> +}

Nit: for more consistency this could take the freelist argument and assign
it here, and not in the caller.
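I.e. roughly this (untested; I'm assuming the struct slab field that the
deferred irq_work handler reads is the one the caller currently assigns -
I'm calling it flush_freelist here, adjust to whatever the patch actually
names it):

static void defer_deactivate_slab(struct slab *slab, void *flush_freelist)
{
	struct defer_free *df = this_cpu_ptr(&defer_free_objects);

	/* stash the freelist for the irq_work handler here, not in the caller */
	slab->flush_freelist = flush_freelist;	/* field name assumed */
	if (llist_add(&slab->llnode, &df->slabs))
		irq_work_queue(&df->work);
}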
> +
> +void defer_free_barrier(void)
> +{
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu)
> +		irq_work_sync(&per_cpu_ptr(&defer_free_objects, cpu)->work);
> +}
> +
>  #ifndef CONFIG_SLUB_TINY
>  /*
>   * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
> @@ -4575,6 +4857,8 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
>  				struct slab *slab, void *head, void *tail,
>  				int cnt, unsigned long addr)
>  {
> +	/* cnt == 0 signals that it's called from kfree_nolock() */
> +	bool allow_spin = cnt;
>  	struct kmem_cache_cpu *c;
>  	unsigned long tid;
>  	void **freelist;
> @@ -4593,10 +4877,30 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
>  	barrier();
> 
>  	if (unlikely(slab != c->slab)) {
> -		__slab_free(s, slab, head, tail, cnt, addr);
> +		if (unlikely(!allow_spin)) {
> +			/*
> +			 * __slab_free() can locklessly cmpxchg16 into a slab,
> +			 * but then it might need to take spin_lock or local_lock
> +			 * in put_cpu_partial() for further processing.
> +			 * Avoid the complexity and simply add to a deferred list.
> +			 */
> +			defer_free(s, head);
> +		} else {
> +			__slab_free(s, slab, head, tail, cnt, addr);
> +		}
>  		return;
>  	}
> 
> +	if (unlikely(!allow_spin)) {
> +		if ((in_nmi() || !USE_LOCKLESS_FAST_PATH()) &&

Same nit about USE_LOCKLESS_FAST_PATH (a tiny illustration of the suggested
spelling for both places below, after the kfree() hunk).

> +		    local_lock_is_locked(&s->cpu_slab->lock)) {
> +			defer_free(s, head);
> +			return;
> +		}
> +		cnt = 1; /* restore cnt. kfree_nolock() frees one object at a time */
> +		kasan_slab_free(s, head, false, false, /* skip quarantine */true);
> +	}
> +
>  	if (USE_LOCKLESS_FAST_PATH()) {
>  		freelist = READ_ONCE(c->freelist);
> 
> @@ -4607,8 +4911,10 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
>  			goto redo;
>  		}
>  	} else {
> +		__maybe_unused long flags = 0;
> +
>  		/* Update the free list under the local lock */
> -		local_lock(&s->cpu_slab->lock);
> +		local_lock_cpu_slab(s, &flags);
>  		c = this_cpu_ptr(s->cpu_slab);
>  		if (unlikely(slab != c->slab)) {
>  			local_unlock(&s->cpu_slab->lock);
> @@ -4621,7 +4927,7 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
>  		c->freelist = head;
>  		c->tid = next_tid(tid);
> 
> -		local_unlock(&s->cpu_slab->lock);
> +		local_unlock_cpu_slab(s, &flags);
>  	}
>  	stat_add(s, FREE_FASTPATH, cnt);
>  }
> @@ -4844,6 +5150,62 @@ void kfree(const void *object)
>  }
>  EXPORT_SYMBOL(kfree);
> 
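To illustrate the USE_LOCKLESS_FAST_PATH nits above for both places - purely
a spelling change, semantically identical since USE_LOCKLESS_FAST_PATH() is
true exactly when !PREEMPT_RT; the RT spelling just reads clearer to me when
the check is about RT locking rules rather than about the fastpath:

	/* in kmalloc_nolock_noprof(): */
	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
		/* kmalloc_nolock() in PREEMPT_RT is not supported from irq */
		return NULL;

	/* and in do_slab_free(): */
	if ((in_nmi() || IS_ENABLED(CONFIG_PREEMPT_RT)) &&
	    local_lock_is_locked(&s->cpu_slab->lock)) {
		defer_free(s, head);
		return;
	}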
> +/*
> + * Can be called while holding raw_spinlock_t or from IRQ and NMI,
> + * but only for objects allocated by kmalloc_nolock(),
> + * since some debug checks (like kmemleak and kfence) were
> + * skipped on allocation. large_kmalloc is not supported either.
> + */
> +void kfree_nolock(const void *object)
> +{
> +	struct folio *folio;
> +	struct slab *slab;
> +	struct kmem_cache *s;
> +	void *x = (void *)object;
> +
> +	if (unlikely(ZERO_OR_NULL_PTR(object)))
> +		return;
> +
> +	folio = virt_to_folio(object);
> +	if (unlikely(!folio_test_slab(folio))) {
> +		WARN(1, "Buggy usage of kfree_nolock");
> +		return;
> +	}
> +
> +	slab = folio_slab(folio);
> +	s = slab->slab_cache;
> +
> +	memcg_slab_free_hook(s, slab, &x, 1);
> +	alloc_tagging_slab_free_hook(s, slab, &x, 1);
> +	/*
> +	 * Unlike slab_free() do NOT call the following:
> +	 * kmemleak_free_recursive(x, s->flags);
> +	 * debug_check_no_locks_freed(x, s->object_size);
> +	 * debug_check_no_obj_freed(x, s->object_size);
> +	 * __kcsan_check_access(x, s->object_size, ..);
> +	 * kfence_free(x);
> +	 * since they take spinlocks.
> +	 */

So here's the bigger concern. What if someone allocates with regular
kmalloc() so that the debugging stuff is performed as usual, and then tries
to use kfree_nolock() where we skip it? You might not be planning such
usage, but later someone might realize that only their freeing context is
limited, find out that kfree_nolock() exists and try to use it. Can we
document this strongly enough? Or even enforce it somehow? Or, when any of
these kinds of debugs above are enabled, play it safe and use defer_free()?
(a very rough sketch of that idea below, after the quoted hunk)

> +	kmsan_slab_free(s, x);
> +	/*
> +	 * If KASAN finds a kernel bug it will do kasan_report_invalid_free()
> +	 * which will call raw_spin_lock_irqsave() which is technically
> +	 * unsafe from NMI, but take chance and report kernel bug.
> +	 * The sequence of
> +	 * kasan_report_invalid_free() -> raw_spin_lock_irqsave() -> NMI
> +	 * -> kfree_nolock() -> kasan_report_invalid_free() on the same CPU
> +	 * is double buggy and deserves to deadlock.
> +	 */
> +	if (kasan_slab_pre_free(s, x))
> +		return;
> +#ifndef CONFIG_SLUB_TINY
> +	do_slab_free(s, slab, x, x, 0, _RET_IP_);
> +#else
> +	defer_free(s, x);
> +#endif
> +}
> +EXPORT_SYMBOL_GPL(kfree_nolock);
> +
>  static __always_inline __realloc_size(2) void *
>  __do_krealloc(const void *p, size_t new_size, gfp_t flags)
>  {
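For that last defer_free() idea, a very rough and untested sketch of what I
mean in kfree_nolock(), placed before the skipped debug hooks. The exact set
of conditions here is a guess and kmem_cache_debug() may be too coarse, it's
only meant to show the shape:

	/*
	 * If any of the debugging hooks we'd otherwise skip could be active
	 * for this cache/config, don't take the shortcut at all and defer
	 * the free to irq_work instead (conditions are a guess).
	 */
	if (IS_ENABLED(CONFIG_DEBUG_KMEMLEAK) ||
	    IS_ENABLED(CONFIG_DEBUG_OBJECTS_FREE) ||
	    kmem_cache_debug(s)) {
		defer_free(s, x);
		return;
	}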