On Wed, Jul 16, 2025 at 3:58 AM Vlastimil Babka <vbabka@xxxxxxx> wrote:
>
> On 7/16/25 04:29, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@xxxxxxxxxx>
> >
> > kmalloc_nolock() relies on ability of local_lock to detect the situation
>
>                                        ^ local_trylock_t perhaps?
>
> > when it's locked.
> > In !PREEMPT_RT local_lock_is_locked() is true only when NMI happened in
> > irq saved region that protects _that specific_ per-cpu kmem_cache_cpu.
>
> It can be also true when you call it from the bpf hook, no?

Correct. Technically one can disasm ___slab_alloc(), find some instruction
after pushfq;cli, add a kprobe there, attach a bpf prog to that kprobe, and
call kmalloc_nolock() (eventually, when the bpf infra switches to
kmalloc_nolock).
I wouldn't call it malicious, but if somebody does that they are looking
for trouble. Even syzbot doesn't do such things.

> > In that case retry the operation in a different kmalloc bucket.
> > The second attempt will likely succeed, since this cpu locked
> > a different kmem_cache_cpu.
> >
> > Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
> > per-cpu rt_spin_lock is locked by current task. In this case re-entrance
> > into the same kmalloc bucket is unsafe, and kmalloc_nolock() tries
> > a different bucket that most likely is not locked by the current
> > task. Though it may be locked by a different task it's safe to
> > rt_spin_lock() on it.
> >
> > Similar to alloc_pages_nolock(), kmalloc_nolock() returns NULL
> > immediately if called from hard irq or NMI in PREEMPT_RT.
> >
> > kfree_nolock() defers freeing to irq_work when local_lock_is_locked()
> > and in_nmi() or in PREEMPT_RT.
> >
> > SLUB_TINY config doesn't use local_lock_is_locked() and relies on
> > spin_trylock_irqsave(&n->list_lock) to allocate, while kfree_nolock()
> > always defers to irq_work.
> >
> > Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx>
>
> Haven't seen an obvious bug now but will ponder it some more. Meanwhile some
> nits and maybe one bit more serious concern.
>
> > +static inline void local_lock_cpu_slab(struct kmem_cache *s, unsigned long *flags)
> > +{
> > +        /*
> > +         * ___slab_alloc()'s caller is supposed to check if kmem_cache::kmem_cache_cpu::lock
> > +         * can be acquired without a deadlock before invoking the function.
> > +         *
> > +         * On PREEMPT_RT an invocation is not possible from IRQ-off or preempt
> > +         * disabled context. The lock will always be acquired and if needed it
> > +         * will block and sleep until the lock is available.
> > +         *
> > +         * On !PREEMPT_RT allocations from any context but NMI are safe. The lock
> > +         * is always acquired with disabled interrupts, meaning it is always
> > +         * possible to acquire it.
> > +         * In NMI context it is needed to check if the lock is acquired. If it is not,
>
> This also could mention the bpf instrumentation context?

Ok.

> > +         * it is safe to acquire it. The trylock semantic is used to tell lockdep
> > +         * that we don't spin. The BUG_ON() will not trigger if it is safe to acquire
> > +         * the lock.
> > +         *
> > +         */
> > +        if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> > +                BUG_ON(!local_trylock_irqsave(&s->cpu_slab->lock, *flags));
>
> Linus might still spot the BUG_ON() and complain, lockdep_assert() would be
> safer maybe :)
> Or just use local_lock_irqsave() with !CONFIG_LOCKDEP as well.

Fair enough. Let's save one branch in the critical path.

> Nit: maybe could be a #define to avoid the unusual need for "&flags" instead
> of "flags" when calling.

When "bool allow_spin" was there in Sebastian's version it definitely
looked cleaner as a proper function, but now the
if (!IS_ENABLED(CONFIG_PREEMPT_RT)) can become #ifdef CONFIG_PREEMPT_RT
and the comment will look normal (without ugly backslashes).
So yeah, I'll convert it to a macro.
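Roughly along these lines (untested sketch; the big comment stays above the
#if, and callers pass plain "flags" instead of "&flags"):

#if defined(CONFIG_PREEMPT_RT) || !defined(CONFIG_LOCKDEP)
#define local_lock_cpu_slab(s, flags) \
        local_lock_irqsave(&(s)->cpu_slab->lock, flags)
#else
/* trylock only to tell lockdep that we don't spin; it must always succeed */
#define local_lock_cpu_slab(s, flags) \
        lockdep_assert(local_trylock_irqsave(&(s)->cpu_slab->lock, flags))
#endif

#define local_unlock_cpu_slab(s, flags) \
        local_unlock_irqrestore(&(s)->cpu_slab->lock, flags)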
> > +/**
> > + * kmalloc_nolock - Allocate an object of given size from any context.
> > + * @size: size to allocate
> > + * @gfp_flags: GFP flags. Only __GFP_ACCOUNT, __GFP_ZERO allowed.
> > + * @node: node number of the target node.
> > + *
> > + * Return: pointer to the new object or NULL in case of error.
> > + * NULL does not mean EBUSY or EAGAIN. It means ENOMEM.
> > + * There is no reason to call it again and expect !NULL.
> > + */
> > +void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
> > +{
> > +        gfp_t alloc_gfp = __GFP_NOWARN | __GFP_NOMEMALLOC | gfp_flags;
> > +        struct kmem_cache *s;
> > +        bool can_retry = true;
> > +        void *ret = ERR_PTR(-EBUSY);
> > +
> > +        VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO));
> > +
> > +        if (unlikely(!size))
> > +                return ZERO_SIZE_PTR;
> > +
> > +        if (!USE_LOCKLESS_FAST_PATH() && (in_nmi() || in_hardirq()))
>
> Nit: maybe just use explicit PREEMPT_RT checks when the code isn't about
> lockless fastpaths?

True. I wasn't sure what's better.
do_slab_free() does
    if (USE_LOCKLESS_FAST_PATH())
but it's really meant to be PREEMPT_RT, since the 'else' part doesn't make
sense otherwise. It's doing local_lock() without _irqsave(), which is
inconsistent with everything else and looks broken when one doesn't have
knowledge of local_lock_internal.h.
This patch fixes that part:
-               local_lock(&s->cpu_slab->lock);
+               local_lock_cpu_slab(s, &flags);
Here, in this hunk, if (IS_ENABLED(CONFIG_PREEMPT_RT)) might look better
indeed.

> > +                /* kmalloc_nolock() in PREEMPT_RT is not supported from irq */
> > +                return NULL;
> > +retry:
> > +        if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
> > +                return NULL;
> > +        s = kmalloc_slab(size, NULL, alloc_gfp, _RET_IP_);
> > +
> > +        if (!(s->flags & __CMPXCHG_DOUBLE) && !kmem_cache_debug(s))
> > +                /*
> > +                 * kmalloc_nolock() is not supported on architectures that
> > +                 * don't implement cmpxchg16b, but debug caches don't use
> > +                 * per-cpu slab and per-cpu partial slabs. They rely on
> > +                 * kmem_cache_node->list_lock, so kmalloc_nolock() can
> > +                 * attempt to allocate from debug caches by
> > +                 * spin_trylock_irqsave(&n->list_lock, ...)
> > +                 */
> > +                return NULL;
> > +
> > +        /*
> > +         * Do not call slab_alloc_node(), since trylock mode isn't
> > +         * compatible with slab_pre_alloc_hook/should_failslab and
> > +         * kfence_alloc. Hence call __slab_alloc_node() (at most twice)
> > +         * and slab_post_alloc_hook() directly.
> > +         *
> > +         * In !PREEMPT_RT ___slab_alloc() manipulates (freelist,tid) pair
> > +         * in irq saved region. It assumes that the same cpu will not
> > +         * __update_cpu_freelist_fast() into the same (freelist,tid) pair.
> > +         * Therefore use in_nmi() to check whether particular bucket is in
> > +         * irq protected section.
> > +         *
> > +         * If in_nmi() && local_lock_is_locked(s->cpu_slab) then it means that
> > +         * this cpu was interrupted somewhere inside ___slab_alloc() after
> > +         * it did local_lock_irqsave(&s->cpu_slab->lock, flags).
> > +         * In this case fast path with __update_cpu_freelist_fast() is not safe.
> > +         */
> > +#ifndef CONFIG_SLUB_TINY
> > +        if (!in_nmi() || !local_lock_is_locked(&s->cpu_slab->lock))
> > +#endif
> > +                ret = __slab_alloc_node(s, alloc_gfp, node, _RET_IP_, size);
>
> Nit: use IS_DEFINED(CONFIG_SLUB_TINY) to make this look better?

ok.
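For context, the tail of the function (trimmed above) then retries at most
once in the next size bucket, roughly:

        if (PTR_ERR(ret) == -EBUSY) {
                if (can_retry) {
                        /* bump the size to pick the next kmalloc bucket */
                        size = s->object_size + 1;
                        can_retry = false;
                        goto retry;
                }
                ret = NULL;
        }
        /* slab_post_alloc_hook() etc., then */
        return ret;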
> > +static void defer_deactivate_slab(struct slab *slab)
> > +{
>
> Nit: for more consistency this could take the freelist argument and assign
> it here, and not in the caller.

ok.

> > +        struct defer_free *df = this_cpu_ptr(&defer_free_objects);
> > +
> > +        if (llist_add(&slab->llnode, &df->slabs))
> > +                irq_work_queue(&df->work);
> > +}
> > +
> > +void defer_free_barrier(void)
> > +{
> > +        int cpu;
> > +
> > +        for_each_possible_cpu(cpu)
> > +                irq_work_sync(&per_cpu_ptr(&defer_free_objects, cpu)->work);
> > +}
> > +
> >  #ifndef CONFIG_SLUB_TINY
> >  /*
> >   * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
> > @@ -4575,6 +4857,8 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
> >                                           struct slab *slab, void *head, void *tail,
> >                                           int cnt, unsigned long addr)
> >  {
> > +        /* cnt == 0 signals that it's called from kfree_nolock() */
> > +        bool allow_spin = cnt;
> >          struct kmem_cache_cpu *c;
> >          unsigned long tid;
> >          void **freelist;
> > @@ -4593,10 +4877,30 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
> >          barrier();
> >
> >          if (unlikely(slab != c->slab)) {
> > -                __slab_free(s, slab, head, tail, cnt, addr);
> > +                if (unlikely(!allow_spin)) {
> > +                        /*
> > +                         * __slab_free() can locklessly cmpxchg16 into a slab,
> > +                         * but then it might need to take spin_lock or local_lock
> > +                         * in put_cpu_partial() for further processing.
> > +                         * Avoid the complexity and simply add to a deferred list.
> > +                         */
> > +                        defer_free(s, head);
> > +                } else {
> > +                        __slab_free(s, slab, head, tail, cnt, addr);
> > +                }
> >                  return;
> >          }
> >
> > +        if (unlikely(!allow_spin)) {
> > +                if ((in_nmi() || !USE_LOCKLESS_FAST_PATH()) &&
>
> Same nit about USE_LOCKLESS_FAST_PATH

Here I have to disagree, unless we fix the couple of lines below as well.

> > +                    local_lock_is_locked(&s->cpu_slab->lock)) {
> > +                        defer_free(s, head);
> > +                        return;
> > +                }
> > +                cnt = 1; /* restore cnt. kfree_nolock() frees one object at a time */
> > +                kasan_slab_free(s, head, false, false, /* skip quarantine */true);
> > +        }
> > +
> >          if (USE_LOCKLESS_FAST_PATH()) {
> >                  freelist = READ_ONCE(c->freelist);
> >
> > @@ -4607,8 +4911,10 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
> >                          goto redo;
> >                  }
> >          } else {
> > +                __maybe_unused long flags = 0;
> > +
> >                  /* Update the free list under the local lock */
> > -                local_lock(&s->cpu_slab->lock);
> > +                local_lock_cpu_slab(s, &flags);
> >                  c = this_cpu_ptr(s->cpu_slab);
> >                  if (unlikely(slab != c->slab)) {
> >                          local_unlock(&s->cpu_slab->lock);
> > @@ -4621,7 +4927,7 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
> >                  c->freelist = head;
> >                  c->tid = next_tid(tid);
> >
> > -                local_unlock(&s->cpu_slab->lock);
> > +                local_unlock_cpu_slab(s, &flags);
> >          }
> >          stat_add(s, FREE_FASTPATH, cnt);
> >  }
> > @@ -4844,6 +5150,62 @@ void kfree(const void *object)
> >  }
> >  EXPORT_SYMBOL(kfree);
> >
> > +/*
> > + * Can be called while holding raw_spinlock_t or from IRQ and NMI,
> > + * but only for objects allocated by kmalloc_nolock(),
> > + * since some debug checks (like kmemleak and kfence) were
> > + * skipped on allocation. large_kmalloc is not supported either.
> > + */
> > +void kfree_nolock(const void *object)
> > +{
> > +        struct folio *folio;
> > +        struct slab *slab;
> > +        struct kmem_cache *s;
> > +        void *x = (void *)object;
> > +
> > +        if (unlikely(ZERO_OR_NULL_PTR(object)))
> > +                return;
> > +
> > +        folio = virt_to_folio(object);
> > +        if (unlikely(!folio_test_slab(folio))) {
> > +                WARN(1, "Buggy usage of kfree_nolock");
> > +                return;
> > +        }
> > +
> > +        slab = folio_slab(folio);
> > +        s = slab->slab_cache;
> > +
> > +        memcg_slab_free_hook(s, slab, &x, 1);
> > +        alloc_tagging_slab_free_hook(s, slab, &x, 1);
> > +        /*
> > +         * Unlike slab_free() do NOT call the following:
> > +         * kmemleak_free_recursive(x, s->flags);
> > +         * debug_check_no_locks_freed(x, s->object_size);
> > +         * debug_check_no_obj_freed(x, s->object_size);
> > +         * __kcsan_check_access(x, s->object_size, ..);
> > +         * kfence_free(x);
> > +         * since they take spinlocks.
> > +         */
>
> So here's the bigger concern. What if someone allocates with regular
> kmalloc() so that the debugging stuff is performed as usual, and then tries
> to use kfree_nolock() where we skip it? You might not be planning such usage,
> but later someone can realize that only their freeing context is limited,
> find out kfree_nolock() exists and try to use it?
>
> Can we document this strongly enough? Or even enforce it somehow? Or when
> any of these kinds of debugs above are enabled, we play it safe and use
> defer_free()?

Let's break it down one by one.

1. kmemleak_free_recursive() will miss an object that was recorded during
a normal kmalloc(), and that's indeed problematic.

2. debug_check_no_locks_freed() and debug_check_no_obj_freed() are somewhat
harmless. We miss checks, but it's not breaking the corresponding features.

3. __kcsan_check_access() doesn't take locks, but its call stack is so deep
and looks to be recursive that I doubt it's safe from any context.

4. kfence_free() looks like an existing quirk. I'm not sure why it's there
in the slab free path :)
The kfence comment says:

 * KFENCE objects live in a separate page range and are not to be intermixed
 * with regular heap objects (e.g. KFENCE objects must never be added to the
 * allocator freelists). Failing to do so may and will result in heap
 * corruptions, therefore is_kfence_address() must be used to check whether
 * an object requires specific handling.

so it should always be a nop for slab. I removed the call for peace of mind.

So imo only 1 is dodgy. We can add:

        if (!(flags & SLAB_NOLEAKTRACE) && kmemleak_free_enabled)
                defer_free(..);

but it's ugly too.
My preference is to add a comment saying that only objects allocated by
kmalloc_nolock() should be freed by kfree_nolock().
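Something along these lines, exact wording to be polished:

/**
 * kfree_nolock - free an object that was allocated by kmalloc_nolock()
 * @object: pointer returned by kmalloc_nolock()
 *
 * Objects allocated by regular kmalloc()/kmem_cache_alloc() must NOT be
 * freed here, since the debug hooks (kmemleak in particular) that ran at
 * allocation time are skipped on this path.
 */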