On 7/17/25 04:50, Alexei Starovoitov wrote:
> On Wed, Jul 16, 2025 at 3:58 AM Vlastimil Babka <vbabka@xxxxxxx> wrote:
>> > In that case retry the operation in a different kmalloc bucket.
>> > The second attempt will likely succeed, since this cpu locked a
>> > different kmem_cache_cpu.
>> >
>> > Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
>> > the per-cpu rt_spin_lock is locked by the current task. In this case
>> > re-entrance into the same kmalloc bucket is unsafe, and kmalloc_nolock()
>> > tries a different bucket that most likely is not locked by the current
>> > task. Though it may be locked by a different task, it's safe to
>> > rt_spin_lock() on it.
>> >
>> > Similar to alloc_pages_nolock(), kmalloc_nolock() returns NULL
>> > immediately if called from hard irq or NMI in PREEMPT_RT.
>> >
>> > kfree_nolock() defers freeing to irq_work when local_lock_is_locked()
>> > and in_nmi(), or in PREEMPT_RT.
>> >
>> > The SLUB_TINY config doesn't use local_lock_is_locked() and relies on
>> > spin_trylock_irqsave(&n->list_lock) to allocate, while kfree_nolock()
>> > always defers to irq_work.
>> >
>> > Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx>
>>
>> Haven't seen an obvious bug for now but will ponder it some more. Meanwhile
>> some nits and maybe one somewhat more serious concern.
>>
>> > +static inline void local_lock_cpu_slab(struct kmem_cache *s, unsigned long *flags)
>> > +{
>> > +	/*
>> > +	 * ___slab_alloc()'s caller is supposed to check if kmem_cache::kmem_cache_cpu::lock
>> > +	 * can be acquired without a deadlock before invoking the function.
>> > +	 *
>> > +	 * On PREEMPT_RT an invocation is not possible from IRQ-off or preempt
>> > +	 * disabled context. The lock will always be acquired and, if needed, it
>> > +	 * will block and sleep until the lock is available.
>> > +	 *
>> > +	 * On !PREEMPT_RT allocations from any context but NMI are safe. The lock
>> > +	 * is always acquired with disabled interrupts, meaning it is always
>> > +	 * possible to acquire it.
>> > +	 * In NMI context it is necessary to check if the lock is acquired. If it is not,
>>
>> This also could mention the bpf instrumentation context?
>
> Ok.
>
>> > +	 * it is safe to acquire it. The trylock semantic is used to tell lockdep
>> > +	 * that we don't spin. The BUG_ON() will not trigger if it is safe to acquire
>> > +	 * the lock.
>> > +	 *
>> > +	 */
>> > +	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
>> > +		BUG_ON(!local_trylock_irqsave(&s->cpu_slab->lock, *flags));
>>
>> Linus might still spot the BUG_ON() and complain; lockdep_assert() would
>> maybe be safer :)
>> Or just use local_lock_irqsave() with !CONFIG_LOCKDEP as well.
>
> Fair enough. Let's save one branch in the critical path.
>
>> Nit: maybe this could be a #define to avoid the unusual need for "&flags"
>> instead of "flags" when calling.
>
> When "bool allow_spin" was there in Sebastian's version it definitely
> looked cleaner as a proper function,
> but now, if (!IS_ENABLED(CONFIG_PREEMPT_RT)) can be
> #ifdef CONFIG_PREEMPT_RT
> and the comment will look normal (without ugly backslashes).
> So yeah. I'll convert it to a macro.

To clarify, the ideal I think would be e.g.

#if defined(CONFIG_PREEMPT_RT) || !defined(CONFIG_LOCKDEP)
	local_lock_irqsave();
#else
	lockdep_assert(local_trylock_irqsave());
#endif

This should mean that without lockdep we just trust the code to be correct
(kmalloc_nolock() using local_lock_held() properly before calling here, and
kmalloc() callers not being in an unsupported context, as before this series)
with no checking on both RT and !RT.
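
Concretely, for the helper in this patch it could look roughly like this
(just a sketch of what I mean, not even compile-tested; the lock spelling is
taken from local_lock_cpu_slab() above):

#if defined(CONFIG_PREEMPT_RT) || !defined(CONFIG_LOCKDEP)
/* RT or no lockdep: take the lock and trust the callers, no extra checking */
#define local_lock_cpu_slab(s, flags) \
	local_lock_irqsave(&(s)->cpu_slab->lock, flags)
#else
/* !RT with lockdep: trylock to avoid NMI false positives, assert otherwise */
#define local_lock_cpu_slab(s, flags) \
	lockdep_assert(local_trylock_irqsave(&(s)->cpu_slab->lock, flags))
#endif

As a macro it would also resolve the "&flags" vs "flags" nit above.
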
With lockdep, on RT lockdep does its checking in local_lock_irqsave()
normally. On !RT we need to use trylock to avoid false positives in nmi, but
lockdep_assert() will still catch a bug in case of a true positive. At least
I hope I got it right...

>> > +	/* kmalloc_nolock() in PREEMPT_RT is not supported from irq */
>> > +	return NULL;
>> > +retry:
>> > +	if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
>> > +		return NULL;
>> > +	s = kmalloc_slab(size, NULL, alloc_gfp, _RET_IP_);
>> > +
>> > +	if (!(s->flags & __CMPXCHG_DOUBLE) && !kmem_cache_debug(s))
>> > +		/*
>> > +		 * kmalloc_nolock() is not supported on architectures that
>> > +		 * don't implement cmpxchg16b, but debug caches don't use
>> > +		 * per-cpu slab and per-cpu partial slabs. They rely on
>> > +		 * kmem_cache_node->list_lock, so kmalloc_nolock() can
>> > +		 * attempt to allocate from debug caches by
>> > +		 * spin_trylock_irqsave(&n->list_lock, ...)
>> > +		 */
>> > +		return NULL;
>> > +
>> > +	/*
>> > +	 * Do not call slab_alloc_node(), since trylock mode isn't
>> > +	 * compatible with slab_pre_alloc_hook/should_failslab and
>> > +	 * kfence_alloc. Hence call __slab_alloc_node() (at most twice)
>> > +	 * and slab_post_alloc_hook() directly.
>> > +	 *
>> > +	 * In !PREEMPT_RT ___slab_alloc() manipulates the (freelist,tid) pair
>> > +	 * in an irq-saved region. It assumes that the same cpu will not
>> > +	 * __update_cpu_freelist_fast() into the same (freelist,tid) pair.
>> > +	 * Therefore use in_nmi() to check whether a particular bucket is in
>> > +	 * an irq protected section.
>> > +	 *
>> > +	 * If in_nmi() && local_lock_is_locked(s->cpu_slab) then it means that
>> > +	 * this cpu was interrupted somewhere inside ___slab_alloc() after
>> > +	 * it did local_lock_irqsave(&s->cpu_slab->lock, flags).
>> > +	 * In this case the fast path with __update_cpu_freelist_fast() is not safe.
>> > +	 */
>> > +#ifndef CONFIG_SLUB_TINY
>> > +	if (!in_nmi() || !local_lock_is_locked(&s->cpu_slab->lock))
>> > +#endif
>> > +		ret = __slab_alloc_node(s, alloc_gfp, node, _RET_IP_, size);
>>
>> Nit: use IS_ENABLED(CONFIG_SLUB_TINY) to make this look better?
>
> ok.
>
>> > +static void defer_deactivate_slab(struct slab *slab)
>> > +{
>>
>> Nit: for more consistency this could take the freelist argument and assign
>> it here, and not in the caller.
>
> ok.
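
I.e. roughly something like this (sketch only; "flush_freelist" is just my
placeholder name for whatever slab field the caller currently assigns):

static void defer_deactivate_slab(struct slab *slab, void *flush_freelist)
{
	struct defer_free *df = this_cpu_ptr(&defer_free_objects);

	/* stash the freelist here instead of in the caller */
	slab->flush_freelist = flush_freelist;	/* placeholder field name */
	if (llist_add(&slab->llnode, &df->slabs))
		irq_work_queue(&df->work);
}
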
>
>> > +	struct defer_free *df = this_cpu_ptr(&defer_free_objects);
>> > +
>> > +	if (llist_add(&slab->llnode, &df->slabs))
>> > +		irq_work_queue(&df->work);
>> > +}
>> > +
>> > +void defer_free_barrier(void)
>> > +{
>> > +	int cpu;
>> > +
>> > +	for_each_possible_cpu(cpu)
>> > +		irq_work_sync(&per_cpu_ptr(&defer_free_objects, cpu)->work);
>> > +}
>> > +
>> >  #ifndef CONFIG_SLUB_TINY
>> >  /*
>> >   * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
>> > @@ -4575,6 +4857,8 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
>> >  				struct slab *slab, void *head, void *tail,
>> >  				int cnt, unsigned long addr)
>> >  {
>> > +	/* cnt == 0 signals that it's called from kfree_nolock() */
>> > +	bool allow_spin = cnt;
>> >  	struct kmem_cache_cpu *c;
>> >  	unsigned long tid;
>> >  	void **freelist;
>> > @@ -4593,10 +4877,30 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
>> >  	barrier();
>> >
>> >  	if (unlikely(slab != c->slab)) {
>> > -		__slab_free(s, slab, head, tail, cnt, addr);
>> > +		if (unlikely(!allow_spin)) {
>> > +			/*
>> > +			 * __slab_free() can locklessly cmpxchg16 into a slab,
>> > +			 * but then it might need to take spin_lock or local_lock
>> > +			 * in put_cpu_partial() for further processing.
>> > +			 * Avoid the complexity and simply add to a deferred list.
>> > +			 */
>> > +			defer_free(s, head);
>> > +		} else {
>> > +			__slab_free(s, slab, head, tail, cnt, addr);
>> > +		}
>> >  		return;
>> >  	}
>> >
>> > +	if (unlikely(!allow_spin)) {
>> > +		if ((in_nmi() || !USE_LOCKLESS_FAST_PATH()) &&
>>
>> Same nit about USE_LOCKLESS_FAST_PATH
>
> Here, I have to disagree unless we fix the couple of lines below as well.

Ah you're right, we can leave it.

>> So here's the bigger concern. What if someone allocates with regular
>> kmalloc() so that the debugging stuff is performed as usual, and then tries
>> to use kfree_nolock() where we skip it? You might not be planning such
>> usage, but later someone might realize that only their freeing context is
>> limited, finds out kfree_nolock() exists and tries to use it?
>>
>> Can we document this strongly enough? Or even enforce it somehow? Or when
>> any of these kinds of debugs above are enabled, we play it safe and use
>> defer_free()?
>
> Let's take it one by one.
>
> 1.
> kmemleak_free_recursive() will miss an object that was recorded
> during normal kmalloc() and that's indeed problematic.
>
> 2.
> debug_check_no_locks_freed() and
> debug_check_no_obj_freed()
> are somewhat harmless.
> We miss checks, but it's not breaking the corresponding features.
>
> 3.
> __kcsan_check_access() doesn't take locks, but its stack is
> so deep and looks to be recursive that I doubt it's safe from
> any context.

But it's also the case of "miss check but not break anything", right?

> 4.
> kfence_free() looks like an existing quirk.
> I'm not sure why it's there in the slab free path :)
> kfence comment says:
>  * KFENCE objects live in a separate page range and are not to be intermixed
>  * with regular heap objects (e.g. KFENCE objects must never be added to the
>  * allocator freelists). Failing to do so may and will result in heap
>  * corruptions, therefore is_kfence_address() must be used to check whether
>  * an object requires specific handling.
>
> so it should always be a nop for slab.

Well, the point of kfence is that it can inject its objects into slab
allocations (so that they are checked, with some probability, for buffer
overflows, UAF etc.), so the slab freeing must then recognize and handle them.
It also wants to be low overhead for production use, so the freeing isn't
doing "if (is_kfence_address())" upfront but deeper. So we could detect them
and defer.

> I removed the call for peace of mind.
>
> So imo only 1 is dodgy. We can add:
> if (!(flags & SLAB_NOLEAKTRACE) && kmemleak_free_enabled)
>	defer_free(..);
>
> but it's ugly too.

Hm, yeah, it looks like kmemleak isn't one of those debugging functionalities
that can be built in with no overhead unless enabled on boot, using static
keys.

> My preference is to add a comment saying that only objects
> allocated by kmalloc_nolock() should be freed by kfree_nolock().

We could go with that until someone has a case for changing this, and then
handle kmemleak and kfence with defer_free()...
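
If we ever do relax the "only free what kmalloc_nolock() allocated" rule, I
imagine the defensive variant would be something along these lines in the
!allow_spin path, combining your kmemleak_free_enabled check with a kfence
check (untested sketch, just to record the idea):

	if (unlikely(!allow_spin) &&
	    (is_kfence_address(head) ||
	     (!(s->flags & SLAB_NOLEAKTRACE) && kmemleak_free_enabled))) {
		/*
		 * The object needs the kfence/kmemleak treatment on free,
		 * which we can't do in trylock mode. Defer to irq_work where
		 * the normal (spinning) free path is fine.
		 */
		defer_free(s, head);
		return;
	}

But documenting the restriction is certainly the simpler option for now.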