On Tue, Jul 22, 2025 at 8:52 AM Harry Yoo <harry.yoo@xxxxxxxxxx> wrote: > Sorry for the delay. PTO plus merge window. > On Thu, Jul 17, 2025 at 07:16:46PM -0700, Alexei Starovoitov wrote: > > From: Alexei Starovoitov <ast@xxxxxxxxxx> > > > > kmalloc_nolock() relies on ability of local_trylock_t to detect > > the situation when per-cpu kmem_cache is locked. > > I think kmalloc_nolock() should be kmalloc_node_nolock() because > it has `node` parameter? > > # Don't specify NUMA node # Specify NUMA node > kmalloc(size, gfp) kmalloc_nolock(size, gfp) > kmalloc_node(size, gfp, node) kmalloc_node_nolock(size, gfp, node) > > ...just like kmalloc() and kmalloc_node()? I think this is a wrong pattern to follow. All this "_node" suffix/rename was done long ago to avoid massive search-and-replace when NUMA was introduced. Now NUMA is mandatory. The user of API should think what numa_id to use and if none they should explicitly say NUMA_NO_NODE. Hiding behind macros is not a good api. I hate the "_node_align" suffixes too. It's a silly convention. Nothing in the kernel follows such an outdated naming scheme. mm folks should follow what the rest of the kernel does instead of following a pattern from 20 years ago. > > In !PREEMPT_RT local_(try)lock_irqsave(&s->cpu_slab->lock, flags) > > disables IRQs and marks s->cpu_slab->lock as acquired. > > local_lock_is_locked(&s->cpu_slab->lock) returns true when > > slab is in the middle of manipulating per-cpu cache > > of that specific kmem_cache. > > > > kmalloc_nolock() can be called from any context and can re-enter > > into ___slab_alloc(): > > kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> NMI -> bpf -> > > kmalloc_nolock() -> ___slab_alloc(cache_B) > > or > > kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> tracepoint/kprobe -> bpf -> > > kmalloc_nolock() -> ___slab_alloc(cache_B) > > > Hence the caller of ___slab_alloc() checks if &s->cpu_slab->lock > > can be acquired without a deadlock before invoking the function. > > If that specific per-cpu kmem_cache is busy the kmalloc_nolock() > > retries in a different kmalloc bucket. The second attempt will > > likely succeed, since this cpu locked different kmem_cache. > > > > Similarly, in PREEMPT_RT local_lock_is_locked() returns true when > > per-cpu rt_spin_lock is locked by current _task_. In this case > > re-entrance into the same kmalloc bucket is unsafe, and > > kmalloc_nolock() tries a different bucket that is most likely is > > not locked by the current task. Though it may be locked by a > > different task it's safe to rt_spin_lock() and sleep on it. > > > > Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL > > immediately if called from hard irq or NMI in PREEMPT_RT. > > A question; I was confused for a while thinking "If it can't be called > from NMI and hard irq on PREEMPT_RT, why it can't just spin?" It's not safe due to PI issues in RT. Steven and Sebastian explained it earlier: https://lore.kernel.org/bpf/20241213124411.105d0f33@xxxxxxxxxxxxxxxxxx/ I don't think I can copy paste the multi page explanation in commit log or into comments. So "not safe in NMI or hard irq on RT" is the summary. Happy to add a few words, but don't know what exactly to say. If Steven/Sebastian can provide a paragraph I can add it. > And I guess it's because even in process context, when kmalloc_nolock() > is called by bpf, it can be called by the task that is holding the local lock > and thus spinning is not allowed. Is that correct? > > > kfree_nolock() defers freeing to irq_work when local_lock_is_locked() > > and (in_nmi() or in PREEMPT_RT). > > > > SLUB_TINY config doesn't use local_lock_is_locked() and relies on > > spin_trylock_irqsave(&n->list_lock) to allocate, > > while kfree_nolock() always defers to irq_work. > > > > Note, kfree_nolock() must be called _only_ for objects allocated > > with kmalloc_nolock(). Debug checks (like kmemleak and kfence) > > were skipped on allocation, hence obj = kmalloc(); kfree_nolock(obj); > > will miss kmemleak/kfence book keeping and will cause false positives. > > large_kmalloc is not supported by either kmalloc_nolock() > > or kfree_nolock(). > > > > Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx> > > --- > > include/linux/kasan.h | 13 +- > > include/linux/slab.h | 4 + > > mm/Kconfig | 1 + > > mm/kasan/common.c | 5 +- > > mm/slab.h | 6 + > > mm/slab_common.c | 3 + > > mm/slub.c | 466 +++++++++++++++++++++++++++++++++++++----- > > 7 files changed, 445 insertions(+), 53 deletions(-) > > > > diff --git a/mm/slub.c b/mm/slub.c > > index 54444bce218e..7de6da4ee46d 100644 > > --- a/mm/slub.c > > +++ b/mm/slub.c > > @@ -1982,6 +1983,7 @@ static inline void init_slab_obj_exts(struct slab *slab) > > int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s, > > gfp_t gfp, bool new_slab) > > { > > + bool allow_spin = gfpflags_allow_spinning(gfp); > > unsigned int objects = objs_per_slab(s, slab); > > unsigned long new_exts; > > unsigned long old_exts; > > @@ -1990,8 +1992,14 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s, > > gfp &= ~OBJCGS_CLEAR_MASK; > > /* Prevent recursive extension vector allocation */ > > gfp |= __GFP_NO_OBJ_EXT; > > - vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp, > > - slab_nid(slab)); > > + if (unlikely(!allow_spin)) { > > + size_t sz = objects * sizeof(struct slabobj_ext); > > + > > + vec = kmalloc_nolock(sz, __GFP_ZERO, slab_nid(slab)); > > In free_slab_obj_exts(), how do you know slabobj_ext is allocated via > kmalloc_nolock() or kcalloc_node()? Technically kmalloc_nolock()->kfree() isn't as bad as kmalloc()->kfree_nolock(), since kmemleak/kfence can ignore debug free-ing action without matching alloc side, but you're right it's better to avoid it. > I was going to say "add a new flag to enum objext_flags", > but all lower 3 bits of slab->obj_exts pointer are already in use? oh... > > Maybe need a magic trick to add one more flag, > like always align the size with 16? > > In practice that should not lead to increase in memory consumption > anyway because most of the kmalloc-* sizes are already at least > 16 bytes aligned. Yes. That's an option, but I think we can do better. OBJEXTS_ALLOC_FAIL doesn't need to consume the bit. Here are two patches that fix this issue: Subject: [PATCH 1/2] slab: Reuse first bit for OBJEXTS_ALLOC_FAIL Since the combination of valid upper bits in slab->obj_exts with OBJEXTS_ALLOC_FAIL bit can never happen, use OBJEXTS_ALLOC_FAIL == (1ull << 0) as a magic sentinel instead of (1ull << 2) to free up bit 2. Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx> --- include/linux/memcontrol.h | 4 +++- mm/slub.c | 2 +- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 785173aa0739..daa78665f850 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -341,17 +341,19 @@ enum page_memcg_data_flags { __NR_MEMCG_DATA_FLAGS = (1UL << 2), }; +#define __OBJEXTS_ALLOC_FAIL MEMCG_DATA_OBJEXTS #define __FIRST_OBJEXT_FLAG __NR_MEMCG_DATA_FLAGS #else /* CONFIG_MEMCG */ +#define __OBJEXTS_ALLOC_FAIL (1UL << 0) #define __FIRST_OBJEXT_FLAG (1UL << 0) #endif /* CONFIG_MEMCG */ enum objext_flags { /* slabobj_ext vector failed to allocate */ - OBJEXTS_ALLOC_FAIL = __FIRST_OBJEXT_FLAG, + OBJEXTS_ALLOC_FAIL = __OBJEXTS_ALLOC_FAIL, /* the next bit after the last actual flag */ __NR_OBJEXTS_FLAGS = (__FIRST_OBJEXT_FLAG << 1), }; diff --git a/mm/slub.c b/mm/slub.c index bd4bf2613e7a..16e53bfb310e 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1950,7 +1950,7 @@ static inline void handle_failed_objexts_alloc(unsigned long obj_exts, * objects with no tag reference. Mark all references in this * vector as empty to avoid warnings later on. */ - if (obj_exts & OBJEXTS_ALLOC_FAIL) { + if (obj_exts == OBJEXTS_ALLOC_FAIL) { unsigned int i; for (i = 0; i < objects; i++) -- 2.47.3 Subject: [PATCH 2/2] slab: Use kfree_nolock() to free obj_exts Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx> --- include/linux/memcontrol.h | 1 + mm/slub.c | 7 ++++++- 2 files changed, 7 insertions(+), 1 deletion(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index daa78665f850..2e6c33fdd9c5 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -354,6 +354,7 @@ enum page_memcg_data_flags { enum objext_flags { /* slabobj_ext vector failed to allocate */ OBJEXTS_ALLOC_FAIL = __OBJEXTS_ALLOC_FAIL, + OBJEXTS_NOSPIN_ALLOC = __FIRST_OBJEXT_FLAG, /* the next bit after the last actual flag */ __NR_OBJEXTS_FLAGS = (__FIRST_OBJEXT_FLAG << 1), }; diff --git a/mm/slub.c b/mm/slub.c index 16e53bfb310e..417d647f1f02 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -2009,6 +2009,8 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s, } new_exts = (unsigned long)vec; + if (unlikely(!allow_spin)) + new_exts |= OBJEXTS_NOSPIN_ALLOC; #ifdef CONFIG_MEMCG new_exts |= MEMCG_DATA_OBJEXTS; #endif @@ -2056,7 +2058,10 @@ static inline void free_slab_obj_exts(struct slab *slab) * the extension for obj_exts is expected to be NULL. */ mark_objexts_empty(obj_exts); - kfree(obj_exts); + if (unlikely(READ_ONCE(slab->obj_exts) & OBJEXTS_NOSPIN_ALLOC)) + kfree_nolock(obj_exts); + else + kfree(obj_exts); slab->obj_exts = 0; } -- 2.47.3 > > + } else { > > + vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp, > > + slab_nid(slab)); > > + } > > if (!vec) { > > /* Mark vectors which failed to allocate */ > > if (new_slab) > > +static void defer_deactivate_slab(struct slab *slab, void *flush_freelist); > > + > > /* > > * Called only for kmem_cache_debug() caches to allocate from a freshly > > * allocated slab. Allocate a single object instead of whole freelist > > * and put the slab to the partial (or full) list. > > */ > > -static void *alloc_single_from_new_slab(struct kmem_cache *s, > > - struct slab *slab, int orig_size) > > +static void *alloc_single_from_new_slab(struct kmem_cache *s, struct slab *slab, > > + int orig_size, gfp_t gfpflags) > > { > > + bool allow_spin = gfpflags_allow_spinning(gfpflags); > > int nid = slab_nid(slab); > > struct kmem_cache_node *n = get_node(s, nid); > > unsigned long flags; > > void *object; > > > > + if (!allow_spin && !spin_trylock_irqsave(&n->list_lock, flags)) { > > I think alloc_debug_processing() doesn't have to be called under > n->list_lock here because it is a new slab? > > That means the code can be something like: > > /* allocate one object from slab */ > object = slab->freelist; > slab->freelist = get_freepointer(s, object); > slab->inuse = 1; > > /* Leak slab if debug checks fails */ > if (!alloc_debug_processing()) > return NULL; > > /* add slab to per-node partial list */ > if (allow_spin) { > spin_lock_irqsave(); > } else if (!spin_trylock_irqsave()) { > slab->frozen = 1; > defer_deactivate_slab(); > } No. That doesn't work. I implemented it this way before reverting back to spin_trylock_irqsave() in the beginning. The problem is alloc_debug_processing() will likely succeed and undoing it is pretty complex. So it's better to "!allow_spin && !spin_trylock_irqsave()" before doing expensive and hard to undo alloc_debug_processing(). > > > + /* Unlucky, discard newly allocated slab */ > > + slab->frozen = 1; > > + defer_deactivate_slab(slab, NULL); > > + return NULL; > > + } > > > > object = slab->freelist; > > slab->freelist = get_freepointer(s, object); > > slab->inuse = 1; > > > > - if (!alloc_debug_processing(s, slab, object, orig_size)) > > + if (!alloc_debug_processing(s, slab, object, orig_size)) { > > /* > > * It's not really expected that this would fail on a > > * freshly allocated slab, but a concurrent memory > > * corruption in theory could cause that. > > + * Leak memory of allocated slab. > > */ > > + if (!allow_spin) > > + spin_unlock_irqrestore(&n->list_lock, flags); > > return NULL; > > + } > > > > - spin_lock_irqsave(&n->list_lock, flags); > > + if (allow_spin) > > + spin_lock_irqsave(&n->list_lock, flags); > > > > if (slab->inuse == slab->objects) > > add_full(s, n, slab); > > @@ -3164,6 +3201,44 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab, > > } > > } > > > > +/* > > + * ___slab_alloc()'s caller is supposed to check if kmem_cache::kmem_cache_cpu::lock > > + * can be acquired without a deadlock before invoking the function. > > + * > > + * Without LOCKDEP we trust the code to be correct. kmalloc_nolock() is > > + * using local_lock_is_locked() properly before calling local_lock_cpu_slab(), > > + * and kmalloc() is not used in an unsupported context. > > + * > > + * With LOCKDEP, on PREEMPT_RT lockdep does its checking in local_lock_irqsave(). > > + * On !PREEMPT_RT we use trylock to avoid false positives in NMI, but > > + * lockdep_assert() will catch a bug in case: > > + * #1 > > + * kmalloc() -> ___slab_alloc() -> irqsave -> NMI -> bpf -> kmalloc_nolock() > > + * or > > + * #2 > > + * kmalloc() -> ___slab_alloc() -> irqsave -> tracepoint/kprobe -> bpf -> kmalloc_nolock() > > + * > > + * On PREEMPT_RT an invocation is not possible from IRQ-off or preempt > > + * disabled context. The lock will always be acquired and if needed it > > + * block and sleep until the lock is available. > > + * #1 is possible in !PREEMP_RT only. > > s/PREEMP_RT/PREEMPT_RT/ ok. > > + * #2 is possible in both with a twist that irqsave is replaced with rt_spinlock: > > + * kmalloc() -> ___slab_alloc() -> rt_spin_lock(kmem_cache_A) -> > > + * tracepoint/kprobe -> bpf -> kmalloc_nolock() -> rt_spin_lock(kmem_cache_B) > > + * > > + * local_lock_is_locked() prevents the case kmem_cache_A == kmem_cache_B > > + */ > > +#if defined(CONFIG_PREEMPT_RT) || !defined(CONFIG_LOCKDEP) > > +#define local_lock_cpu_slab(s, flags) \ > > + local_lock_irqsave(&(s)->cpu_slab->lock, flags) > > +#else > > +#define local_lock_cpu_slab(s, flags) \ > > + lockdep_assert(local_trylock_irqsave(&(s)->cpu_slab->lock, flags)) > > +#endif > > + > > +#define local_unlock_cpu_slab(s, flags) \ > > + local_unlock_irqrestore(&(s)->cpu_slab->lock, flags) > > + > > #ifdef CONFIG_SLUB_CPU_PARTIAL > > static void __put_partials(struct kmem_cache *s, struct slab *partial_slab) > > { > > @@ -3732,9 +3808,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, > > if (unlikely(!node_match(slab, node))) { > > /* > > * same as above but node_match() being false already > > - * implies node != NUMA_NO_NODE > > + * implies node != NUMA_NO_NODE. > > + * Reentrant slub cannot take locks necessary to > > + * deactivate_slab, hence ignore node preference. > > Now that we have defer_deactivate_slab(), we need to either update the > code or comment? > > 1. Deactivate slabs when node / pfmemalloc mismatches > or 2. Update comments to explain why it's still undesirable Well, defer_deactivate_slab() is a heavy hammer. In !SLUB_TINY it pretty much never happens. This bit: retry_load_slab: local_lock_cpu_slab(s, flags); if (unlikely(c->slab)) { is very rare. I couldn't trigger it at all in my stress test. But in this hunk the node mismatch is not rare, so ignoring node preference for kmalloc_nolock() is a much better trade off. I'll add a comment that defer_deactivate_slab() is undesired here. > > + * kmalloc_nolock() doesn't allow __GFP_THISNODE. > > */ > > - if (!node_isset(node, slab_nodes)) { > > + if (!node_isset(node, slab_nodes) || > > + !allow_spin) { > > node = NUMA_NO_NODE; > > } else { > > stat(s, ALLOC_NODE_MISMATCH); > > > @@ -4572,6 +4769,98 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab, > > discard_slab(s, slab); > > } > > > > +/* > > + * In PREEMPT_RT irq_work runs in per-cpu kthread, so it's safe > > + * to take sleeping spin_locks from __slab_free() and deactivate_slab(). > > + * In !PREEMPT_RT irq_work will run after local_unlock_irqrestore(). > > + */ > > +static void free_deferred_objects(struct irq_work *work) > > +{ > > + struct defer_free *df = container_of(work, struct defer_free, work); > > + struct llist_head *objs = &df->objects; > > + struct llist_head *slabs = &df->slabs; > > + struct llist_node *llnode, *pos, *t; > > + > > + if (llist_empty(objs) && llist_empty(slabs)) > > + return; > > + > > + llnode = llist_del_all(objs); > > + llist_for_each_safe(pos, t, llnode) { > > + struct kmem_cache *s; > > + struct slab *slab; > > + void *x = pos; > > + > > + slab = virt_to_slab(x); > > + s = slab->slab_cache; > > + > > + /* > > + * We used freepointer in 'x' to link 'x' into df->objects. > > + * Clear it to NULL to avoid false positive detection > > + * of "Freepointer corruption". > > + */ > > + *(void **)x = NULL; > > + > > + /* Point 'x' back to the beginning of allocated object */ > > + x -= s->offset; > > + /* > > + * memcg, kasan_slab_pre are already done for 'x'. > > + * The only thing left is kasan_poison. > > + */ > > + kasan_slab_free(s, x, false, false, true); > > + __slab_free(s, slab, x, x, 1, _THIS_IP_); > > + } > > + > > + llnode = llist_del_all(slabs); > > + llist_for_each_safe(pos, t, llnode) { > > + struct slab *slab = container_of(pos, struct slab, llnode); > > + > > +#ifdef CONFIG_SLUB_TINY > > + discard_slab(slab->slab_cache, slab); > > ...and with my comment on alloc_single_from_new_slab(), > The slab may not be empty anymore? Exactly. That's another problem with your suggestion in alloc_single_from_new_slab(). That's why I did it as: if (!allow_spin && !spin_trylock_irqsave(...) and I still believe it's the right call. The slab is empty here, so discard_slab() is appropriate. > > > +#else > > + deactivate_slab(slab->slab_cache, slab, slab->flush_freelist); > > +#endif > > + } > > +} > > @@ -4610,10 +4901,30 @@ static __always_inline void do_slab_free(struct kmem_cache *s, > > barrier(); > > > > if (unlikely(slab != c->slab)) { > > - __slab_free(s, slab, head, tail, cnt, addr); > > + if (unlikely(!allow_spin)) { > > + /* > > + * __slab_free() can locklessly cmpxchg16 into a slab, > > + * but then it might need to take spin_lock or local_lock > > + * in put_cpu_partial() for further processing. > > + * Avoid the complexity and simply add to a deferred list. > > + */ > > + defer_free(s, head); > > + } else { > > + __slab_free(s, slab, head, tail, cnt, addr); > > + } > > return; > > } > > > > + if (unlikely(!allow_spin)) { > > + if ((in_nmi() || !USE_LOCKLESS_FAST_PATH()) && > > + local_lock_is_locked(&s->cpu_slab->lock)) { > > + defer_free(s, head); > > + return; > > + } > > + cnt = 1; /* restore cnt. kfree_nolock() frees one object at a time */ > > + kasan_slab_free(s, head, false, false, /* skip quarantine */true); > > + } > > I'm not sure what prevents below from happening > > 1. slab == c->slab && !allow_spin -> call kasan_slab_free() > 2. preempted by something and resume > 3. after acquiring local_lock, slab != c->slab, release local_lock, goto redo > 4. !allow_spin, so defer_free() will call kasan_slab_free() again later yes. it's possible and it's ok. kasan_slab_free(, no_quarantine == true) is poison_slab_object() only and one can do it many times. > Perhaps kasan_slab_free() should be called before do_slab_free() > just like normal free path and do not call kasan_slab_free() in deferred > freeing (then you may need to disable KASAN while accessing the deferred > list)? I tried that too and didn't like it. After kasan_slab_free() the object cannot be put on the llist easily. One needs to do a kasan_reset_tag() dance which uglifies the code a lot. Double kasan poison in a rare case is fine. There is no harm.