On Mon, Sep 8, 2025 at 7:05 PM Harry Yoo <harry.yoo@xxxxxxxxxx> wrote:
>
> On Mon, Sep 08, 2025 at 05:08:43PM -0700, Alexei Starovoitov wrote:
> > On Tue, Aug 12, 2025 at 10:08 AM Harry Yoo <harry.yoo@xxxxxxxxxx> wrote:
> >
> > Sorry for the delay. I addressed all other comments
> > and will respin soon.
>
> No worries! Welcome back.
>
> > Only below question remains..
> >
> > > > > > > {
> > > > > > > @@ -3732,9 +3808,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> > > > > > >         if (unlikely(!node_match(slab, node))) {
> > > > > > >                 /*
> > > > > > >                  * same as above but node_match() being false already
> > > > > > > -                * implies node != NUMA_NO_NODE
> > > > > > > +                * implies node != NUMA_NO_NODE.
> > > > > > > +                * Reentrant slub cannot take locks necessary to
> > > > > > > +                * deactivate_slab, hence ignore node preference.
> > > > > > Now that we have defer_deactivate_slab(), we need to either update the
> > > > > > code or comment?
> > > > > >
> > > > > > 1. Deactivate slabs when node / pfmemalloc mismatches
> > > > > > or 2. Update comments to explain why it's still undesirable
> > > > > Well, defer_deactivate_slab() is a heavy hammer.
> > > > > In !SLUB_TINY it pretty much never happens.
> > > > >
> > > > > This bit:
> > > > >
> > > > > retry_load_slab:
> > > > >
> > > > >         local_lock_cpu_slab(s, flags);
> > > > >         if (unlikely(c->slab)) {
> > > > >
> > > > > is very rare. I couldn't trigger it at all in my stress test.
> > > > >
> > > > > But in this hunk the node mismatch is not rare, so ignoring node preference
> > > > > for kmalloc_nolock() is a much better trade off.
> > > But users would have requested that specific node instead of
> > > NUMA_NO_NODE because (at least) they think it's worth it.
> > > (e.g., allocating kernel data structures tied to specified node)
> > >
> > > I don't understand why kmalloc()/kmem_cache_alloc() try harder
> > > (by deactivating cpu slab) to respect the node parameter,
> > > but kmalloc_nolock() does not.
> >
> > Because kmalloc_nolock() tries to be as least intrusive as possible
> > to kmalloc slabs that the rest of the kernel is using.
> >
> > There won't be kmem_cache_alloc _nolock() version, because
> > the algorithm retries from a different bucket when the primary one
> > is locked. So it's only kmalloc_nolock() flavor and it takes
> > from generic kmalloc slab buckets with or without memcg.
> >
> > My understanding that c->slab is effectively a cache and in the long
> > run all c->slab-s should be stable.
>
> You're right and that's what makes it inefficient when users call
> kmalloc_node() or kmem_cache_alloc_node() every time with different
> node id because c->slab will be deactivated too often.

Exactly.

> > A given cpu should be kmalloc-ing the memory suitable for this local cpu.
> > In that sense deactivate_slab is a heavy hammer. kmalloc_nolock()
> > is for users who cannot control their running context. imo such
> > users shouldn't affect the cache property of c->slab hence ignoring
> > node preference for !allow_spin is not great, but imo it's a better
> > trade off than defer_deactivate_slab.
>
> The assumption here is that calling kmalloc_node() with a specific
> node other than the local node is a pretty niche case. And thus
> kmalloc_nolock() does not want to affect existing kmalloc() users.

yes.
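For anyone jumping into the thread here: the hunk we keep referring to
boils down to roughly the following shape (my paraphrase rather than
the exact diff; allow_spin is false on the kmalloc_nolock() /
reentrant path):

        if (unlikely(!node_match(slab, node))) {
                /* !node_match() already implies node != NUMA_NO_NODE */
                if (!node_isset(node, slab_nodes) || !allow_spin) {
                        /*
                         * Reentrant slub cannot take the locks needed to
                         * deactivate the cpu slab, so keep serving from
                         * c->slab and ignore the node preference.
                         */
                        node = NUMA_NO_NODE;
                } else {
                        stat(s, ALLOC_NODE_MISMATCH);
                        goto deactivate_slab;
                }
        }

iow the !allow_spin case is folded into the "node has no slab memory"
fallback instead of paying for deactivate_slab().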
> But given that assumption and your reasoning, even normal kmalloc_node()
> (perhaps even kmem_cache_alloc_node()) users shouldn't fill c->slab with
> a slab from a remote node then? Since most of users should be allocating
> memory from the local node anyway.

Hard to say. I think kmem_cache_alloc_node() users created kmem_cache
for their specific needs and node is more likely an actual node id
that comes all the way from user space.
At least this is how bpf maps are operating. node id is provided at
map creation time, then all map elements are preallocated from that
node, and in run-time elements are recycled through the prealloc area.
numa id is critical for performance for some bpf map use cases like
networking, but such strict numa hurts tracing use cases where
elements should be local to cpu.

So I'd keep deactivate_slab behavior for kmem_cache_alloc_node()
and try hard to allocate from specified numa id,
but I would do an audit of kmalloc_node() users (they aren't many)
and see whether "ignore node id when c->slab is different" approach
is a good thing.
On a quick glance net/core/skbuff.c might prefer to allocate from
local cpu instead of deactivate and allocating the whole new slab
which might fail.

>
> > defer_deactivate_slab() is there for a rare race in retry_load_slab.
> > It can be done for !node_match(c->slab, node) too,
> > but it feels like a worse-r evil. Especially since kmalloc_nolock()
> > doesn't support __GFP_THISNODE.
>
> ...maybe it is fair to ignore node preference for kmalloc_node()
> in a sense that it isn't great to trylock n->list_lock and fail, then
> allocate new slabs (even when there are partial slabs available for the
> node).

Yep. Exactly my thinking.
At least that's my preference for kmalloc_nolock() and it is arguably
a better choice for generic kmalloc(), but probably not for
kmem_cache_alloc_node(). To make such a decision for kmalloc we'd need
to collect a ton of data. It's hard to optimize for all possible cases.
While for kmalloc_nolock() BPF will be the main user initially and
we have a better understanding of trade offs.
We can change all that in the future, of course, if intuition turns
out to be incorrect.
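To spell out the n->list_lock point in code form as well (again just
a sketch under the same allow_spin convention, not the exact diff),
inside a get_partial_node()-style helper it would look roughly like:

        struct kmem_cache_node *n = get_node(s, node);
        unsigned long flags;

        if (!allow_spin) {
                /*
                 * Reentrant path: never spin on list_lock. When the
                 * trylock fails, behave as if the node had no partial
                 * slabs and let the caller allocate a fresh slab
                 * (which may itself fail).
                 */
                if (!spin_trylock_irqsave(&n->list_lock, flags))
                        return NULL;
        } else {
                spin_lock_irqsave(&n->list_lock, flags);
        }
        /* ... take slab(s) off n->partial ... */
        spin_unlock_irqrestore(&n->list_lock, flags);

Failing that trylock and then allocating a brand new slab while
n->partial still has slabs is exactly the waste described above.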