On Wed, Jul 16, 2025 at 05:59:41PM +0200, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx> wrote:
> > On Mon, Jul 14, 2025 at 04:36:35PM +0200, Florian Westphal wrote:
> > > Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx> wrote:
> > > > On Thu, Jul 03, 2025 at 04:21:51PM +0200, Florian Westphal wrote:
> > > > > Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx> wrote:
> > > > > > Thanks for the description, this scenario is esoteric.
> > > > > >
> > > > > > Is this bug fully reproducible?
> > > > >
> > > > > No. Unicorn. Only happened once.
> > > > > Everything is based off reading the backtrace and vmcore.
> > > >
> > > > I guess this needs a chaos monkey to trigger this bug.
> > > > Else, can we try to catch this unicorn again?
> > >
> > > I would not hold my breath. But I don't see anything that prevents the
> > > race described in 4/4, and all the things match in the vmcore, including
> > > increment of clash resolution counter. If you think it's too perfect
> > > then ok, we can keep 4/4 back until someone else reports this problem
> > > again.
> >
> > Hm, I think your sequence is possible, it is the SLAB_TYPESAFE_BY_RCU rule
> > that allows for this to occur.
> >
> > Could this rare sequence still happen?
> >
> > cpu x                   cpu y                   cpu z
> >  found entry E          found entry E
> >  E is expired           <preemption>
> >  nf_ct_delete()
> >  return E to rcu slab
> >                                                 init_conntrack
> >                                                 <preemption> NOTE: ct->status not yet set to zero
> >
> > cpu y resumes, it observes E as expired but CONFIRMED:
> >                         <resumes>
> >                         nf_ct_expired()
> >                          -> yes (ct->timeout is 30s)
> >                         confirmed bit set.
>
> Yes, that can happen, but then the refcount can't be incremented
> as it's 0 (-> entry is skipped).

Right, refcount zero prevents it.

static void nf_ct_gc_expired(struct nf_conn *ct)
{
        if (!refcount_inc_not_zero(&ct->ct_general.use))
                return;

> If it's nonzero but the object was returned
> by the kmem cache we have a different kind of bug (free with refcount > 0),
> or use-after-free.

OK, thanks for explaining, use set_bit() and post v2.
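
For anyone following along, the refcount argument above can be sketched in
userspace. This is only an illustration using C11 atomics, not kernel code:
struct fake_ct, fake_inc_not_zero(), fake_put() and fake_gc_expired() are
hypothetical names standing in for struct nf_conn, refcount_inc_not_zero(),
the reference drop and nf_ct_gc_expired(). It assumes a freed object always
has a use count of zero, which is the invariant discussed in the thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct nf_conn; only the fields needed here. */
struct fake_ct {
	atomic_uint use;	/* plays the role of ct->ct_general.use */
	unsigned long status;	/* may hold stale bits while the slot is recycled */
};

/* Userspace analogue of refcount_inc_not_zero(): take a reference only
 * if at least one reference already exists. */
static bool fake_inc_not_zero(atomic_uint *r)
{
	unsigned int old = atomic_load_explicit(r, memory_order_relaxed);

	while (old != 0) {
		if (atomic_compare_exchange_weak_explicit(r, &old, old + 1,
							  memory_order_acquire,
							  memory_order_relaxed))
			return true;
		/* a failed CAS reloaded 'old'; retry unless it dropped to 0 */
	}
	return false;
}

/* Drop a reference previously taken with fake_inc_not_zero(). */
static void fake_put(atomic_uint *r)
{
	unsigned int old = atomic_fetch_sub_explicit(r, 1, memory_order_release);

	assert(old > 0);	/* dropping below zero would be a refcount bug */
	(void)old;
}

/* Same shape as the quoted prologue: returns true if the entry was
 * examined, false if it was skipped because the refcount was zero. */
static bool fake_gc_expired(struct fake_ct *ct)
{
	if (!fake_inc_not_zero(&ct->use))
		return false;	/* freed or still being re-initialised: skip */

	/* only past this point may ->status and the timeout be trusted */
	fake_put(&ct->use);
	return true;
}
```

With use == 0 and a stale status value, fake_gc_expired() returns false and
leaves the counter untouched, which is the "entry is skipped" case. With
use == 1 it takes the reference, looks at the entry and drops it again.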