On Mon, Jul 14, 2025 at 04:36:35PM +0200, Florian Westphal wrote: > Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx> wrote: > > On Thu, Jul 03, 2025 at 04:21:51PM +0200, Florian Westphal wrote: > > > Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx> wrote: > > > > Thanks for the description, this scenario is esoteric. > > > > > > > > Is this bug fully reproducible? > > > > > > No. Unicorn. Only happened once. > > > Everything is based off reading the backtrace and vmcore. > > > > I guess this needs a chaos money to trigger this bug. Else, can we try to catch this unicorn again? > > I would not hold my breath. But I don't see anything that prevents the > race described in 4/4, and all the things match in the vmcore, including > increment of clash resolution counter. If you think its too perfect > then ok, we can keep 4/4 back until someone else reports this problem > again. Hm, I think your sequence is possible, it is the SLAB_TYPESAFE_BY_RCU rule that allows for this to occur. Could this rare sequence still happen? cpu x cpu y cpu z found entry E found entry E E is expired <preemption> nf_ct_delete() return E to rcu slab init_conntrack <preemption> NOTE: ct->status not yet set to zero cpu y resumes, it observes E as expired but CONFIRMED: <resumes> nf_ct_expired() -> yes (ct->timeout is 30s) confirmed bit set. > > I would push 1/4 and 3/4 to nf.git to start with. Unless you are 100% sure this fix is needed. > > 3/4 needs 2/4 present as well. I can then resend 4/4 then with the Right, I accidentally skipped that test, it should be also included. > > > - ct->status |= IPS_CONFIRMED; > > > + smp_mb__before_atomic(); > > > + set_bit(IPS_CONFIRMED_BIT, &ct->status) ? > > change. If the status bit is used to synchronize the different threads, I agree this needs to be set_bit(). But I am not sure yet this is sufficient yet. Thanks.