Re: [PATCH net-next 1/3] netfilter: Make xt_table::private RCU protected.

Florian Westphal <fw@xxxxxxxxx> · Mon, 17 Feb 2025 16:35:48 +0100

Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx> wrote:
> > > This can also be achieved with RCU: Each reader of the private pointer
> > > will be with in an RCU read section. The new pointer will be published
> > > with rcu_assign_pointer() and synchronize_rcu() is used to wait until
> > > each reader left its critical section.
> > 
> > Note we had this before and it was reverted in
> > d3d40f237480 ("Revert "netfilter: x_tables: Switch synchronization to RCU"")
> > 
> > I'm not saying its wrong, but I think you need a plan b when the same
> > complaints wrt table replace slowdown come in.
> > 
> > And unfortunately I can't find a solution for this, unless we keep
> > either the existing wait-scheme for counters sake or we accept
> > that some counter update might be lost between copy to userland and
> > destruction (it would need to use rcu_work or similar, the xt
> > target/module destroy callbacks can sleep).
> 
> Urgh. Is this fast & frequent update a real-world thing or a benchmark
> of some sort? I mean adding rule after rule is possible but…

No idea, its certainly bad design (iptables-restore exists for a
reason).

> I used here synchronize_rcu() and there is also
> synchronize_rcu_expedited() but I do hate it. With everything.

Agreed.

> What are the counters used in userland for? I've seen that they are
> copied but did not understood why.
>   iptables-legacy -A INPUT -j ACCEPT
> ends up in xt_replace_table() but iptables-nft doesn't. Different
> interface, better design? Or I just used legacy and now it is considered
> as the only?

iptables-nft uses the nftables netlink API, I don't think it suffers
from the preempt_rt issues you're resolving in the setsockopt api.

It won't go anywhere near xt_replace_table and it doesn't use the
xtables binary format on the user/kernel api side.

> I see two invocations on iptables-legacy-restore.
> 
> But the question remains: Why copy counters after replacing the rule
> set?

Good question.  What I can gather from
https://git.netfilter.org/iptables/tree/libiptc/libiptc.c#n2597

After replace we get copy of the old counters, depending on mode
we can force-update counters post-replace in kernel, via

	ret = setsockopt(handle->sockfd, TC_IPPROTO, SO_SET_ADD_COUNTERS,
			 newcounters, counterlen);

I suspect this is whats used by 'iptables -L -v -Z INPUT'.
But I don't know if anyone uses this in practice.

Maybe Pablo remembers the details, this is ancient code by kernel
standards.

I think we might get away with losing counters on the update
side, i.e. rcu_update_pointer() so new CPUs won't find the old
binary blob, copy the counters, then defer destruction via rcu_work
or similar and accept that the counters might have marginally
changed after the copy to userland and before last cpu left its
rcu critical section.