On Mon, May 05, 2025 at 12:01:30PM +0200, Daniel Borkmann wrote: > On 5/5/25 11:48 AM, Paul Chaignon wrote: > > When bpf_redirect_peer is used to redirect packets to a device in > > another network namespace, the skb isn't scrubbed. That can lead skb > > information from one namespace to be "misused" in another namespace. > > > > As one example, this is causing Cilium to drop traffic when using > > bpf_redirect_peer to redirect packets that just went through IPsec > > decryption to a container namespace. The following pwru trace shows (1) > > the packet path from the host's XFRM layer to the container's XFRM > > layer where it's dropped and (2) the number of active skb extensions at > > each function. > > > > NETNS MARK IFACE TUPLE FUNC > > 4026533547 d00 eth0 10.244.3.124:35473->10.244.2.158:53 xfrm_rcv_cb > > .active_extensions = (__u8)2, > > 4026533547 d00 eth0 10.244.3.124:35473->10.244.2.158:53 xfrm4_rcv_cb > > .active_extensions = (__u8)2, > > 4026533547 d00 eth0 10.244.3.124:35473->10.244.2.158:53 gro_cells_receive > > .active_extensions = (__u8)2, > > [...] > > 4026533547 0 eth0 10.244.3.124:35473->10.244.2.158:53 skb_do_redirect > > .active_extensions = (__u8)2, > > 4026534999 0 eth0 10.244.3.124:35473->10.244.2.158:53 ip_rcv > > .active_extensions = (__u8)2, > > 4026534999 0 eth0 10.244.3.124:35473->10.244.2.158:53 ip_rcv_core > > .active_extensions = (__u8)2, > > [...] > > 4026534999 0 eth0 10.244.3.124:35473->10.244.2.158:53 udp_queue_rcv_one_skb > > .active_extensions = (__u8)2, > > 4026534999 0 eth0 10.244.3.124:35473->10.244.2.158:53 __xfrm_policy_check > > .active_extensions = (__u8)2, > > 4026534999 0 eth0 10.244.3.124:35473->10.244.2.158:53 __xfrm_decode_session > > .active_extensions = (__u8)2, > > 4026534999 0 eth0 10.244.3.124:35473->10.244.2.158:53 security_xfrm_decode_session > > .active_extensions = (__u8)2, > > 4026534999 0 eth0 10.244.3.124:35473->10.244.2.158:53 kfree_skb_reason(SKB_DROP_REASON_XFRM_POLICY) > > .active_extensions = (__u8)2, > > > > In this case, there are no XFRM policies in the container's network > > namespace so the drop is unexpected. When we decrypt the IPsec packet, > > the XFRM state used for decryption is set in the skb extensions. This > > information is preserved across the netns switch. When we reach the > > XFRM policy check in the container's netns, __xfrm_policy_check drops > > the packet with LINUX_MIB_XFRMINNOPOLS because a (container-side) XFRM > > policy can't be found that matches the (host-side) XFRM state used for > > decryption. > > > > This patch fixes this by scrubbing the packet when using > > bpf_redirect_peer, as is done on typical netns switches via veth > > devices. > > > > Fixes: 9aa1206e8f482 ("bpf: Add redirect_peer helper") > > Signed-off-by: Paul Chaignon <paul.chaignon@xxxxxxxxx> > > --- > > net/core/filter.c | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/net/core/filter.c b/net/core/filter.c > > index 79cab4d78dc3..12b6b8dbeb51 100644 > > --- a/net/core/filter.c > > +++ b/net/core/filter.c > > @@ -2509,6 +2509,7 @@ int skb_do_redirect(struct sk_buff *skb) > > goto out_drop; > > skb->dev = dev; > > dev_sw_netstats_rx_add(dev, skb->len); > > + skb_scrub_packet(skb, true); > > This should be set to false, and defer the clearing of mark/tstamp > iff needed to the BPF program in the host given this has been out > for quite some time to avoid breakage if someone actively has been > using BPF like that to transfer this information to apps in the > target namespace. Thanks for the review! I've sent a v2 with the suggested change and added a commit to explain in the helper description that those two fields are not cleared. > > > return -EAGAIN; > > } > > return flags & BPF_F_NEIGH ? >