Re: [PATCH net v2] net: clear the dst when changing skb protocol

Maciej Żenczykowski <maze@xxxxxxxxxx> · Sat, 7 Jun 2025 23:33:39 +0200

On Sat, Jun 7, 2025 at 10:47 PM Jakub Kicinski <kuba@xxxxxxxxxx> wrote:
>
> A not-so-careful NAT46 BPF program can crash the kernel
> if it indiscriminately flips ingress packets from v4 to v6:
>
>   BUG: kernel NULL pointer dereference, address: 0000000000000000
>     ip6_rcv_core (net/ipv6/ip6_input.c:190:20)
>     ipv6_rcv (net/ipv6/ip6_input.c:306:8)
>     process_backlog (net/core/dev.c:6186:4)
>     napi_poll (net/core/dev.c:6906:9)
>     net_rx_action (net/core/dev.c:7028:13)
>     do_softirq (kernel/softirq.c:462:3)
>     netif_rx (net/core/dev.c:5326:3)
>     dev_loopback_xmit (net/core/dev.c:4015:2)
>     ip_mc_finish_output (net/ipv4/ip_output.c:363:8)
>     NF_HOOK (./include/linux/netfilter.h:314:9)
>     ip_mc_output (net/ipv4/ip_output.c:400:5)
>     dst_output (./include/net/dst.h:459:9)
>     ip_local_out (net/ipv4/ip_output.c:130:9)
>     ip_send_skb (net/ipv4/ip_output.c:1496:8)
>     udp_send_skb (net/ipv4/udp.c:1040:8)
>     udp_sendmsg (net/ipv4/udp.c:1328:10)
>
> The output interface has a 4->6 program attached at ingress.
> We try to loop the multicast skb back to the sending socket.
> Ingress BPF runs as part of netif_rx(), pushes a valid v6 hdr
> and changes skb->protocol to v6. We enter ip6_rcv_core which
> tries to use skb_dst(). But the dst is still an IPv4 one left
> after IPv4 mcast output.
>
> Clear the dst in all BPF helpers which change the protocol.
> Also clear the dst if we did an encap or decap as those
> will most likely make the dst stale.
> Try to preserve metadata dsts, those may carry non-routing
> metadata.
>
> Reviewed-by: Maciej Żenczykowski <maze@xxxxxxxxxx>
> Acked-by: Daniel Borkmann <daniel@xxxxxxxxxxxxx>
> Fixes: d219df60a70e ("bpf: Add ipip6 and ip6ip decap support for bpf_skb_adjust_room()")
> Fixes: 1b00e0dfe7d0 ("bpf: update skb->protocol in bpf_skb_net_grow")
> Fixes: 6578171a7ff0 ("bpf: add bpf_skb_change_proto helper")
> Signed-off-by: Jakub Kicinski <kuba@xxxxxxxxxx>
> ---
> v2:
>  - drop on encap/decap
>  - fix typo (protcol)
>  - add the test to the Makefile
> v1: https://lore.kernel.org/20250604210604.257036-1-kuba@xxxxxxxxxx
>
> I wonder if we should not skip ingress (tc_skip_classify?)
> for looped back packets in the first place. But that doesn't
> seem robust enough vs multiple redirections to solve the crash.
>
> Ignoring LOOPBACK packets (like the NAT46 prog should) doesn't
> work either, since BPF can change pkt_type arbitrarily.
>
> CC: martin.lau@xxxxxxxxx
> CC: daniel@xxxxxxxxxxxxx
> CC: john.fastabend@xxxxxxxxx
> CC: eddyz87@xxxxxxxxx
> CC: sdf@xxxxxxxxxxx
> CC: haoluo@xxxxxxxxxx
> CC: willemb@xxxxxxxxxx
> CC: william.xuanziyang@xxxxxxxxxx
> CC: alan.maguire@xxxxxxxxxx
> CC: bpf@xxxxxxxxxxxxxxx
> CC: edumazet@xxxxxxxxxx
> CC: maze@xxxxxxxxxx
> CC: shuah@xxxxxxxxxx
> CC: linux-kselftest@xxxxxxxxxxxxxxx
> CC: yonghong.song@xxxxxxxxx
> ---
>  tools/testing/selftests/net/Makefile   |  1 +
>  net/core/filter.c                      | 31 +++++++++++++++++++-------
>  tools/testing/selftests/net/nat6to4.sh | 15 +++++++++++++
>  3 files changed, 39 insertions(+), 8 deletions(-)
>  create mode 100755 tools/testing/selftests/net/nat6to4.sh
>
> diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
> index ea84b88bcb30..ab996bd22a5f 100644
> --- a/tools/testing/selftests/net/Makefile
> +++ b/tools/testing/selftests/net/Makefile
> @@ -27,6 +27,7 @@ TEST_PROGS += amt.sh
>  TEST_PROGS += unicast_extensions.sh
>  TEST_PROGS += udpgro_fwd.sh
>  TEST_PROGS += udpgro_frglist.sh
> +TEST_PROGS += nat6to4.sh
>  TEST_PROGS += veth.sh
>  TEST_PROGS += ioam6.sh
>  TEST_PROGS += gro.sh
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 327ca73f9cd7..d5917d6446f2 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3406,8 +3406,14 @@ BPF_CALL_3(bpf_skb_change_proto, struct sk_buff *, skb, __be16, proto,
>          * need to be verified first.
>          */
>         ret = bpf_skb_proto_xlat(skb, proto);
> +       if (ret)
> +               return ret;
> +
>         bpf_compute_data_pointers(skb);
> -       return ret;
> +       if (skb_valid_dst(skb))
> +               skb_dst_drop(skb);
> +
> +       return 0;
>  }
>
>  static const struct bpf_func_proto bpf_skb_change_proto_proto = {
> @@ -3554,6 +3560,9 @@ static int bpf_skb_net_grow(struct sk_buff *skb, u32 off, u32 len_diff,
>                 else if (skb->protocol == htons(ETH_P_IPV6) &&
>                          flags & BPF_F_ADJ_ROOM_ENCAP_L3_IPV4)
>                         skb->protocol = htons(ETH_P_IP);
> +
> +               if (skb_valid_dst(skb))
> +                       skb_dst_drop(skb);
>         }
>
>         if (skb_is_gso(skb)) {
> @@ -3581,6 +3590,7 @@ static int bpf_skb_net_grow(struct sk_buff *skb, u32 off, u32 len_diff,
>  static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff,
>                               u64 flags)
>  {
> +       bool decap = flags & BPF_F_ADJ_ROOM_DECAP_L3_MASK;
>         int ret;
>
>         if (unlikely(flags & ~(BPF_F_ADJ_ROOM_FIXED_GSO |
> @@ -3603,13 +3613,18 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff,
>         if (unlikely(ret < 0))
>                 return ret;
>
> -       /* Match skb->protocol to new outer l3 protocol */
> -       if (skb->protocol == htons(ETH_P_IP) &&
> -           flags & BPF_F_ADJ_ROOM_DECAP_L3_IPV6)
> -               skb->protocol = htons(ETH_P_IPV6);
> -       else if (skb->protocol == htons(ETH_P_IPV6) &&
> -                flags & BPF_F_ADJ_ROOM_DECAP_L3_IPV4)
> -               skb->protocol = htons(ETH_P_IP);
> +       if (decap) {
> +               /* Match skb->protocol to new outer l3 protocol */
> +               if (skb->protocol == htons(ETH_P_IP) &&
> +                   flags & BPF_F_ADJ_ROOM_DECAP_L3_IPV6)
> +                       skb->protocol = htons(ETH_P_IPV6);
> +               else if (skb->protocol == htons(ETH_P_IPV6) &&
> +                        flags & BPF_F_ADJ_ROOM_DECAP_L3_IPV4)
> +                       skb->protocol = htons(ETH_P_IP);
> +
> +               if (skb_valid_dst(skb))
> +                       skb_dst_drop(skb);
> +       }
>
>         if (skb_is_gso(skb)) {
>                 struct skb_shared_info *shinfo = skb_shinfo(skb);
> diff --git a/tools/testing/selftests/net/nat6to4.sh b/tools/testing/selftests/net/nat6to4.sh
> new file mode 100755
> index 000000000000..0ee859b622a4
> --- /dev/null
> +++ b/tools/testing/selftests/net/nat6to4.sh
> @@ -0,0 +1,15 @@
> +#!/bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +
> +NS="ns-peer-$(mktemp -u XXXXXX)"
> +
> +ip netns add "${NS}"
> +ip -netns "${NS}" link set lo up
> +ip -netns "${NS}" route add default via 127.0.0.2 dev lo
> +
> +tc -n "${NS}" qdisc add dev lo ingress
> +tc -n "${NS}" filter add dev lo ingress prio 4 protocol ip \
> +   bpf object-file nat6to4.bpf.o section schedcls/egress4/snat4 direct-action
> +
> +ip netns exec "${NS}" \
> +   bash -c 'echo 012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789abc | socat - UDP4-DATAGRAM:224.1.0.1:6666,ip-multicast-loop=1'
> --
> 2.49.0

Submit away.

1 meta question: as this is a fix and will thus be backported into
5.4+ LTS, should this be split into two patches? Either making the
test a follow up, or even going with only the crash fix in patch 1 and
putting the 4-in-4 and 6-in-6 behavioural change in patch 2?  We'd end
up in the same state at tip of tree... but it would affect the LTS
backports.  Honestly I'm not even sure what's best.