Re: [BUG REPORT] netfilter: DNS/SNAT Issue in Kubernetes Environment

On Wed, May 28, 2025 at 7:23 PM Florian Westphal <fw@xxxxxxxxx> wrote:
>
> Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> > Our kernel is 6.1.y (also reproduced on 6.14)
> >
> > Host Network Configuration:
> > --------------------------------------
> >
> > We run a DNS proxy on our Kubernetes servers with the following iptables rules:
> >
> > -A PREROUTING -d 169.254.1.2/32 -j DNS-DNAT
> > -A DNS-DNAT -d 169.254.1.2/32 -i eth0 -j RETURN
> > -A DNS-DNAT -d 169.254.1.2/32 -i eth1 -j RETURN
> > -A DNS-DNAT -d 169.254.1.2/32 -i bond0 -j RETURN
> > -A DNS-DNAT -j DNAT --to-destination 127.0.0.1
> > -A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
> > -A POSTROUTING -j KUBE-POSTROUTING
> > -A KUBE-POSTROUTING -m mark --mark 0x4000/0x4000 -j MASQUERADE
> >
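(For reference, assuming conntrack-tools is installed on the host, the
NAT mapping that these rules create for a query can be inspected with:

# conntrack -L -p udp --dport 53

which lists the UDP conntrack entries whose original destination port
is 53, including the translated reply tuple.)
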
> > Container Network Configuration:
> > --------------------------------------------
> > Containers use 169.254.1.2 as their DNS resolver:
> >
> > $ cat /etc/resolv.conf
> > nameserver 169.254.1.2
> >
> > Issue Description
> > ------------------------
> >
> > When performing DNS lookups from a container, the query fails with an
> > unexpected source port:
> >
> > $ dig +short @169.254.1.2 A www.google.com
> > ;; reply from unexpected source: 169.254.1.2#123, expected 169.254.1.2#53
> >
> > The tcpdump output is as follows:
> >
> > 16:47:23.441705 veth9cffd2a4 P   IP 10.242.249.78.37562 > 169.254.1.2.53: 298+ [1au] A? www.google.com. (55)
> > 16:47:23.441705 bridge0 In  IP 10.242.249.78.37562 > 127.0.0.1.53: 298+ [1au] A? www.google.com. (55)
> > 16:47:23.441856 bridge0 Out IP 169.254.1.2.53 > 10.242.249.78.37562: 298 1/0/1 A 142.250.71.228 (59)
> > 16:47:23.441863 bond0 Out IP 169.254.1.2.53 > 10.242.249.78.37562: 298 1/0/1 A 142.250.71.228 (59)
> > 16:47:23.441867 eth1  Out IP 169.254.1.2.53 > 10.242.249.78.37562: 298 1/0/1 A 142.250.71.228 (59)
> > 16:47:23.441885 eth1  P   IP 169.254.1.2.53 > 10.242.249.78.37562: 298 1/0/1 A 142.250.71.228 (59)
> > 16:47:23.441885 bond0 P   IP 169.254.1.2.53 > 10.242.249.78.37562: 298 1/0/1 A 142.250.71.228 (59)
> > 16:47:23.441916 veth9cffd2a4 Out IP 169.254.1.2.124 > 10.242.249.78.37562: UDP, length 59
> >
> > The DNS response source port is unexpectedly changed from 53 to 124,
> > so the application never receives the response.
> >
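(For anyone trying to reproduce this, watching conntrack events while
running the dig command makes the port rewrite easier to spot, assuming
conntrack-tools is available:

# conntrack -E -p udp --dport 53

This prints NEW/UPDATE/DESTROY events for UDP flows whose original
destination port is 53 as they happen.)
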
> > We suspected the issue might be related to commit d8f84a9bc7c4
> > ("netfilter: nf_nat: don't try nat source port reallocation for
> > reverse dir clash"). After applying this commit, the port remapping no
> > longer occurs, but the DNS response is still dropped.
>
> That's suspicious; I don't see how this is related.  d8f84a9bc7c4
> deals with independent actions, i.e.
> A sends to B and B sends to A, but *at the same time*.
>
> With a request-response protocol like DNS this should obviously never
> happen -- B can't reply before A's request has passed through the stack.

Correct, these operations cannot occur simultaneously. However, after
applying that commit, port reallocation no longer occurs.

>
> > The response is now correctly sent to port 53, but it is dropped in
> > __nf_conntrack_confirm().
> >
> > We bypassed the issue by modifying __nf_conntrack_confirm() to skip
> > the conflicting conntrack entry check:
> >
> > diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> > index 7bee5bd22be2..3481e9d333b0 100644
> > --- a/net/netfilter/nf_conntrack_core.c
> > +++ b/net/netfilter/nf_conntrack_core.c
> > @@ -1245,9 +1245,9 @@ __nf_conntrack_confirm(struct sk_buff *skb)
> >
> >         chainlen = 0;
> >         hlist_nulls_for_each_entry(h, n, &nf_conntrack_hash[reply_hash], hnnode) {
> > -               if (nf_ct_key_equal(h, &ct->tuplehash[IP_CT_DIR_REPLY].tuple,
> > -                                   zone, net))
> > -                       goto out;
> > +               //if (nf_ct_key_equal(h, &ct->tuplehash[IP_CT_DIR_REPLY].tuple,
> > +               //                  zone, net))
> > +               //      goto out;
> >                 if (chainlen++ > max_chainlen) {
> >  chaintoolong:
> >                         NF_CT_STAT_INC(net, chaintoolong);
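
(If it helps, when the reply hits this check the drop should also show
up as insert_failed/drop increments in the conntrack statistics, which
can be watched while reproducing, assuming conntrack-tools:

# conntrack -S

This prints the per-CPU conntrack counters, including insert_failed
and drop.)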
>
> I don't understand this bit either.  For A/AAAA requests racing in the
> same direction, the nf_ct_resolve_clash() machinery should have handled
> this situation.
>
> And I don't see how you can encounter a DNS reply before at least one
> request has been committed to the table -- i.e., the conntrack being
> confirmed here should not exist -- the packet should have been picked up
> as a reply packet.

We've been able to consistently reproduce this behavior. Would you
have any recommended debugging approaches we could try?
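
Would counting skb drop reasons while running the dig command be a
reasonable starting point? A rough sketch of what we had in mind,
assuming bpftrace is available on the host (the numeric reasons can be
matched against enum skb_drop_reason in include/net/dropreason.h):

# bpftrace -e 'tracepoint:skb:kfree_skb { @reason[args->reason] = count(); }'

This counts dropped skbs by drop reason while it runs and prints the
map on Ctrl-C.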

-- 
Regards
Yafang
