Hello,

We recently encountered an SNAT-related issue in our Kubernetes environment and have successfully reproduced it with the following configuration.

Kernel
------
Our kernel is 6.1.y (the issue is also reproduced on 6.14).

Host Network Configuration
--------------------------
We run a DNS proxy on our Kubernetes servers with the following iptables rules (nat table):

-A PREROUTING -d 169.254.1.2/32 -j DNS-DNAT
-A DNS-DNAT -d 169.254.1.2/32 -i eth0 -j RETURN
-A DNS-DNAT -d 169.254.1.2/32 -i eth1 -j RETURN
-A DNS-DNAT -d 169.254.1.2/32 -i bond0 -j RETURN
-A DNS-DNAT -j DNAT --to-destination 127.0.0.1
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
-A POSTROUTING -j KUBE-POSTROUTING
-A KUBE-POSTROUTING -m mark --mark 0x4000/0x4000 -j MASQUERADE

Container Network Configuration
-------------------------------
Containers use 169.254.1.2 as their DNS resolver:

$ cat /etc/resolv.conf
nameserver 169.254.1.2

Issue Description
-----------------
When performing a DNS lookup from a container, the query fails because the response arrives from an unexpected source port:

$ dig +short @169.254.1.2 A www.google.com
;; reply from unexpected source: 169.254.1.2#123, expected 169.254.1.2#53

The corresponding tcpdump output is as follows:

16:47:23.441705 veth9cffd2a4 P IP 10.242.249.78.37562 > 169.254.1.2.53: 298+ [1au] A? www.google.com. (55)
16:47:23.441705 bridge0 In IP 10.242.249.78.37562 > 127.0.0.1.53: 298+ [1au] A? www.google.com. (55)
16:47:23.441856 bridge0 Out IP 169.254.1.2.53 > 10.242.249.78.37562: 298 1/0/1 A 142.250.71.228 (59)
16:47:23.441863 bond0 Out IP 169.254.1.2.53 > 10.242.249.78.37562: 298 1/0/1 A 142.250.71.228 (59)
16:47:23.441867 eth1 Out IP 169.254.1.2.53 > 10.242.249.78.37562: 298 1/0/1 A 142.250.71.228 (59)
16:47:23.441885 eth1 P IP 169.254.1.2.53 > 10.242.249.78.37562: 298 1/0/1 A 142.250.71.228 (59)
16:47:23.441885 bond0 P IP 169.254.1.2.53 > 10.242.249.78.37562: 298 1/0/1 A 142.250.71.228 (59)
16:47:23.441916 veth9cffd2a4 Out IP 169.254.1.2.124 > 10.242.249.78.37562: UDP, length 59

The source port of the DNS response is unexpectedly rewritten from 53 on its way back into the container (to 124 in this capture, 123 in the dig output above; the port differs between runs), so the application never receives the response.

We suspected the issue might be related to commit d8f84a9bc7c4 ("netfilter: nf_nat: don't try nat source port reallocation for reverse dir clash"). After applying that commit, the port remapping no longer occurs, but the DNS response is still dropped:

16:52:00.968814 veth9cffd2a4 P IP 10.242.249.78.54482 > 169.254.1.2.53: 15035+ [1au] A? www.google.com. (55)
16:52:00.968814 bridge0 In IP 10.242.249.78.54482 > 127.0.0.1.53: 15035+ [1au] A? www.google.com. (55)
16:52:00.996661 bridge0 Out IP 169.254.1.2.53 > 10.242.249.78.54482: 15035 1/0/1 A 142.250.198.100 (59)
16:52:00.996664 bond0 Out IP 169.254.1.2.53 > 10.242.249.78.54482: 15035 1/0/1 A 142.250.198.100 (59)
16:52:00.996665 eth0 Out IP 169.254.1.2.53 > 10.242.249.78.54482: 15035 1/0/1 A 142.250.198.100 (59)
16:52:00.996682 eth0 P IP 169.254.1.2.53 > 10.242.249.78.54482: 15035 1/0/1 A 142.250.198.100 (59)
16:52:00.996682 bond0 P IP 169.254.1.2.53 > 10.242.249.78.54482: 15035 1/0/1 A 142.250.198.100 (59)

The response now keeps source port 53, but it is dropped in __nf_conntrack_confirm() and never reaches the veth.
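In case it helps with reproducing or narrowing this down: our understanding is that the conntrack entry created for the response fails to be confirmed, and if that is right the drop should also be observable from the host with conntrack-tools. This is only a debugging sketch (it assumes conntrack-tools is installed, and the exact counters that increase are our assumption):

$ conntrack -S                      # watch the insert_failed and drop counters while repeating the dig query
$ conntrack -E -p udp --dport 53    # optionally, watch conntrack events for the DNS flow while reproducing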
We worked around the issue by modifying __nf_conntrack_confirm() to skip the conflicting conntrack entry check:

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 7bee5bd22be2..3481e9d333b0 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1245,9 +1245,9 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 
 	chainlen = 0;
 	hlist_nulls_for_each_entry(h, n, &nf_conntrack_hash[reply_hash], hnnode) {
-		if (nf_ct_key_equal(h, &ct->tuplehash[IP_CT_DIR_REPLY].tuple,
-				    zone, net))
-			goto out;
+		//if (nf_ct_key_equal(h, &ct->tuplehash[IP_CT_DIR_REPLY].tuple,
+		//		    zone, net))
+		//	goto out;
 		if (chainlen++ > max_chainlen) {
 chaintoolong:
 			NF_CT_STAT_INC(net, chaintoolong);

With this change, DNS resolution works as expected:

$ dig +short @169.254.1.2 A www.google.com
142.250.198.100

The corresponding tcpdump output is as follows:

16:54:43.618509 veth9cffd2a4 P IP 10.242.249.78.56805 > 169.254.1.2.53: 7503+ [1au] A? www.google.com. (55)
16:54:43.618509 bridge0 In IP 10.242.249.78.56805 > 127.0.0.1.53: 7503+ [1au] A? www.google.com. (55)
16:54:43.618666 bridge0 Out IP 169.254.1.2.53 > 10.242.249.78.56805: 7503 1/0/1 A 142.250.198.100 (59)
16:54:43.618677 bond0 Out IP 169.254.1.2.53 > 10.242.249.78.56805: 7503 1/0/1 A 142.250.198.100 (59)
16:54:43.618683 eth1 Out IP 169.254.1.2.53 > 10.242.249.78.56805: 7503 1/0/1 A 142.250.198.100 (59)
16:54:43.618700 eth1 P IP 169.254.1.2.53 > 10.242.249.78.56805: 7503 1/0/1 A 142.250.198.100 (59)
16:54:43.618700 bond0 P IP 169.254.1.2.53 > 10.242.249.78.56805: 7503 1/0/1 A 142.250.198.100 (59)
16:54:43.618765 veth9cffd2a4 Out IP 169.254.1.2.53 > 10.242.249.78.56805: 7503 1/0/1 A 142.250.198.100 (59)

The issue remains present on kernel 6.14 as well. Since we are not deeply familiar with NAT behavior, we would appreciate guidance on a proper fix, or on how to debug this further.

-- 
Regards
Yafang