On Wed, May 28, 2025 at 7:23 PM Florian Westphal <fw@xxxxxxxxx> wrote:
>
> Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> > Our kernel is 6.1.y (also reproduced on 6.14)
> >
> > Host Network Configuration:
> > ---------------------------
> >
> > We run a DNS proxy on our Kubernetes servers with the following iptables rules:
> >
> > -A PREROUTING -d 169.254.1.2/32 -j DNS-DNAT
> > -A DNS-DNAT -d 169.254.1.2/32 -i eth0 -j RETURN
> > -A DNS-DNAT -d 169.254.1.2/32 -i eth1 -j RETURN
> > -A DNS-DNAT -d 169.254.1.2/32 -i bond0 -j RETURN
> > -A DNS-DNAT -j DNAT --to-destination 127.0.0.1
> > -A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
> > -A POSTROUTING -j KUBE-POSTROUTING
> > -A KUBE-POSTROUTING -m mark --mark 0x4000/0x4000 -j MASQUERADE
> >
> > Container Network Configuration:
> > --------------------------------
> >
> > Containers use 169.254.1.2 as their DNS resolver:
> >
> > $ cat /etc/resolv.conf
> > nameserver 169.254.1.2
> >
> > Issue Description
> > -----------------
> >
> > When performing DNS lookups from a container, the query fails with an
> > unexpected source port:
> >
> > $ dig +short @169.254.1.2 A www.google.com
> > ;; reply from unexpected source: 169.254.1.2#123, expected 169.254.1.2#53
> >
> > The tcpdump output is as follows:
> >
> > 16:47:23.441705 veth9cffd2a4 P IP 10.242.249.78.37562 > 169.254.1.2.53: 298+ [1au] A? www.google.com. (55)
> > 16:47:23.441705 bridge0 In IP 10.242.249.78.37562 > 127.0.0.1.53: 298+ [1au] A? www.google.com. (55)
> > 16:47:23.441856 bridge0 Out IP 169.254.1.2.53 > 10.242.249.78.37562: 298 1/0/1 A 142.250.71.228 (59)
> > 16:47:23.441863 bond0 Out IP 169.254.1.2.53 > 10.242.249.78.37562: 298 1/0/1 A 142.250.71.228 (59)
> > 16:47:23.441867 eth1 Out IP 169.254.1.2.53 > 10.242.249.78.37562: 298 1/0/1 A 142.250.71.228 (59)
> > 16:47:23.441885 eth1 P IP 169.254.1.2.53 > 10.242.249.78.37562: 298 1/0/1 A 142.250.71.228 (59)
> > 16:47:23.441885 bond0 P IP 169.254.1.2.53 > 10.242.249.78.37562: 298 1/0/1 A 142.250.71.228 (59)
> > 16:47:23.441916 veth9cffd2a4 Out IP 169.254.1.2.124 > 10.242.249.78.37562: UDP, length 59
> >
> > The DNS response port is unexpectedly changed from 53 to 124, so the
> > application never receives the response.
> >
> > We suspected the issue might be related to commit d8f84a9bc7c4
> > ("netfilter: nf_nat: don't try nat source port reallocation for
> > reverse dir clash"). After applying this commit, the port remapping no
> > longer occurs, but the DNS response is still dropped.
>
> That's suspicious, I don't see how this is related.  d8f84a9bc7c4
> deals with independent action, i.e. A sends to B and B sends to A,
> but *at the same time*.
>
> With a request-response protocol like DNS this should obviously never
> happen -- B can't reply before A's request has passed through the stack.

Correct, these operations cannot occur simultaneously. However, after
applying this commit, port reallocation no longer occurs.

> > The response is now correctly sent to port 53, but it is dropped in
> > __nf_conntrack_confirm().
> >
> > We bypassed the issue by modifying __nf_conntrack_confirm() to skip
> > the conflicting conntrack entry check:
> >
> > diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> > index 7bee5bd22be2..3481e9d333b0 100644
> > --- a/net/netfilter/nf_conntrack_core.c
> > +++ b/net/netfilter/nf_conntrack_core.c
> > @@ -1245,9 +1245,9 @@ __nf_conntrack_confirm(struct sk_buff *skb)
> >
> >         chainlen = 0;
> >         hlist_nulls_for_each_entry(h, n, &nf_conntrack_hash[reply_hash], hnnode) {
> > -               if (nf_ct_key_equal(h, &ct->tuplehash[IP_CT_DIR_REPLY].tuple,
> > -                                   zone, net))
> > -                       goto out;
> > +               //if (nf_ct_key_equal(h, &ct->tuplehash[IP_CT_DIR_REPLY].tuple,
> > +               //                    zone, net))
> > +               //        goto out;
> >                 if (chainlen++ > max_chainlen) {
> > chaintoolong:
> >                         NF_CT_STAT_INC(net, chaintoolong);
>
> I don't understand this bit either.  For A/AAAA requests racing in the
> same direction, the nf_ct_resolve_clash() machinery should have handled
> this situation.
>
> And I don't see how you can encounter a DNS reply before at least one
> request has been committed to the table -- i.e., the conntrack being
> confirmed here should not exist -- the packet should have been picked up
> as a reply packet.

We've been able to consistently reproduce this behavior. Would you have
any recommended debugging approaches we could try?

--
Regards
Yafang
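P.S. In case it helps when suggesting where to look: a minimal sketch of
the instrumentation we could run on the host while reproducing, assuming
conntrack-tools and the iptables TRACE target are available (the
addresses and ports below are taken from the capture above):

# watch the insert_failed / drop counters while reproducing
$ conntrack -S

# follow conntrack events for the DNS flow
$ conntrack -E -p udp --dport 53

# trace the proxy's reply through the ruleset; it is locally generated
# (the proxy answers from 127.0.0.1:53 after the DNAT), so it passes
# raw OUTPUT. With the legacy backend the trace goes to the kernel log;
# with iptables-nft it can be read via "xtables-monitor --trace".
$ iptables -t raw -A OUTPUT -p udp --sport 53 -d 10.242.249.78 -j TRACE

If insert_failed and drop increment each time a reply is lost, that
would at least be consistent with the reply-direction clash check
quoted above being the drop point.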