Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> Our kernel is 6.1.y (also reproduced on 6.14).
>
> Host Network Configuration:
> ---------------------------
>
> We run a DNS proxy on our Kubernetes servers with the following
> iptables rules:
>
> -A PREROUTING -d 169.254.1.2/32 -j DNS-DNAT
> -A DNS-DNAT -d 169.254.1.2/32 -i eth0 -j RETURN
> -A DNS-DNAT -d 169.254.1.2/32 -i eth1 -j RETURN
> -A DNS-DNAT -d 169.254.1.2/32 -i bond0 -j RETURN
> -A DNS-DNAT -j DNAT --to-destination 127.0.0.1
> -A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
> -A POSTROUTING -j KUBE-POSTROUTING
> -A KUBE-POSTROUTING -m mark --mark 0x4000/0x4000 -j MASQUERADE
>
> Container Network Configuration:
> --------------------------------
>
> Containers use 169.254.1.2 as their DNS resolver:
>
> $ cat /etc/resolv.conf
> nameserver 169.254.1.2
>
> Issue Description
> -----------------
>
> When performing DNS lookups from a container, the query fails with an
> unexpected source port:
>
> $ dig +short @169.254.1.2 A www.google.com
> ;; reply from unexpected source: 169.254.1.2#123, expected 169.254.1.2#53
>
> The tcpdump output is as follows:
>
> 16:47:23.441705 veth9cffd2a4 P IP 10.242.249.78.37562 > 169.254.1.2.53: 298+ [1au] A? www.google.com. (55)
> 16:47:23.441705 bridge0 In IP 10.242.249.78.37562 > 127.0.0.1.53: 298+ [1au] A? www.google.com. (55)
> 16:47:23.441856 bridge0 Out IP 169.254.1.2.53 > 10.242.249.78.37562: 298 1/0/1 A 142.250.71.228 (59)
> 16:47:23.441863 bond0 Out IP 169.254.1.2.53 > 10.242.249.78.37562: 298 1/0/1 A 142.250.71.228 (59)
> 16:47:23.441867 eth1 Out IP 169.254.1.2.53 > 10.242.249.78.37562: 298 1/0/1 A 142.250.71.228 (59)
> 16:47:23.441885 eth1 P IP 169.254.1.2.53 > 10.242.249.78.37562: 298 1/0/1 A 142.250.71.228 (59)
> 16:47:23.441885 bond0 P IP 169.254.1.2.53 > 10.242.249.78.37562: 298 1/0/1 A 142.250.71.228 (59)
> 16:47:23.441916 veth9cffd2a4 Out IP 169.254.1.2.124 > 10.242.249.78.37562: UDP, length 59
>
> The DNS response's source port is unexpectedly changed from 53 to 124,
> so the application can't receive the response.
>
> We suspected the issue might be related to commit d8f84a9bc7c4
> ("netfilter: nf_nat: don't try nat source port reallocation for
> reverse dir clash"). After applying this commit, the port remapping no
> longer occurs, but the DNS response is still dropped.

That's suspicious -- I don't see how this is related.

d8f84a9bc7c4 deals with independent action, i.e. A sends to B and B
sends to A, but *at the same time*. With a request-response protocol
like DNS this should obviously never happen -- B can't reply before A's
request has passed through the stack.
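To spell out the pattern that commit addresses, here is a minimal
userspace sketch of it: two peers that each transmit first,
simultaneously, so each packet races to create a conntrack entry whose
reply tuple collides with the other packet's original tuple. This is
only an illustration of the traffic pattern (it ignores the NAT rules
that make the clash interesting); the loopback address, the ports
40000/40001 and the one-byte payload are made up, not taken from your
setup.

/* Both threads send before either side has received anything,
 * i.e. neither direction is a "reply" to the other.
 * Build with: cc -pthread
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

struct peer {
	int fd;
	unsigned short dport;
};

static int mksock(unsigned short port)
{
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(port),
		.sin_addr.s_addr = htonl(INADDR_LOOPBACK),
	};
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	bind(fd, (struct sockaddr *)&addr, sizeof(addr));
	return fd;
}

static void *xmit(void *arg)
{
	struct peer *p = arg;
	struct sockaddr_in dst = {
		.sin_family = AF_INET,
		.sin_port = htons(p->dport),
		.sin_addr.s_addr = htonl(INADDR_LOOPBACK),
	};

	sendto(p->fd, "x", 1, 0, (struct sockaddr *)&dst, sizeof(dst));
	return NULL;
}

int main(void)
{
	struct peer a = { mksock(40000), 40001 };	/* A: 40000 -> 40001 */
	struct peer b = { mksock(40001), 40000 };	/* B: 40001 -> 40000 */
	pthread_t t1, t2;

	pthread_create(&t1, NULL, xmit, &a);
	pthread_create(&t2, NULL, xmit, &b);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	close(a.fd);
	close(b.fd);
	return 0;
}

A's reply tuple (40001 -> 40000) is B's original tuple, so whichever
entry gets confirmed second clashes in the reverse direction. A DNS
client and server never produce this pattern.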
> The response is now correctly sent to port 53, but it is dropped in
> __nf_conntrack_confirm().
>
> We bypassed the issue by modifying __nf_conntrack_confirm() to skip
> the conflicting conntrack entry check:
>
> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> index 7bee5bd22be2..3481e9d333b0 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -1245,9 +1245,9 @@ __nf_conntrack_confirm(struct sk_buff *skb)
>
>  	chainlen = 0;
>  	hlist_nulls_for_each_entry(h, n, &nf_conntrack_hash[reply_hash], hnnode) {
> -		if (nf_ct_key_equal(h, &ct->tuplehash[IP_CT_DIR_REPLY].tuple,
> -				    zone, net))
> -			goto out;
> +		//if (nf_ct_key_equal(h, &ct->tuplehash[IP_CT_DIR_REPLY].tuple,
> +		//		    zone, net))
> +		//	goto out;
>  		if (chainlen++ > max_chainlen) {
>  chaintoolong:
>  			NF_CT_STAT_INC(net, chaintoolong);

I don't understand this bit either. For A/AAAA requests racing in the
same direction, the nf_ct_resolve_clash() machinery should have handled
this situation.

And I don't see how you can encounter a DNS reply before at least one
request has been committed to the table -- i.e., the conntrack being
confirmed here should not exist; the packet should have been picked up
as a reply packet.
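To illustrate what I mean by "racing in the same direction": a stub
resolver typically fires an A and an AAAA query in parallel from one
socket, hence one source port, so both datagrams carry the same 5-tuple
and can traverse conntrack before either unconfirmed entry is inserted
into the table. A minimal userspace sketch of that pattern follows; the
hand-rolled query bytes, the transaction ids and the reuse of your
169.254.1.2 resolver address are all illustrative, not taken from your
report.

/* Two threads send an A and an AAAA query for www.google.com from the
 * same UDP socket at the same time. Build with: cc -pthread
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int fd;	/* one shared socket: both queries use one source port */

/* "www.google.com" in DNS wire format; the trailing NUL of the C
 * string literal doubles as the root label terminator */
static const unsigned char qname[] = "\3www\6google\3com";

static void *query(void *arg)
{
	unsigned short qtype = *(unsigned short *)arg;	/* 1 = A, 28 = AAAA */
	unsigned char pkt[12 + sizeof(qname) + 4] = { 0 };
	struct sockaddr_in dns = {
		.sin_family = AF_INET,
		.sin_port = htons(53),
	};

	inet_pton(AF_INET, "169.254.1.2", &dns.sin_addr);

	pkt[1] = (unsigned char)qtype;	/* reuse qtype as transaction id */
	pkt[2] = 0x01;			/* flags: RD */
	pkt[5] = 0x01;			/* QDCOUNT = 1 */
	memcpy(pkt + 12, qname, sizeof(qname));
	pkt[12 + sizeof(qname) + 1] = (unsigned char)qtype;	/* QTYPE */
	pkt[12 + sizeof(qname) + 3] = 1;			/* QCLASS = IN */

	sendto(fd, pkt, sizeof(pkt), 0, (struct sockaddr *)&dns, sizeof(dns));
	return NULL;
}

int main(void)
{
	unsigned short a = 1, aaaa = 28;
	pthread_t t1, t2;

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	pthread_create(&t1, NULL, query, &a);
	pthread_create(&t2, NULL, query, &aaaa);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	close(fd);
	return 0;
}

Whichever of the two queries gets confirmed second is the clash, and
nf_ct_resolve_clash() is there to let it reuse the already-confirmed
entry instead of dropping it.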