On 20/06/2025 18:20, Nicolas Dichtel wrote:
>>>> It is possible, and very useful, to implement "two-stage routing" by
>>>> installing a route that points to a VRF device:
>>>>
>>>>     ip link add vrfNNN type vrf table NNN
>>>>     ...
>>>>     ip route add xxxxx/yy dev vrfNNN
>>>>
>>>> however this causes surprising behaviour with relation to netfilter
>>>> hooks. Namely, packets taking such path traverse _output_ nftables
>>>> chain, with conntracking information reset. So, for example, even
>>>> when "notrack" has been set in the prerouting chain, conntrack entries
>>>> will still be created. Script attached below demonstrates this behaviour.
>>> You can have a look to this commit to better understand this:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8c9c296adfae9
>>
>> I've seen this commit.
>> My point is that the packets are _not locally generated_ in this case,
>> so it seems wrong to pass them to the _output_ hook, doesn't it?
> They are, from the POV of the vrf. The first route sends packets to the vrf
> device, which acts like a loopback.

I see, this explains the behaviour that I observe.
I believe that there are two problems here though:

1. This behaviour is _surprising_. Packets are not really "locally
   generated": they come from "outside", but are treated as if they
   were locally generated. In my view, it deserves a section in
   Documentation/networking/vrf.rst (see suggestion below).

2. Using the "output" hook makes it impossible(?) to define different
   nftables rules depending on which VRF was used for routing (because
   iif is not accessible in the "output" chain). For example, traffic
   from different tenants that is routed via different VRFs but
   egresses over the same uplink interface cannot be assigned different
   zones, so conntrack entries of different tenants will be mixed. As
   another example, one cannot disable conntracking of tenants' traffic
   while continuing to track "true output" traffic from the processes
   running on the host.
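
For illustration, the hook traversal discussed above can be made visible with counters. The following is only a minimal sketch and is not the script attached to the original report. The table name "vrfdemo", the chain names and the helper print_ruleset are made up for this example; the priorities assume the usual ordering, with raw (-300) running before conntrack (-200).

```shell
#!/bin/sh
# Sketch only: print an nftables ruleset with counters on the hooks
# discussed in this thread. All names here are illustrative assumptions.

print_ruleset() {
    cat <<'EOF'
table inet vrfdemo {
    chain pre {
        # raw priority, so that notrack is applied before conntrack runs
        type filter hook prerouting priority -300; policy accept;
        notrack
    }
    chain fwd {
        type filter hook forward priority 0; policy accept;
        counter comment "seen by forward hook"
    }
    chain out {
        type filter hook output priority 0; policy accept;
        ct state untracked counter comment "output hook, untracked"
        ct state new counter comment "output hook, new conntrack entry"
    }
}
EOF
}

# Usage (as root, on a test machine or in a network namespace):
#   print_ruleset | nft -f -
#   ... send traffic through the route that points to the VRF device ...
#   nft list table inet vrfdemo
```

Comparing the counters after sending traffic shows which hooks saw the packets and in which conntrack state they arrived, without relying on log output.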

Thanks for consideration,

Eugene

========================
Suggested update to the documentation:

diff --git a/Documentation/networking/vrf.rst b/Documentation/networking/vrf.rst
index 0a9a6f968cb9..74c6a69355df 100644
--- a/Documentation/networking/vrf.rst
+++ b/Documentation/networking/vrf.rst
@@ -61,6 +61,11 @@ domain as a whole.
        the VRF device. For egress POSTROUTING and OUTPUT rules can be written
        using either the VRF device or real egress device.
 
+.. [3] When a packet is forwarded to a VRF interface, it gets further
+       routed according to the route table associated with the VRF, but
+       processed by the "output" netfilter hook instead of "forwarding"
+       hook.
+
 Setup
 -----
 1. VRF device is created with an association to a FIB table.