Re: [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets

On 02/07/2025 18.05, Stanislav Fomichev wrote:
On 07/02, Jesper Dangaard Brouer wrote:
This patch series introduces a mechanism for an XDP program to store RX
metadata hints - specifically rx_hash, rx_vlan_tag, and rx_timestamp -
into the xdp_frame. These stored hints are then used to populate the
corresponding fields in the SKB that is created from the xdp_frame
following an XDP_REDIRECT.

The chosen RX metadata hints intentionally map to the existing NIC
hardware metadata that can be read via kfuncs [1]. While this design
allows a BPF program to read and propagate existing hardware hints, our
primary motivation is to enable setting custom values. This is important
for use cases where the hardware-provided information is insufficient or
needs to be calculated based on packet contents unavailable to the
hardware.

The primary motivation for this feature is to enable scalable load
balancing of encapsulated tunnel traffic at the XDP layer. When tunnelled
packets (e.g., IPsec, GRE) are redirected via cpumap or to a veth device,
the networking stack later calculates a software hash based on the outer
headers. For a single tunnel, these outer headers are often identical,
causing all packets to be assigned the same hash. This collapses all
traffic onto a single RX queue, creating a performance bottleneck and
defeating receive-side scaling (RSS).

Our immediate use case involves load balancing IPsec traffic. For such
tunnelled traffic, any hardware-provided RX hash is calculated on the
outer headers and is therefore incorrect for distributing inner flows.
There is no reason to read the existing value, as it must be recalculated.
In our XDP program, we perform a partial decryption to access the inner
headers and calculate a new load-balancing hash, which provides better
flow distribution. However, without this patch set, there is no way to
persist this new hash for the network stack to use post-redirect.

This series solves the problem by introducing new BPF kfuncs that allow an
XDP program to write e.g. the hash value into the xdp_frame. The
__xdp_build_skb_from_frame() function is modified to use this stored value
to set skb->hash on the newly created SKB. As a result, the veth driver's
queue selection logic uses the BPF-supplied hash, achieving proper
traffic distribution across multiple CPU cores. This also ensures that
consumers, like the GRO engine, can operate effectively.

We considered XDP traits as an alternative to adding static members to
struct xdp_frame. Given the immediate need for this functionality and the
current development status of traits, we believe this approach is a
pragmatic solution. We are open to migrating to a traits-based
implementation if and when they become a generally accepted mechanism for
such extensions.

[1] https://docs.kernel.org/networking/xdp-rx-metadata.html
---
V1: https://lore.kernel.org/all/174897271826.1677018.9096866882347745168.stgit@firesoul/

No change log?

We have fixed the selftest as requested by Alexei.
And we have updated the cover letter and documentation as you, Stanislav, requested.


Btw, any feedback on the following from v1?
- https://lore.kernel.org/netdev/aFHUd98juIU4Rr9J@mini-arch/

Addressed via the updated cover letter and documentation. I hope this helps reviewers understand the use case, as the discussion turned into "how do we transfer all HW metadata", which is NOT what we want (and a waste of precious cycles).

For our use case, it doesn't make sense to "transfer all HW metadata".
In fact, we don't even want to read the hardware RX-hash, because we already know it is wrong (for tunnels); we just want to override the RX-hash used at SKB creation. We do want to give BPF programmers the flexibility to call these kfuncs individually (when relevant).
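
To make that concrete, below is a minimal sketch of what the XDP-program side could look like. The kfunc name and signature, the map setup and the hash value are illustrative assumptions for this sketch (not copied from the patches); the point is only that the program computes its own inner-flow hash, stores it, and then redirects.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* Illustrative kfunc declaration -- name and signature are assumptions
 * for this sketch, not taken from the patches: store a BPF-supplied
 * RX hash in the xdp_frame so the SKB built after XDP_REDIRECT gets
 * skb->hash from it.
 */
extern int bpf_xdp_store_rx_hash(struct xdp_md *ctx, __u32 hash,
				 enum xdp_rss_hash_type rss_type) __ksym;

/* cpumap used as the redirect target in this example */
struct {
	__uint(type, BPF_MAP_TYPE_CPUMAP);
	__uint(max_entries, 8);
	__type(key, __u32);
	__type(value, struct bpf_cpumap_val);
} cpu_map SEC(".maps");

SEC("xdp")
int xdp_store_inner_hash(struct xdp_md *ctx)
{
	__u32 hash;

	/* ... parse the (partially decrypted) inner headers and compute
	 * a load-balancing hash over the inner flow here ...
	 */
	hash = 0x12345678;	/* placeholder for the computed inner-flow hash */

	/* Persist the hash for the SKB created after the redirect */
	bpf_xdp_store_rx_hash(ctx, hash, XDP_RSS_TYPE_L4_IPV4_TCP);

	/* Spread inner flows across CPUs via cpumap */
	return bpf_redirect_map(&cpu_map, hash & 7, 0);
}

char _license[] SEC("license") = "GPL";

After the redirect, __xdp_build_skb_from_frame() would pick up the stored value when building the SKB, so veth/cpumap queue selection and GRO see the BPF-supplied hash instead of an outer-header hash.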

- https://lore.kernel.org/netdev/20250616145523.63bd2577@xxxxxxxxxx/

I feel pressured into critiquing Jakub's suggestion; I hope this is not too harsh. First of all, it is not relevant to this patchset's use case, as it focuses on all HW metadata.

Second, I disagree with the idea/mental model of storing the metadata in a "driver-specific format". The current driver-specific kfunc helpers that "get the metadata" already perform a conversion to a common format, because the BPF programmer naturally needs the result to look the same across drivers. Thus, it doesn't make sense to store it back in a "driver-specific format", as that just complicates things. My mental model is therefore that, after the driver-specific "get" operation, the result is in a common format that is simply defined by the struct type of the kfunc, which is known by both the kernel and the BPF program.
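
To illustrate that mental model with a concrete (purely hypothetical) layout: the stored form could simply be a small kernel-side struct that mirrors what the existing "get" kfuncs already hand to the BPF program. Field names and sizes below are assumptions for this sketch, not the layout from the patches.

#include <linux/types.h>

/* Sketch only: a possible common-format area carried in the xdp_frame.
 * Both the kernel (when building the SKB) and the BPF program (via the
 * kfunc struct types) agree on this one representation, independent of
 * the originating driver.
 */
struct xdp_rx_meta {
	__u32	rx_hash;	/* value destined for skb->hash */
	__u32	rx_hash_type;	/* enum xdp_rss_hash_type */
	__be16	rx_vlan_proto;	/* ETH_P_8021Q / ETH_P_8021AD */
	__u16	rx_vlan_tci;	/* VLAN tag control information */
	__u64	rx_timestamp;	/* RX timestamp in nanoseconds */
	__u32	stored_mask;	/* which of the fields above are valid */
};

With something like this, __xdp_build_skb_from_frame() would only need to check which fields were stored and copy them into the SKB; no per-driver decoding would be needed at SKB-build time.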

--Jesper



