On 29/07/2025 21.47, Martin KaFai Lau wrote:
> On 7/29/25 4:15 AM, Jesper Dangaard Brouer wrote:
>> That idea has been considered before, but it unfortunately doesn't work
>> from a performance angle. The performance model of XDP_REDIRECT into
>> CPUMAP relies on moving the expensive SKB allocation+init to a remote
>> CPU. This keeps the ingress CPU free to process packets at near line
>> rate (our DDoS use-case). If we allocate the SKB on the ingress-CPU
>> before the redirect, we destroy this load-balancing model and create the
>> exact bottleneck we designed CPUMAP to avoid.
>
> iirc, a xdp prog can be attached to a cpumap. The skb can be created by
> that xdp prog running on the remote cpu. It should be like a xdp prog
> returning a XDP_PASS + an optional skb. The xdp prog can set some fields
> in the skb. Other than setting fields in the skb, something else may be
> also possible in the future, e.g. look up sk, earlier demux ...etc.

I have strong reservations about having the BPF program itself trigger
the SKB allocation. I believe this would fundamentally break the
performance model that makes cpumap redirect so effective.

The key to XDP's high performance lies in processing xdp_frames in bulk,
in a tight loop, to amortize per-packet costs. The existing cpumap code
on the remote CPU is already highly optimized for this: it performs bulk
allocation of SKBs and uses careful prefetching to hide the memory
latency. Allowing a BPF program to sometimes trigger a heavyweight SKB
alloc+init (4 cache-line misses) would bypass all these existing
optimizations. It would introduce significant jitter into the pipeline
and disrupt the entire bulk-processing model we rely on for performance.
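
To make the bulk argument a bit more concrete, here is a rough userspace
sketch of the pattern I mean: one allocation call per batch, then a tight
loop that prefetches the next frame while handling the current one. This
is NOT the cpumap code; fake_frame, fake_skb, BULK_SIZE and both function
names are made up for illustration, and __builtin_prefetch assumes GCC or
Clang.

```c
/* Userspace sketch of the bulk pattern; NOT the kernel's cpumap code.
 * All names (fake_frame, fake_skb, BULK_SIZE, ...) are hypothetical. */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define BULK_SIZE 8

struct fake_frame { void *data; unsigned int len; };
struct fake_skb   { void *data; unsigned int len; };

/* One allocator call for the whole batch, so the allocation cost is
 * paid once per bulk instead of once per packet. The caller releases
 * the batch with free(skbs[0]). */
size_t bulk_alloc_skbs(struct fake_skb **skbs, size_t n)
{
	struct fake_skb *block = calloc(n, sizeof(*block));
	size_t i;

	if (!block)
		return 0;
	for (i = 0; i < n; i++)
		skbs[i] = &block[i];
	return n;
}

/* Tight loop over the batch; returns the number of skbs filled in. */
size_t process_bulk(struct fake_frame *frames, size_t n,
		    struct fake_skb **skbs)
{
	size_t i, got;

	assert(n > 0 && n <= BULK_SIZE);
	got = bulk_alloc_skbs(skbs, n);
	for (i = 0; i < got; i++) {
		/* Start pulling in the next frame's data while we
		 * are still working on the current one. */
		if (i + 1 < got)
			__builtin_prefetch(frames[i + 1].data);
		skbs[i]->data = frames[i].data;
		skbs[i]->len = frames[i].len;
	}
	return got;
}
```

A per-packet allocation triggered from inside the BPF program would sit
outside a loop like this, so it could not share the batched allocation
or the prefetch of the next frame.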

This performance is not just theoretical; we rely on it for DDoS
protection. For example, our plan is to use the XDP program on the
cpumap hook to run secondary DDoS mitigation rules that currently use
iptables (funnily enough, many of those rules are already BPF program
snippets today).

Architecturally, there is a clean separation today: the BPF program
makes a decision, and the highly-optimized cpumap or core kernel code
acts on it (build_skb, napi_gro_receive, etc). Your proposal blurs this
line significantly. Our patch, in contrast, preserves this model. It
simply provides the necessary data (the hash, VLAN and timestamp) to the
existing cpumap/veth skb path via the xdp_frame.
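
As a rough userspace sketch of that separation (again NOT the patch or
the real kernel structs; every type, field and function name below is
hypothetical, and the per-field "has_*" flags are my assumption for the
example): the program side only records metadata in the frame, and the
existing skb-build step is the one place that applies it.

```c
/* Userspace sketch; NOT the kernel structs or the patch itself.
 * All type, field and function names are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct fake_rx_meta {
	bool     has_hash;
	bool     has_vlan;
	bool     has_tstamp;
	uint32_t hash;
	uint16_t vlan_tci;
	uint64_t tstamp_ns;
};

struct fake_frame {
	void               *data;
	unsigned int        len;
	struct fake_rx_meta meta;   /* travels with the frame */
};

struct fake_skb {
	void        *data;
	unsigned int len;
	uint32_t     hash;
	bool         vlan_present;
	uint16_t     vlan_tci;
	uint64_t     tstamp_ns;
};

/* "Decision" side: only store values in the frame, no skb involved. */
void frame_set_hash(struct fake_frame *f, uint32_t hash)
{
	f->meta.hash = hash;
	f->meta.has_hash = true;
}

void frame_set_vlan(struct fake_frame *f, uint16_t tci)
{
	f->meta.vlan_tci = tci;
	f->meta.has_vlan = true;
}

void frame_set_tstamp(struct fake_frame *f, uint64_t ns)
{
	f->meta.tstamp_ns = ns;
	f->meta.has_tstamp = true;
}

/* "Act" side: the skb-build step copies over whatever was recorded;
 * fields that were never set stay zero. */
void build_skb_from_frame(struct fake_skb *skb, const struct fake_frame *f)
{
	assert(skb && f);
	skb->data = f->data;
	skb->len = f->len;
	skb->hash = f->meta.has_hash ? f->meta.hash : 0;
	skb->vlan_present = f->meta.has_vlan;
	skb->vlan_tci = f->meta.has_vlan ? f->meta.vlan_tci : 0;
	skb->tstamp_ns = f->meta.has_tstamp ? f->meta.tstamp_ns : 0;
}
```

The point of the split is that the setters never touch an skb, so the
build step can stay inside the bulk loop unchanged.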

While more advanced capabilities are an interesting topic for the
future, my goal here is to solve the immediate, concrete problem of
transferring metadata cleanly, without disrupting the performance
architecture we rely on for use cases like DDoS mitigation.

--Jesper