Re: [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets

On 06/08/2025 02.28, Jakub Kicinski wrote:
On Mon, 4 Aug 2025 15:18:35 +0200 Jesper Dangaard Brouer wrote:
On 01/08/2025 22.38, Jakub Kicinski wrote:
On Thu, 31 Jul 2025 18:27:07 +0200 Jesper Dangaard Brouer wrote:
I have strong reservations about having the BPF program itself trigger
the SKB allocation. I believe this would fundamentally break the
performance model that makes cpumap redirect so effective.

See, I have similar concerns about growing struct xdp_frame.

IMHO there is a huge difference in doing memory allocs+init vs. growing
struct xdp_frame.

It is very important to notice that this patchset is not actually growing
xdp_frame in the traditional sense; instead we are adding an optional
area to xdp_frame (plus some flags to tell if the area is in use).  Remember
the xdp_frame area is not allocated or mem-zeroed (except the flags).  If
not used, the members in struct xdp_rx_meta are never touched.

Yes, I get all that.

Thus, there is actually no performance impact in growing struct
xdp_frame in this way. Do you still have concerns?

You're adding code in a number of paths; I don't think it's fair to
claim that there is *no* performance impact. Maybe no impact on
XDP_DROP from the patches themselves, assuming the driver doesn't
pre-populate.


I feel a need to state this: the purpose of this patchset is to
increase performance by providing access to offload hints.  The common
theme is that these offload hints help us avoid data cache misses and/or
skip some steps in software.  Setting the VLAN (__vlan_hwaccel_put_tag)
avoids an extra trip through the RX-handler.  Setting the RX-hash avoids
calling into the flow dissector to calculate it in software.  The RX-hash
is an extended type that already encodes the packet type (IPv4/IPv6 and
UDP/TCP), which simplifies the BPF code needed.  We are still missing
checksum offload, as Lorenzo mentions, but that is the plan, as it has
the most gain (for TCP, csum_partial shows up in perf top for cpumap).
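
To make this concrete, here is a minimal sketch of how such stored hints
could be applied when an xdp_frame is converted to an SKB (e.g. on the
cpumap remote CPU).  The rx_meta field and flag names are illustrative
assumptions, not the exact patchset layout; skb_set_hash() and
__vlan_hwaccel_put_tag() are the existing kernel helpers:

/* Minimal sketch (not the exact patchset code): apply hints stored in
 * the xdp_frame's optional rx_meta area to the skb during the
 * xdp_frame -> skb conversion, e.g. on the cpumap remote CPU.
 * Field and flag names are illustrative assumptions.
 */
#include <linux/skbuff.h>
#include <linux/if_vlan.h>
#include <net/xdp.h>

static void xdp_rx_meta_apply_to_skb(const struct xdp_frame *xdpf,
				     struct sk_buff *skb)
{
	const struct xdp_rx_meta *meta = &xdpf->rx_meta;	/* assumed layout */

	if (xdpf->flags & XDP_FLAGS_META_RX_HASH) {		/* assumed flag */
		/* Saves a software flow_dissector hash calculation later.
		 * Real code would map the extended RSS hash type instead
		 * of hard-coding PKT_HASH_TYPE_L4.
		 */
		skb_set_hash(skb, meta->hash, PKT_HASH_TYPE_L4);
	}

	if (xdpf->flags & XDP_FLAGS_META_VLAN) {		/* assumed flag */
		/* Saves the extra RX-handler trip for VLAN untagging */
		__vlan_hwaccel_put_tag(skb, meta->vlan_proto, meta->vlan_tci);
	}
}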


Do you have any idea how well this approach will scale to all the fields
people will want to add to xdp_frame in the future? The nice thing about
the SET ops is that the driver can define whatever ops it supports,
including things not supported by skb (or supported thru skb_ext),
at zero cost to the common stack. If we define the fields in the core
we're back to the inflexibility of the skb world...


This is where traits come in.  For now the struct has static members, but
we want to convert this to a dynamic, traits-based struct if demand for
more members arises.  The patchset API allows us to switch to this
approach later.
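
For illustration only, the static-member shape could look roughly like
this (names are my shorthand, not the exact definitions in the patches):

/* Rough illustration of the static-member shape (shorthand names, not
 * the exact patch definitions).  The area is only valid when the
 * matching flag bit in xdp_frame->flags is set, so unused members are
 * never written or zeroed.
 */
struct xdp_rx_meta {
	u32	hash;		/* HW RX-hash value */
	u32	hash_type;	/* extended RSS hash type (L3/L4, v4/v6) */
	u16	vlan_tci;	/* VLAN tag control information */
	__be16	vlan_proto;	/* ETH_P_8021Q / ETH_P_8021AD */
};

#define XDP_FLAGS_META_RX_HASH	BIT(8)	/* illustrative flag bits */
#define XDP_FLAGS_META_VLAN	BIT(9)

A traits-based variant could then replace the fixed members with a
dynamically described area, so more fields could be added later without
growing the static struct.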

The SET ops API requires two XDP programs, one at the physical NIC and
one at veth, plus agreement on a side-band layout for transferring e.g.
the RX-hash and timestamp.  The veth XDP-prog needs to run on the peer
device; for containers this is the veth device inside the container.  If
the veth XDP-prog doesn't clear the data_meta area, GRO aggregation
breaks (e.g. for timestamp usage).
Good luck getting this to work for containers.

Our solution works out-of-the-box for containers.  We only need one
XDP-prog, at the physical NIC, which supplies the missing hardware
offloads for XDP-redirect into the veth device.
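
For comparison, the side-band contract the SET-ops approach needs would
look roughly like the sketch below on the physical-NIC side.  The struct
layout and veth_ifindex are invented for illustration; the RX-metadata
kfuncs are the existing driver-provided ones, and the veth-side program
would have to parse and then clear this area again:

/* Sketch of the side-band data_meta contract: an XDP prog on the
 * physical NIC reads HW hints via the RX-metadata kfuncs and stashes
 * them in data_meta before redirecting to veth.  The struct layout is
 * invented for illustration; both programs must agree on it out-of-band.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct rx_hints_meta {			/* invented side-band layout */
	__u32 hash;
	__u32 pad;
	__u64 timestamp;
};

extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash,
				    enum xdp_rss_hash_type *rss_type) __ksym;
extern int bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx,
					 __u64 *timestamp) __ksym;

const volatile int veth_ifindex;	/* filled in by the loader (assumed) */

SEC("xdp")
int nic_store_hints(struct xdp_md *ctx)
{
	enum xdp_rss_hash_type rss_type;
	struct rx_hints_meta *meta;

	/* Make room in front of the packet for the side-band metadata */
	if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)))
		return XDP_PASS;

	meta = (void *)(long)ctx->data_meta;
	if ((void *)(meta + 1) > (void *)(long)ctx->data)
		return XDP_PASS;

	if (bpf_xdp_metadata_rx_hash(ctx, &meta->hash, &rss_type))
		meta->hash = 0;
	if (bpf_xdp_metadata_rx_timestamp(ctx, &meta->timestamp))
		meta->timestamp = 0;

	/* The veth-side XDP prog must parse *and clear* this data_meta
	 * area again, or GRO aggregation breaks.
	 */
	return bpf_redirect(veth_ifindex, 0);
}

char _license[] SEC("license") = "GPL";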


That's why the guiding principle for me would be to make sure that
the features we add, beyond "classic XDP" as needed by DDoS, are
entirely optional.

Exactly, we agree.  What we do in this patchset is entirely optional.
These changes do not slow down "classic XDP" or our DDoS use-case.

And if we include the goal of moving skb allocation
out of the driver to the xdp_frame growth, the drivers will sooner or
later unconditionally populate the xdp_frame. Decreasing performance
of "classic XDP"?

No, that is the beauty of this solution: it will not decrease the
performance of "classic XDP".

Do keep in mind that "moving skb allocation out of the driver" is not
part of this patchset; it is a moonshot goal that will take a long time
(but we have already been "simulating" this via XDP-redirect for years
now).  Drivers should obviously not unconditionally populate the
xdp_frame's rx_meta area.  The time to populate rx_meta is when the
driver reaches the XDP_PASS case (normal netstack delivery).  Today all
drivers at this stage populate the SKB metadata (e.g. rx-hash + vlan)
from the RX-descriptor anyway.  Thus, I don't see how replacing those
writes would decrease performance.

I don't think it's at all obvious that the driver should not
unconditionally populate the xdp_frame. It seems like the logical
direction to me, TBH. Driver pre-populates, then the conversion
and GET callbacks become trivial and generic..


This is related to when cache lines are ready.  All XDP drivers
prefetchw the xdp_frame area before starting the XDP-prog.  Thus, the
driver wants to delay writing until this cache line is ready.
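
A hypothetical driver RX-path fragment (not taken from any specific
driver) illustrating the point; prefetchw() and bpf_prog_run_xdp() are
the existing kernel helpers:

/* Hypothetical driver RX-path fragment (not from any specific driver).
 * The headroom in front of the packet (data_hard_start) is where a
 * struct xdp_frame later gets placed by xdp_convert_buff_to_frame(),
 * so the driver warms that cache line for write before running XDP.
 */
#include <linux/prefetch.h>
#include <linux/filter.h>

static u32 rx_run_xdp(const struct bpf_prog *prog, struct xdp_buff *xdp)
{
	/* Request the cache line in exclusive state early, so later
	 * writes (xdp_frame conversion, rx_meta population) do not
	 * stall on a read-for-ownership miss.
	 */
	prefetchw(xdp->data_hard_start);

	return bpf_prog_run_xdp(prog, xdp);
}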

Perhaps we should try to convert a real driver in this series.


What do you mean?
This series is about the XDP_REDIRECT case, so we don't need to modify
any physical NIC driver.

Do you want this series to include the ability to XDP-override the
hardware offloads for RX-hash and VLAN in the XDP_PASS case for a real
physical NIC driver?


The key to XDP's high performance lies in processing a bulk of
xdp_frames in a tight loop to amortize costs. The existing cpumap code
on the remote CPU is already highly optimized for this: it performs bulk
allocation of SKBs and uses careful prefetching to hide the memory
latency. Allowing a BPF program to sometimes trigger a heavyweight SKB
alloc+init (4 cache-line misses) would bypass all these existing
optimizations. It would introduce significant jitter into the pipeline
and disrupt the entire bulk-processing model we rely on for performance.

This performance is not just theoretical;
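
For reference, a condensed sketch of the bulk-processing pattern the
cpumap kthread already uses (simplified and paraphrased from
kernel/bpf/cpumap.c; exact symbol names and gfp flags vary between
kernel versions):

/* Condensed sketch of the existing cpumap bulk-processing pattern
 * (simplified and paraphrased; symbol names and gfp flags vary between
 * kernel versions).
 */
#define CPUMAP_BATCH 8

static void cpumap_build_skbs_sketch(struct ptr_ring *queue,
				     struct list_head *list)
{
	void *frames[CPUMAP_BATCH];
	struct sk_buff *skbs[CPUMAP_BATCH];
	int i, n;

	/* Pull a whole batch of xdp_frames off the per-CPU ptr_ring */
	n = __ptr_ring_consume_batched(queue, frames, CPUMAP_BATCH);

	/* Warm the frame/page memory on this CPU before touching it */
	for (i = 0; i < n; i++) {
		struct xdp_frame *xdpf = frames[i];

		prefetchw(virt_to_page(xdpf->data));
	}

	/* One bulk slab allocation amortizes the SKB alloc cost;
	 * __GFP_ZERO: the skbs must arrive pre-zeroed here.
	 */
	if (!kmem_cache_alloc_bulk(net_hotdata.skbuff_cache,
				   GFP_KERNEL | __GFP_ZERO,
				   n, (void **)skbs))
		return;		/* real code falls back to per-frame alloc */

	for (i = 0; i < n; i++) {
		struct xdp_frame *xdpf = frames[i];
		struct sk_buff *skb;

		/* Reuses the pre-allocated skb; no per-packet alloc+init */
		skb = __xdp_build_skb_from_frame(xdpf, skbs[i], xdpf->dev_rx);
		if (skb)
			list_add_tail(&skb->list, list);
	}
	/* The list then goes to netif_receive_skb_list() in one bulk call */
}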

Somewhat off-topic for the architecture, I think, but do you happen
to have any real life data for that? IIRC the "listification" was a
moderate success for the skb path.. Or am I misreading and you have
other benefits of a tight processing loop in mind?

Our "tight processing loop" for NAPI (net_rx_action/napi_pool) is not
performing as well as we want. One major reason is that the CPU is being
stalled each time in the loop when the NIC driver needs to clear the 4
cache-lines for the SKB.  XDP have shown us that avoiding these steps is
a huge performance boost.

Do you know what uarch resource it's stalling on?
It's been on my mind whether, in the attempts to zero out as
little as possible, we didn't defeat CPU optimizations for clearing
full cache lines.


The main performance-stall problem with zeroing is the 'rep stos'
(repeated string store) operation.  See the comments in this memset
micro-benchmark [1].

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_memset.c

A memset of 32 bytes (or less) results in MOVQ instructions, which are
really fast.  For larger sizes the compiler usually generates 'rep stos'
instructions, which have a high "startup" cost.
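
A tiny example one can inspect with 'gcc -O2 -S' (code generation is
compiler- and architecture-dependent; this only illustrates the
small-vs-large split described above):

/* Illustration of the small-vs-large memset split described above.
 * Code generation is compiler/arch dependent; inspect with: gcc -O2 -S
 */
#include <string.h>

struct small_meta { unsigned long w[4]; };	/* 32 bytes  */
struct big_clear  { unsigned long w[64]; };	/* 512 bytes */

void clear_small(struct small_meta *m)
{
	/* Typically inlined as a few MOVQ/vector stores: cheap */
	memset(m, 0, sizeof(*m));
}

void clear_big(struct big_clear *b)
{
	/* Typically becomes 'rep stos' (or a memset call), which has a
	 * noticeable fixed startup cost on many x86 microarchitectures.
	 */
	memset(b, 0, sizeof(*b));
}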

--Jesper
