On 13/08/2025 15.06, Jason Xing wrote:
On Wed, Aug 13, 2025 at 9:02 AM Jason Xing <kerneljasonxing@xxxxxxxxx> wrote:
On Wed, Aug 13, 2025 at 1:49 AM Maciej Fijalkowski
<maciej.fijalkowski@xxxxxxxxx> wrote:
On Tue, Aug 12, 2025 at 04:30:03PM +0200, Jesper Dangaard Brouer wrote:
On 11/08/2025 15.12, Jason Xing wrote:
From: Jason Xing <kernelxing@xxxxxxxxxxx>
Zerocopy mode has a good feature named multi buffer, while copy mode has
to transmit skbs one by one like the normal xmit path. The latter loses
some of its bypass advantage because it grabs/releases the same tx queue
lock and disables/enables bh and so on on a per-packet basis. Contending
for the same queue lock makes the result even worse.
I actually think that it is worth optimizing the non-zerocopy mode for
AF_XDP. My use-case was virtual net_devices like veth.
This patch adds a batching feature: the queue lock is held once while up
to generic_xmit_batch packets are sent in one go (a sketch follows the
list below). To further improve the result, some code[1] is intentionally
left out of xsk_direct_xmit_batch() compared to __dev_direct_xmit().
[1]
1. The device check is moved up to the granularity of the sendto syscall.
2. Packet validation is removed because it is not needed here.
3. The softnet_data.xmit.recursion handling is removed because it is not
necessary.
4. BQL flow control is removed. We don't need BQL here because it would
probably limit the speed. The ideal scenario is a standalone, clean tx
queue that sends packets only for xsk; less competition gives better
performance.
Experiments:
1) Tested on virtio_net:
If you also want to test on veth, then an optimization is to increase
dev->needed_headroom to XDP_PACKET_HEADROOM (256), as this avoids non-zc
AF_XDP packets getting reallocated by the veth driver. I never completed
upstreaming this[1] before I left Red Hat. (virtio_net might also benefit)
[1] https://github.com/xdp-project/xdp-project/blob/main/areas/core/veth_benchmark04.org
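A minimal sketch of that tweak, assuming it is applied at device setup
time (the function name is made up; the real change would presumably land
in the veth driver's setup path):

#include <linux/netdevice.h>
#include <linux/bpf.h>	/* XDP_PACKET_HEADROOM */

/* Hypothetical sketch: reserving XDP_PACKET_HEADROOM up front means non-zc
 * AF_XDP skbs arriving at veth already have enough headroom and do not get
 * reallocated by the driver.
 */
static void veth_reserve_xdp_headroom(struct net_device *dev)
{
	dev->needed_headroom = XDP_PACKET_HEADROOM;
}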
(more below...)
With this patch series applied, the performance number of xdpsock[2] goes
up by 33%: from 767,743 pps before to 1,021,486 pps after.
If we test with another thread contending for the same queue, a 28% increase
(from 405,466 pps to 521,076 pps) can be observed.
2) Tested on ixgbe:
The results of zerocopy and copy mode are 1,303,277 pps and 1,187,347 pps
respectively. After this socket option takes effect, copy mode reaches
1,472,367 pps, which is impressively higher than zerocopy mode.
[2]: ./xdpsock -i eth1 -t -S -s 64
It's worth mentioning that batch processing might bring higher latency in
certain cases. The recommended batch value is 32.
Given the issue I spotted on your ixgbe batching patch, the comparison
against zc performance is probably not reliable.
I have to clarify: the zc performance was tested without that series
applied. That means without that series the number is 1,303,277 pps.
The command I used is './xdpsock -i enp2s0f0np0 -t -q 11 -z -s 64'.
My i40e device is running at 40Gbit/s.
I see significantly higher packets per second (pps) than you are reporting:
$ sudo ./xdpsock -i i40e2 --txonly -q 2 -z -s 64
sock0@i40e2:2 txonly xdp-drv
                 pps            pkts           1.00
rx               0              0
tx               21,546,859     21,552,896
The "copy" mode (-c/--copy) looks like this:
$ sudo ./xdpsock -i i40e2 --txonly -q 2 --copy -s 64
sock0@i40e2:2 txonly xdp-drv
                 pps            pkts           1.00
rx               0              0
tx               2,135,424      2,136,192
The skb-mode (-S, --xdp-skb) looks like this:
$ sudo ./xdpsock -i i40e2 --txonly -q 2 --xdp-skb -s 64
sock0@i40e2:2 txonly xdp-skb
                 pps            pkts           1.00
rx               0              0
tx               2,187,992      2,188,800
The HUGE performance gap to "xdp-drv" zero-copy mode tells me that there
is huge potential for improving the performance of copy mode, in both the
"native" xdp-drv and xdp-skb cases.
Thus, the work you are doing here is important.
------
@Maciej Fijalkowski
The interesting thing is that copy mode is way worse than zerocopy if the
nic is _i40e_.
With ixgbe, even copy mode reaches nearly 50-60% of the full speed
(which is only 1Gb/sec), while either zc or the batch version of copy mode
reaches 70%.
With i40e, copy mode only reaches nearly 9% while zc mode reaches 70%.
i40e has a much faster link speed (10Gb/sec) by comparison.
Here are some summaries (budget 32, batch 32):

           copy         batch        zc
i40e       1,777,859    2,102,579    14,880,643
ixgbe      1,187,347    1,472,367    1,303,277
(added thousands separators to make above readable)
Those numbers only make sense if
i40e runs at link speed 10Gbit/s and
ixgbe runs at link speed 1Gbit/s.
For i40e, here are the numbers around the batch copy mode (budget is 128):

           no batch     batch 64
           1,825,027    2,228,328

Side note: 2,228,328 seems to be the upper limit in copy mode with this
series applied, after testing with different settings.
It turns out that testing on i40e is definitely needed, because the
xdpsock test hits the link-speed bottleneck on ixgbe.
-----
@Jesper Dangaard Brouer
Regarding the 'more' boolean that Jesper mentioned, related drivers might
need changes like this, because in the 'for' loop of the batch process in
xsk_direct_xmit_batch() a driver may fail to send and break out of the
loop, which leaves no chance to kick the hardware.
If we send with the 'more' indicator and the driver fails to send, then it
is the responsibility of the driver to update the tail-ptr/doorbell.
Example for ixgbe driver:
https://elixir.bootlin.com/linux/v6.16/source/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c#L8879-L8880
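For reference, the pattern those lines point to looks roughly like this
(paraphrased from ixgbe_tx_map(); the exact code may differ slightly):

	/* The tail/doorbell write is skipped while the stack says more
	 * packets are coming, but it is forced when the queue is being
	 * stopped, so a batch that ends early still gets kicked.
	 */
	if (netif_xmit_stopped(txring_txq(tx_ring)) || !netdev_xmit_more())
		writel(i, tx_ring->tail);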
Or we can keep trying to send in xsk_direct_xmit_batch() instead of
breaking out immediately when the driver becomes busy.
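A minimal sketch of that "keep trying" idea, assuming a bounded retry (the
helper name and retry count are made up):

#include <linux/netdevice.h>

/* Hypothetical helper: NETDEV_TX_BUSY means the skb was not consumed, so
 * it is safe to resubmit the same skb a bounded number of times instead
 * of giving up on the rest of the batch right away.
 */
static netdev_tx_t xsk_xmit_retry_sketch(struct sk_buff *skb,
					 struct net_device *dev,
					 struct netdev_queue *txq, bool more)
{
	netdev_tx_t ret = NETDEV_TX_BUSY;
	int tries;

	for (tries = 0; tries < 8; tries++) {
		ret = netdev_start_xmit(skb, dev, txq, more);
		if (ret != NETDEV_TX_BUSY)
			break;
		cpu_relax();
	}
	return ret;
}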
As to ixgbe, the performance doesn't improve, as I analyzed, because it
already reaches 70% of full speed.
If ixgbe runs at 1Gbit/s, then remember that Ethernet wire-speed is 1.488
Mpps. Thus, you are much closer to wire-speed than 70%.
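(For the record: a minimum-size 64-byte frame takes 64 + 8 preamble + 12
inter-frame gap = 84 bytes = 672 bits on the wire, so 10^9 / 672 ≈
1,488,095 pps at 1Gbit/s, hence the 1.488 Mpps figure.)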
As to i40e, with only the 'more' logic added, the number goes from
2,102,579 to 2,585,224 with the 32 batch and 32 budget settings. The
number goes from 2,200,013 to 2,697,313 with the batch and 64 budget
settings. See! A 22+% improvement!
That is a very big performance gain IMHO. It shows that avoiding the tail-
ptr/doorbell write between each packet has a huge benefit.
As to virtio_net, there is no obvious change here, probably because the
hardirq logic doesn't have a huge effect.
Perhaps virtio_net doesn't implement the SKB 'more' feature?
--Jesper