On Wed, Aug 13, 2025 at 1:49 AM Maciej Fijalkowski <maciej.fijalkowski@xxxxxxxxx> wrote:
>
> On Tue, Aug 12, 2025 at 04:30:03PM +0200, Jesper Dangaard Brouer wrote:
> >
> >
> > On 11/08/2025 15.12, Jason Xing wrote:
> > > From: Jason Xing <kernelxing@xxxxxxxxxxx>
> > >
> > > Zerocopy mode has a good feature named multi buffer while copy mode has
> > > to transmit skb one by one like normal flows. The latter might lose the
> > > bypass power to some extent because of grabbing/releasing the same tx
> > > queue lock and disabling/enabling bh and stuff on a packet basis.
> > > Contending the same queue lock will bring a worse result.
> >
> >
> > I actually think that it is worth optimizing the non-zerocopy mode for
> > AF_XDP. My use-case was virtual net_devices like veth.
> >
> >
> > > This patch supports batch feature by permitting owning the queue lock to
> > > send the generic_xmit_batch number of packets at one time. To further
> > > achieve a better result, some codes[1] are removed on purpose from
> > > xsk_direct_xmit_batch() as referred to __dev_direct_xmit().
> > >
> > > [1]
> > > 1. advance the device check to granularity of sendto syscall.
> > > 2. remove validating packets because of its uselessness.
> > > 3. remove operation of softnet_data.xmit.recursion because it's not
> > > necessary.
> > > 4. remove BQL flow control. We don't need to do BQL control because it
> > > probably limit the speed. An ideal scenario is to use a standalone and
> > > clean tx queue to send packets only for xsk. Less competition shows
> > > better performance results.
> > >
> > > Experiments:
> > > 1) Tested on virtio_net:
> >
> > If you also want to test on veth, then an optimization is to increase
> > dev->needed_headroom to XDP_PACKET_HEADROOM (256), as this avoids non-zc
> > AF_XDP packets getting reallocated by veth driver. I never completed
> > upstreaming this[1] before I left Red Hat. (virtio_net might also benefit)
> >
> > [1] https://github.com/xdp-project/xdp-project/blob/main/areas/core/veth_benchmark04.org
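Thanks for the pointer, good to know. IIUC the veth side of this boils
down to reserving the XDP headroom up front so that copy-mode AF_XDP
skbs don't get reallocated by the driver, i.e. roughly the below
(untested, and whether veth_setup() is the right place is just my
guess -- the veth_benchmark04.org notes you linked have the real patch):

	/* drivers/net/veth.c, in veth_setup() or a similar init hook */
	dev->needed_headroom = XDP_PACKET_HEADROOM;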
> >
> > (more below...)
> >
> > > With this patch series applied, the performance number of xdpsock[2] goes
> > > up by 33%. Before, it was 767743 pps; while after it was 1021486 pps.
> > > If we test with another thread competing the same queue, a 28% increase
> > > (from 405466 pps to 521076 pps) can be observed.
> > > 2) Tested on ixgbe:
> > > The results of zerocopy and copy mode are respectively 1303277 pps and
> > > 1187347 pps. After this socket option took effect, copy mode reaches
> > > 1472367 which was higher than zerocopy mode impressively.
> > >
> > > [2]: ./xdpsock -i eth1 -t -S -s 64
> > >
> > > It's worth mentioning batch process might bring high latency in certain
> > > cases. The recommended value is 32.
>
> Given the issue I spotted on your ixgbe batching patch, the comparison
> against zc performance is probably not reliable.

To clarify: the zc performance was tested without that series applied,
i.e. the 1303277 pps number is what zerocopy does without it. The command
I used is './xdpsock -i enp2s0f0np0 -t -q 11 -z -s 64'.

> >
> >
> > > Signed-off-by: Jason Xing <kernelxing@xxxxxxxxxxx>
> > > ---
> > >  include/linux/netdevice.h |   2 +
> > >  net/core/dev.c            |  18 +++++++
> > >  net/xdp/xsk.c             | 103 ++++++++++++++++++++++++++++++++++++--
> > >  3 files changed, 120 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > > index 5e5de4b0a433..27738894daa7 100644
> > > --- a/include/linux/netdevice.h
> > > +++ b/include/linux/netdevice.h
> > > @@ -3352,6 +3352,8 @@ u16 dev_pick_tx_zero(struct net_device *dev, struct sk_buff *skb,
> > >  int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev);
> > >  int __dev_direct_xmit(struct sk_buff *skb, u16 queue_id);
> > > +int xsk_direct_xmit_batch(struct sk_buff **skb, struct net_device *dev,
> > > +			  struct netdev_queue *txq, u32 max_batch, u32 *cur);
> > >  static inline int dev_queue_xmit(struct sk_buff *skb)
> > >  {
> > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > index 68dc47d7e700..7a512bd38806 100644
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > @@ -4742,6 +4742,24 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> > >  }
> > >  EXPORT_SYMBOL(__dev_queue_xmit);
> > > +int xsk_direct_xmit_batch(struct sk_buff **skb, struct net_device *dev,
> > > +			  struct netdev_queue *txq, u32 max_batch, u32 *cur)
> > > +{
> > > +	int ret = NETDEV_TX_BUSY;
> > > +
> > > +	local_bh_disable();
> > > +	HARD_TX_LOCK(dev, txq, smp_processor_id());
> > > +	for (; *cur < max_batch; (*cur)++) {
> > > +		ret = netdev_start_xmit(skb[*cur], dev, txq, false);
> >
> > The last argument ('false') to netdev_start_xmit() indicate if there are
> > 'more' packets to be sent. This allows the NIC driver to postpone
> > writing the tail-pointer/doorbell. For physical hardware this is a large
> > performance gain.
> >
> > If index have not reached 'max_batch' then we know 'more' packets are true.
> >
> >   bool more = !!(*cur != max_batch);
> >
> > Can I ask you to do a test with netdev_start_xmit() using the 'more' boolian
> > ?
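Sure, I can test that. One thing I want to confirm first: IIUC the last
frame of the batch needs more == false so that the driver actually writes
the tail/doorbell at the end of the batch, otherwise with
"!!(*cur != max_batch)" the hint would stay true for every frame inside
the loop. So I'm thinking of something like the below (untested sketch on
top of this patch):

	for (; *cur < max_batch; (*cur)++) {
		/* Tell the driver more frames will follow so it can defer
		 * the tail/doorbell write until the last frame of the batch.
		 */
		bool more = *cur + 1 < max_batch;

		ret = netdev_start_xmit(skb[*cur], dev, txq, more);
		if (ret != NETDEV_TX_OK)
			break;
	}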
> >
> >
> > > +		if (ret != NETDEV_TX_OK)
> > > +			break;
> > > +	}
> > > +	HARD_TX_UNLOCK(dev, txq);
> > > +	local_bh_enable();
> > > +
> > > +	return ret;
> > > +}
> > > +
> > >  int __dev_direct_xmit(struct sk_buff *skb, u16 queue_id)
> > >  {
> > >  	struct net_device *dev = skb->dev;
> > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > index 7a149f4ac273..92ad82472776 100644
> > > --- a/net/xdp/xsk.c
> > > +++ b/net/xdp/xsk.c
> > > @@ -780,9 +780,102 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > >  	return ERR_PTR(err);
> > >  }
> > > -static int __xsk_generic_xmit(struct sock *sk)
> > > +static int __xsk_generic_xmit_batch(struct xdp_sock *xs)
> > > +{
> > > +	u32 max_batch = READ_ONCE(xs->generic_xmit_batch);
> > > +	struct sk_buff **skb = xs->skb_batch;
> > > +	struct net_device *dev = xs->dev;
> > > +	struct netdev_queue *txq;
> > > +	bool sent_frame = false;
> > > +	struct xdp_desc desc;
> > > +	u32 i = 0, j = 0;
> > > +	u32 max_budget;
> > > +	int err = 0;
> > > +
> > > +	mutex_lock(&xs->mutex);
> > > +
> > > +	/* Since we dropped the RCU read lock, the socket state might have changed. */
> > > +	if (unlikely(!xsk_is_bound(xs))) {
> > > +		err = -ENXIO;
> > > +		goto out;
> > > +	}
> > > +
> > > +	if (xs->queue_id >= dev->real_num_tx_queues)
> > > +		goto out;
> > > +
> > > +	if (unlikely(!netif_running(dev) ||
> > > +		     !netif_carrier_ok(dev)))
> > > +		goto out;
> > > +
> > > +	max_budget = READ_ONCE(xs->max_tx_budget);
> > > +	txq = netdev_get_tx_queue(dev, xs->queue_id);
> > > +	do {
> > > +		for (; i < max_batch && xskq_cons_peek_desc(xs->tx, &desc, xs->pool); i++) {
> here we should think how to come up with slightly modified version of
> xsk_tx_peek_release_desc_batch() for generic xmit needs, or what could we
> borrow from this approach that will be applicable here.

Okay, I will dig into it more. Thanks.

>
> > > +			if (max_budget-- == 0) {
> > > +				err = -EAGAIN;
> > > +				break;
> > > +			}
> > > +			/* This is the backpressure mechanism for the Tx path.
> > > +			 * Reserve space in the completion queue and only proceed
> > > +			 * if there is space in it. This avoids having to implement
> > > +			 * any buffering in the Tx path.
> > > +			 */
> > > +			err = xsk_cq_reserve_addr_locked(xs->pool, desc.addr);
> > > +			if (err) {
> > > +				err = -EAGAIN;
> > > +				break;
> > > +			}
> > > +
> > > +			skb[i] = xsk_build_skb(xs, &desc);
> >
> > There is a missed opportunity for bulk allocating the SKBs here
> > (via kmem_cache_alloc_bulk).
>
> +1
>
> >
> > But this also requires changing the SKB alloc function used by
> > xsk_build_skb(). As a seperate patch, I recommend that you change the
> > sock_alloc_send_skb() to instead use build_skb (or build_skb_around).
> > I expect this will be a large performance improvement on it's own.
> > Can I ask you to benchmark this change before the batch xmit change?
> >
> > Opinions needed from other maintainers please (I might be wrong!):
> > I don't think the socket level accounting done in sock_alloc_send_skb()
> > is correct/relevant for AF_XDP/XSK, because the "backpressure mechanism"
> > code comment above.
>
> Thanks for bringing this up, I had the same feeling.
>
> >
> > --Jesper
> >
> > > +			if (IS_ERR(skb[i])) {
> > > +				err = PTR_ERR(skb[i]);
> > > +				break;
> > > +			}
> > > +
> > > +			xskq_cons_release(xs->tx);
> > > +
> > > +			if (xp_mb_desc(&desc))
> > > +				xs->skb = skb[i];
> > > +		}
> > > +
> > > +		if (i) {
> > > +			err = xsk_direct_xmit_batch(skb, dev, txq, i, &j);
> > > +			if (err == NETDEV_TX_BUSY) {
> > > +				err = -EAGAIN;
> > > +			} else if (err == NET_XMIT_DROP) {
> > > +				j++;
> > > +				err = -EBUSY;
> > > +			}
> > > +
> > > +			sent_frame = true;
> > > +			xs->skb = NULL;
> > > +		}
> > > +
> > > +		if (err)
> > > +			goto out;
> > > +		i = j = 0;
> > > +	} while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool));
>
> from the quick glance i don't follow why you have this call here whilst
> having it above in the while loop.

Because with do {} while () we skip the extra xskq_cons_peek_desc() that a
plain while () loop would have to do before the first iteration.

>
> BTW do we have something bulk skb freeing in the kernel? given we're gonna
> eventually do kmem_cache_alloc_bulk for skbs then could we do
> kmem_cache_free_bulk() as well?

Good point.
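IIRC the head freeing is already partially bulked in the stack:
napi_consume_skb() parks the heads in the per-cpu NAPI skb cache and that
cache is flushed via kmem_cache_free_bulk(), but the per-skb release of
data/frags still runs one by one and it is only used from NAPI context,
so the error path here would probably keep walking the array. For the
alloc side I think we can mirror what cpumap does. A rough, untested
sketch (assuming xsk.c can simply reuse xdp_alloc_skb_bulk(), which IIRC
is a thin wrapper around kmem_cache_alloc_bulk() on the skb head cache):

	/* Grab all skb heads for one batch in a single call. __GFP_ZERO
	 * because build_skb_around() expects a cleared head, same as the
	 * cpumap path does it.
	 */
	err = xdp_alloc_skb_bulk((void **)skb, max_batch,
				 GFP_KERNEL | __GFP_ZERO);
	if (err)
		goto out;

	/* xsk_build_skb() would then take skb[i] and initialize it with
	 * build_skb_around() instead of calling sock_alloc_send_skb(),
	 * which also lines up with Jesper's build_skb() suggestion above.
	 */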
Let me deal with it :)

Thanks,
Jason

>
> > > +
> > > +	if (xskq_has_descs(xs->tx)) {
> > > +		if (xs->skb)
> > > +			xsk_drop_skb(xs->skb);
> > > +		xskq_cons_release(xs->tx);
> > > +	}
> > > +
> > > +out:
> > > +	for (; j < i; j++) {
> > > +		xskq_cons_cancel_n(xs->tx, xsk_get_num_desc(skb[j]));
> > > +		xsk_consume_skb(skb[j]);
> > > +	}
> > > +	if (sent_frame)
> > > +		__xsk_tx_release(xs);
> > > +
> > > +	mutex_unlock(&xs->mutex);
> > > +	return err;
> > > +}
> > > +
> > > +static int __xsk_generic_xmit(struct xdp_sock *xs)
> > >  {
> > > -	struct xdp_sock *xs = xdp_sk(sk);
> > >  	bool sent_frame = false;
> > >  	struct xdp_desc desc;
> > >  	struct sk_buff *skb;
> > > @@ -871,11 +964,15 @@ static int __xsk_generic_xmit(struct sock *sk)
> > >  static int xsk_generic_xmit(struct sock *sk)
> > >  {
> > > +	struct xdp_sock *xs = xdp_sk(sk);
> > >  	int ret;
> > >  	/* Drop the RCU lock since the SKB path might sleep. */
> > >  	rcu_read_unlock();
> > > -	ret = __xsk_generic_xmit(sk);
> > > +	if (READ_ONCE(xs->generic_xmit_batch))
> > > +		ret = __xsk_generic_xmit_batch(xs);
> > > +	else
> > > +		ret = __xsk_generic_xmit(xs);
> > >  	/* Reaquire RCU lock before going into common code. */
> > >  	rcu_read_lock();
> > >
> >