Re: [PATCH v6 bpf] xsk: fix immature cq descriptor production

Maciej Fijalkowski <maciej.fijalkowski@xxxxxxxxx> · Tue, 26 Aug 2025 21:13:03 +0200

On Tue, Aug 26, 2025 at 09:03:45PM +0200, Maciej Fijalkowski wrote:
> On Tue, Aug 26, 2025 at 08:23:04PM +0200, Magnus Karlsson wrote:
> > On Tue, 26 Aug 2025 at 18:07, Jason Xing <kerneljasonxing@xxxxxxxxx> wrote:
> > >
> > > On Wed, Aug 20, 2025 at 11:49 PM Maciej Fijalkowski
> > > <maciej.fijalkowski@xxxxxxxxx> wrote:
> > > >
> > > > Eryk reported an issue that I have put under Closes: tag, related to
> > > > umem addrs being prematurely produced onto pool's completion queue.
> > > > Let us make the skb's destructor responsible for producing all addrs
> > > > that given skb used.
> > > >
> > > > Introduce struct xsk_addrs which will carry descriptor count with array
> > > > of addresses taken from processed descriptors that will be carried via
> > > > skb_shared_info::destructor_arg. This way we can refer to it within
> > > > xsk_destruct_skb(). In order to mitigate the overhead that will be
> > > > coming from memory allocations, let us introduce kmem_cache of
> > > > xsk_addrs. There will be a single kmem_cache for xsk generic xmit on the
> > > > system.
> > > >
> > > > Commit from fixes tag introduced the buggy behavior, it was not broken
> > > > from day 1, but rather when xsk multi-buffer got introduced.
> > > >
> > > > Fixes: b7f72a30e9ac ("xsk: introduce wrappers and helpers for supporting multi-buffer in Tx path")
> > > > Reported-by: Eryk Kubanski <e.kubanski@xxxxxxxxxxxxxxxxxxx>
> > > > Closes: https://lore.kernel.org/netdev/20250530103456.53564-1-e.kubanski@xxxxxxxxxxxxxxxxxxx/
> > > > Acked-by: Magnus Karlsson <magnus.karlsson@xxxxxxxxx>
> > > > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@xxxxxxxxx>
> > > > ---
> > > >
> > > > v1:
> > > > https://lore.kernel.org/bpf/20250702101648.1942562-1-maciej.fijalkowski@xxxxxxxxx/
> > > > v2:
> > > > https://lore.kernel.org/bpf/20250705135512.1963216-1-maciej.fijalkowski@xxxxxxxxx/
> > > > v3:
> > > > https://lore.kernel.org/bpf/20250806154127.2161434-1-maciej.fijalkowski@xxxxxxxxx/
> > > > v4:
> > > > https://lore.kernel.org/bpf/20250813171210.2205259-1-maciej.fijalkowski@xxxxxxxxx/
> > > > v5:
> > > > https://lore.kernel.org/bpf/aKXBHGPxjpBDKOHq@boxer/T/
> > > >
> > > > v1->v2:
> > > > * store addrs in array carried via destructor_arg instead having them
> > > >   stored in skb headroom; cleaner and less hacky approach;
> > > > v2->v3:
> > > > * use kmem_cache for xsk_addrs allocation (Stan/Olek)
> > > > * set err when xsk_addrs allocation fails (Dan)
> > > > * change xsk_addrs layout to avoid holes
> > > > * free xsk_addrs on error path
> > > > * rebase
> > > > v3->v4:
> > > > * have kmem_cache as percpu vars
> > > > * don't drop unnecessary braces (unrelated) (Stan)
> > > > * use idx + i in xskq_prod_write_addr (Stan)
> > > > * alloc kmem_cache on bind (Stan)
> > > > * keep num_descs as first member in xsk_addrs (Magnus)
> > > > * add ack from Magnus
> > > > v4->v5:
> > > > * have a single kmem_cache per xsk subsystem (Stan)
> > > > v5->v6:
> > > > * free skb in xsk_build_skb_zerocopy() when xsk_addrs allocation fails
> > > >   (Stan)
> > > > * unregister netdev notifier if creating kmem_cache fails (Stan)
> > > >
> > > > ---
> > > >  net/xdp/xsk.c       | 95 +++++++++++++++++++++++++++++++++++++--------
> > > >  net/xdp/xsk_queue.h | 12 ++++++
> > > >  2 files changed, 91 insertions(+), 16 deletions(-)
> > > >
> > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > > index 9c3acecc14b1..989d5ffb4273 100644
> > > > --- a/net/xdp/xsk.c
> > > > +++ b/net/xdp/xsk.c
> > > > @@ -36,6 +36,13 @@
> > > >  #define TX_BATCH_SIZE 32
> > > >  #define MAX_PER_SOCKET_BUDGET 32
> > > >
> > > > +struct xsk_addrs {
> > > > +       u32 num_descs;
> > > > +       u64 addrs[MAX_SKB_FRAGS + 1];
> > > > +};
> > > > +
> > > > +static struct kmem_cache *xsk_tx_generic_cache;
> > >
> > > IMHO, adding a few heavy operations of allocating and freeing from
> > > cache in the hot path is not a good choice. What I've been trying so
> > > hard lately is to minimize the times of manipulating memory as much as
> > > possible :( Memory hotspot can be easily captured by perf.
> > >
> > > We might provide an new option in setsockopt() to let users
> > > specifically support this use case since it does harm to normal cases?
> > 
> > Agree with you that we should not harm the normal case here. Instead
> > of introducing a setsockopt, how about we detect the case when this
> > can happen in the code? If I remember correctly, it can only occur in
> > the XDP_SHARED_UMEM mode were the xsk pool is shared between
> > processes. If this can be tested (by introducing a new bit in the xsk
> > pool if that is necessary), we could have two potential skb
> > destructors: the old one for the "normal" case and the new one with
> > the list of addresses to complete (using the expensive allocations and
> > deallocations) when it is strictly required i.e., when the xsk pool is
> > shared. Maciej, you are more in to the details of this, so what do you
> > think? Would something like this be a potential path forward?
> 
> Meh, i was focused on 9k mtu impact, it was about 5% on my machine but now
> i checked small packets and indeed i see 12-14% perf regression.
> 
> I'll look into this so Daniel, for now let's drop this unfortunate
> patch...

One more thing - Jason, you still need to focus your work on this approach
where we produce cq entries from destructor. I just need to come up with
smarter way of producing descs to be consumed by destructor :<

> 
> > 
> > >
> > > > +
> > > >  void xsk_set_rx_need_wakeup(struct xsk_buff_pool *pool)
> > > >  {
> > > >         if (pool->cached_need_wakeup & XDP_WAKEUP_RX)
> > > > @@ -532,25 +539,39 @@ static int xsk_wakeup(struct xdp_sock *xs, u8 flags)
> > > >         return dev->netdev_ops->ndo_xsk_wakeup(dev, xs->queue_id, flags);
> > > >  }
> > > >
> > > > -static int xsk_cq_reserve_addr_locked(struct xsk_buff_pool *pool, u64 addr)
> > > > +static int xsk_cq_reserve_locked(struct xsk_buff_pool *pool)
> > > >  {
> > > >         unsigned long flags;
> > > >         int ret;
> > > >
> > > >         spin_lock_irqsave(&pool->cq_lock, flags);
> > > > -       ret = xskq_prod_reserve_addr(pool->cq, addr);
> > > > +       ret = xskq_prod_reserve(pool->cq);
> > > >         spin_unlock_irqrestore(&pool->cq_lock, flags);
> > > >
> > > >         return ret;
> > > >  }
> > > >
> > > > -static void xsk_cq_submit_locked(struct xsk_buff_pool *pool, u32 n)
> > > > +static void xsk_cq_submit_addr_locked(struct xdp_sock *xs,
> > > > +                                     struct sk_buff *skb)
> > > >  {
> > > > +       struct xsk_buff_pool *pool = xs->pool;
> > > > +       struct xsk_addrs *xsk_addrs;
> > > >         unsigned long flags;
> > > > +       u32 num_desc, i;
> > > > +       u32 idx;
> > > > +
> > > > +       xsk_addrs = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
> > > > +       num_desc = xsk_addrs->num_descs;
> > > >
> > > >         spin_lock_irqsave(&pool->cq_lock, flags);
> > > > -       xskq_prod_submit_n(pool->cq, n);
> > > > +       idx = xskq_get_prod(pool->cq);
> > > > +
> > > > +       for (i = 0; i < num_desc; i++)
> > > > +               xskq_prod_write_addr(pool->cq, idx + i, xsk_addrs->addrs[i]);
> > > > +       xskq_prod_submit_n(pool->cq, num_desc);
> > > > +
> > > >         spin_unlock_irqrestore(&pool->cq_lock, flags);
> > > > +       kmem_cache_free(xsk_tx_generic_cache, xsk_addrs);
> > > >  }
> > > >
> > > >  static void xsk_cq_cancel_locked(struct xsk_buff_pool *pool, u32 n)
> > > > @@ -562,11 +583,6 @@ static void xsk_cq_cancel_locked(struct xsk_buff_pool *pool, u32 n)
> > > >         spin_unlock_irqrestore(&pool->cq_lock, flags);
> > > >  }
> > > >
> > > > -static u32 xsk_get_num_desc(struct sk_buff *skb)
> > > > -{
> > > > -       return skb ? (long)skb_shinfo(skb)->destructor_arg : 0;
> > > > -}
> > > > -
> > > >  static void xsk_destruct_skb(struct sk_buff *skb)
> > > >  {
> > > >         struct xsk_tx_metadata_compl *compl = &skb_shinfo(skb)->xsk_meta;
> > > > @@ -576,21 +592,37 @@ static void xsk_destruct_skb(struct sk_buff *skb)
> > > >                 *compl->tx_timestamp = ktime_get_tai_fast_ns();
> > > >         }
> > > >
> > > > -       xsk_cq_submit_locked(xdp_sk(skb->sk)->pool, xsk_get_num_desc(skb));
> > > > +       xsk_cq_submit_addr_locked(xdp_sk(skb->sk), skb);
> > > >         sock_wfree(skb);
> > > >  }
> > > >
> > > > -static void xsk_set_destructor_arg(struct sk_buff *skb)
> > > > +static u32 xsk_get_num_desc(struct sk_buff *skb)
> > > >  {
> > > > -       long num = xsk_get_num_desc(xdp_sk(skb->sk)->skb) + 1;
> > > > +       struct xsk_addrs *addrs;
> > > >
> > > > -       skb_shinfo(skb)->destructor_arg = (void *)num;
> > > > +       addrs = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
> > > > +       return addrs->num_descs;
> > > > +}
> > > > +
> > > > +static void xsk_set_destructor_arg(struct sk_buff *skb, struct xsk_addrs *addrs)
> > > > +{
> > > > +       skb_shinfo(skb)->destructor_arg = (void *)addrs;
> > > > +}
> > > > +
> > > > +static void xsk_inc_skb_descs(struct sk_buff *skb)
> > > > +{
> > > > +       struct xsk_addrs *addrs;
> > > > +
> > > > +       addrs = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
> > > > +       addrs->num_descs++;
> > > >  }
> > > >
> > > >  static void xsk_consume_skb(struct sk_buff *skb)
> > > >  {
> > > >         struct xdp_sock *xs = xdp_sk(skb->sk);
> > > >
> > > > +       kmem_cache_free(xsk_tx_generic_cache,
> > > > +                       (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg);
> > >
> > > Replying to Daniel here: when EOVERFLOW occurs, it will finally go to
> > > above function and clear the allocated memory and skb.
> > >
> > > >         skb->destructor = sock_wfree;
> > > >         xsk_cq_cancel_locked(xs->pool, xsk_get_num_desc(skb));
> > > >         /* Free skb without triggering the perf drop trace */
> > > > @@ -609,6 +641,7 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> > > >  {
> > > >         struct xsk_buff_pool *pool = xs->pool;
> > > >         u32 hr, len, ts, offset, copy, copied;
> > > > +       struct xsk_addrs *addrs = NULL;
> > >
> > > nit: no need to set to "NULL" at the begining.
> > >
> > > >         struct sk_buff *skb = xs->skb;
> > > >         struct page *page;
> > > >         void *buffer;
> > > > @@ -623,6 +656,14 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> > > >                         return ERR_PTR(err);
> > > >
> > > >                 skb_reserve(skb, hr);
> > > > +
> > > > +               addrs = kmem_cache_zalloc(xsk_tx_generic_cache, GFP_KERNEL);
> > > > +               if (!addrs) {
> > > > +                       kfree(skb);
> > > > +                       return ERR_PTR(-ENOMEM);
> > > > +               }
> > > > +
> > > > +               xsk_set_destructor_arg(skb, addrs);
> > > >         }
> > > >
> > > >         addr = desc->addr;
> > > > @@ -662,6 +703,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > > >  {
> > > >         struct xsk_tx_metadata *meta = NULL;
> > > >         struct net_device *dev = xs->dev;
> > > > +       struct xsk_addrs *addrs = NULL;
> > > >         struct sk_buff *skb = xs->skb;
> > > >         bool first_frag = false;
> > > >         int err;
> > > > @@ -694,6 +736,15 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > > >                         err = skb_store_bits(skb, 0, buffer, len);
> > > >                         if (unlikely(err))
> > > >                                 goto free_err;
> > > > +
> > > > +                       addrs = kmem_cache_zalloc(xsk_tx_generic_cache, GFP_KERNEL);
> > > > +                       if (!addrs) {
> > > > +                               err = -ENOMEM;
> > > > +                               goto free_err;
> > > > +                       }
> > > > +
> > > > +                       xsk_set_destructor_arg(skb, addrs);
> > > > +
> > > >                 } else {
> > > >                         int nr_frags = skb_shinfo(skb)->nr_frags;
> > > >                         struct page *page;
> > > > @@ -759,7 +810,9 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > > >         skb->mark = READ_ONCE(xs->sk.sk_mark);
> > > >         skb->destructor = xsk_destruct_skb;
> > > >         xsk_tx_metadata_to_compl(meta, &skb_shinfo(skb)->xsk_meta);
> > > > -       xsk_set_destructor_arg(skb);
> > > > +
> > > > +       addrs = (struct xsk_addrs *)skb_shinfo(skb)->destructor_arg;
> > > > +       addrs->addrs[addrs->num_descs++] = desc->addr;
> > > >
> > > >         return skb;
> > > >
> > > > @@ -769,7 +822,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > > >
> > > >         if (err == -EOVERFLOW) {
> > > >                 /* Drop the packet */
> > > > -               xsk_set_destructor_arg(xs->skb);
> > > > +               xsk_inc_skb_descs(xs->skb);
> > > >                 xsk_drop_skb(xs->skb);
> > > >                 xskq_cons_release(xs->tx);
> > > >         } else {
> > > > @@ -812,7 +865,7 @@ static int __xsk_generic_xmit(struct sock *sk)
> > > >                  * if there is space in it. This avoids having to implement
> > > >                  * any buffering in the Tx path.
> > > >                  */
> > > > -               err = xsk_cq_reserve_addr_locked(xs->pool, desc.addr);
> > > > +               err = xsk_cq_reserve_locked(xs->pool);
> > > >                 if (err) {
> > > >                         err = -EAGAIN;
> > > >                         goto out;
> > > > @@ -1815,8 +1868,18 @@ static int __init xsk_init(void)
> > > >         if (err)
> > > >                 goto out_pernet;
> > > >
> > > > +       xsk_tx_generic_cache = kmem_cache_create("xsk_generic_xmit_cache",
> > > > +                                                sizeof(struct xsk_addrs), 0,
> > > > +                                                SLAB_HWCACHE_ALIGN, NULL);
> > > > +       if (!xsk_tx_generic_cache) {
> > > > +               err = -ENOMEM;
> > > > +               goto out_unreg_notif;
> > > > +       }
> > > > +
> > > >         return 0;
> > > >
> > > > +out_unreg_notif:
> > > > +       unregister_netdevice_notifier(&xsk_netdev_notifier);
> > > >  out_pernet:
> > > >         unregister_pernet_subsys(&xsk_net_ops);
> > > >  out_sk:
> > > > diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
> > > > index 46d87e961ad6..f16f390370dc 100644
> > > > --- a/net/xdp/xsk_queue.h
> > > > +++ b/net/xdp/xsk_queue.h
> > > > @@ -344,6 +344,11 @@ static inline u32 xskq_cons_present_entries(struct xsk_queue *q)
> > > >
> > > >  /* Functions for producers */
> > > >
> > > > +static inline u32 xskq_get_prod(struct xsk_queue *q)
> > > > +{
> > > > +       return READ_ONCE(q->ring->producer);
> > > > +}
> > > > +
> > > >  static inline u32 xskq_prod_nb_free(struct xsk_queue *q, u32 max)
> > > >  {
> > > >         u32 free_entries = q->nentries - (q->cached_prod - q->cached_cons);
> > > > @@ -390,6 +395,13 @@ static inline int xskq_prod_reserve_addr(struct xsk_queue *q, u64 addr)
> > > >         return 0;
> > > >  }
> > > >
> > > > +static inline void xskq_prod_write_addr(struct xsk_queue *q, u32 idx, u64 addr)
> > > > +{
> > > > +       struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
> > > > +
> > > > +       ring->desc[idx & q->ring_mask] = addr;
> > > > +}
> > > > +
> > > >  static inline void xskq_prod_write_addr_batch(struct xsk_queue *q, struct xdp_desc *descs,
> > > >                                               u32 nb_entries)
> > > >  {
> > > > --
> > > > 2.34.1
> > > >
> > > >
> > >