On Wed, Aug 27, 2025 at 3:13 AM Maciej Fijalkowski
<maciej.fijalkowski@xxxxxxxxx> wrote:
>
> On Tue, Aug 26, 2025 at 09:03:45PM +0200, Maciej Fijalkowski wrote:
> > On Tue, Aug 26, 2025 at 08:23:04PM +0200, Magnus Karlsson wrote:
> > > On Tue, 26 Aug 2025 at 18:07, Jason Xing <kerneljasonxing@xxxxxxxxx> wrote:
> > > >
> > > > On Wed, Aug 20, 2025 at 11:49 PM Maciej Fijalkowski
> > > > <maciej.fijalkowski@xxxxxxxxx> wrote:
> > > > >
> > > > > Eryk reported an issue that I have put under the Closes: tag, related
> > > > > to umem addrs being prematurely produced onto the pool's completion
> > > > > queue. Let us make the skb's destructor responsible for producing all
> > > > > addrs that a given skb used.
> > > > >
> > > > > Introduce struct xsk_addrs, which carries the descriptor count along
> > > > > with an array of addresses taken from the processed descriptors, and
> > > > > is passed via skb_shared_info::destructor_arg. This way we can refer
> > > > > to it within xsk_destruct_skb(). In order to mitigate the overhead
> > > > > coming from memory allocations, introduce a kmem_cache of xsk_addrs.
> > > > > There will be a single kmem_cache for xsk generic xmit on the system.
> > > > >
> > > > > The commit in the Fixes tag introduced the buggy behavior; it was not
> > > > > broken from day 1, but rather when xsk multi-buffer support got
> > > > > introduced.
> > > > >
> > > > > Fixes: b7f72a30e9ac ("xsk: introduce wrappers and helpers for supporting multi-buffer in Tx path")
> > > > > Reported-by: Eryk Kubanski <e.kubanski@xxxxxxxxxxxxxxxxxxx>
> > > > > Closes: https://lore.kernel.org/netdev/20250530103456.53564-1-e.kubanski@xxxxxxxxxxxxxxxxxxx/
> > > > > Acked-by: Magnus Karlsson <magnus.karlsson@xxxxxxxxx>
> > > > > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@xxxxxxxxx>
> > > > > ---
> > > > >
> > > > > v1:
> > > > > https://lore.kernel.org/bpf/20250702101648.1942562-1-maciej.fijalkowski@xxxxxxxxx/
> > > > > v2:
> > > > > https://lore.kernel.org/bpf/20250705135512.1963216-1-maciej.fijalkowski@xxxxxxxxx/
> > > > > v3:
> > > > > https://lore.kernel.org/bpf/20250806154127.2161434-1-maciej.fijalkowski@xxxxxxxxx/
> > > > > v4:
> > > > > https://lore.kernel.org/bpf/20250813171210.2205259-1-maciej.fijalkowski@xxxxxxxxx/
> > > > > v5:
> > > > > https://lore.kernel.org/bpf/aKXBHGPxjpBDKOHq@boxer/T/
> > > > >
> > > > > v1->v2:
> > > > > * store addrs in an array carried via destructor_arg instead of having
> > > > >   them stored in skb headroom; cleaner and less hacky approach;
> > > > > v2->v3:
> > > > > * use kmem_cache for xsk_addrs allocation (Stan/Olek)
> > > > > * set err when xsk_addrs allocation fails (Dan)
> > > > > * change xsk_addrs layout to avoid holes
> > > > > * free xsk_addrs on error path
> > > > > * rebase
> > > > > v3->v4:
> > > > > * have kmem_cache as percpu vars
> > > > > * don't drop unnecessary braces (unrelated) (Stan)
> > > > > * use idx + i in xskq_prod_write_addr (Stan)
> > > > > * alloc kmem_cache on bind (Stan)
> > > > > * keep num_descs as first member in xsk_addrs (Magnus)
> > > > > * add ack from Magnus
> > > > > v4->v5:
> > > > > * have a single kmem_cache per xsk subsystem (Stan)
> > > > > v5->v6:
> > > > > * free skb in xsk_build_skb_zerocopy() when xsk_addrs allocation fails
> > > > >   (Stan)
> > > > > * unregister netdev notifier if creating kmem_cache fails (Stan)
> > > > >
> > > > > ---
> > > > >  net/xdp/xsk.c       | 95 +++++++++++++++++++++++++++++++++++++--------
> > > > >  net/xdp/xsk_queue.h | 12 ++++++
> > > > >  2 files changed, 91 insertions(+), 16 deletions(-)
> > > > >
> > > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > > > index 9c3acecc14b1..989d5ffb4273 100644
> > > > > --- a/net/xdp/xsk.c
> > > > > +++ b/net/xdp/xsk.c
> > > > > @@ -36,6 +36,13 @@
> > > > >  #define TX_BATCH_SIZE 32
> > > > >  #define MAX_PER_SOCKET_BUDGET 32
> > > > >
> > > > > +struct xsk_addrs {
> > > > > +	u32 num_descs;
> > > > > +	u64 addrs[MAX_SKB_FRAGS + 1];
> > > > > +};
> > > > > +
> > > > > +static struct kmem_cache *xsk_tx_generic_cache;
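To recap the scheme for anyone skimming the thread: addrs are only recorded
at xmit time and hit the completion queue from the skb destructor. A rough
sketch of the idea follows - it is not the exact patch; xsk_cq_produce_addr()
stands in for the real xskq_prod_* helpers, the allocation flags and helper
names are illustrative, and error paths are trimmed:

static int xsk_skb_stash_addr(struct sk_buff *skb, u64 addr)
{
	struct xsk_addrs *xa = skb_shinfo(skb)->destructor_arg;

	if (!xa) {
		/* one allocation per skb, served by the kmem_cache */
		xa = kmem_cache_zalloc(xsk_tx_generic_cache, GFP_KERNEL);
		if (!xa)
			return -ENOMEM;
		skb_shinfo(skb)->destructor_arg = xa;
	}

	/* stash the umem addr instead of producing it to cq right away */
	xa->addrs[xa->num_descs++] = addr;
	return 0;
}

static void xsk_destruct_skb(struct sk_buff *skb)
{
	struct xsk_addrs *xa = skb_shinfo(skb)->destructor_arg;
	struct xdp_sock *xs = xdp_sk(skb->sk);
	u32 i;

	/* only now, once the skb is really consumed, do addrs hit the cq */
	for (i = 0; i < xa->num_descs; i++)
		xsk_cq_produce_addr(xs->pool, xa->addrs[i]);

	kmem_cache_free(xsk_tx_generic_cache, xa);
	sock_wfree(skb);
}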

> > > > IMHO, adding a few heavy operations of allocating and freeing from
> > > > a cache in the hot path is not a good choice. What I've been trying
> > > > so hard lately is to minimize memory manipulation as much as
> > > > possible :( A memory hotspot can easily be captured by perf.
> > > >
> > > > We might provide a new option in setsockopt() to let users explicitly
> > > > opt in to this use case, since it does harm to the normal cases?
> > >
> > > Agree with you that we should not harm the normal case here. Instead
> > > of introducing a setsockopt, how about we detect the case when this
> > > can happen in the code? If I remember correctly, it can only occur in
> > > the XDP_SHARED_UMEM mode where the xsk pool is shared between
> > > processes. If this can be tested (by introducing a new bit in the xsk
> > > pool if that is necessary), we could have two potential skb
> > > destructors: the old one for the "normal" case and the new one with
> > > the list of addresses to complete (using the expensive allocations and
> > > deallocations) when it is strictly required, i.e., when the xsk pool
> > > is shared. Maciej, you are more into the details of this, so what do
> > > you think? Would something like this be a potential path forward?
> >
> > Meh, I was focused on the 9k MTU impact, which was about 5% on my
> > machine, but now I checked small packets and indeed I see a 12-14% perf
> > regression.
> >
> > I'll look into this, so Daniel, for now let's drop this unfortunate
> > patch...
>
> One more thing - Jason, you still need to base your work on this approach
> where we produce cq entries from the destructor. I just need to come up
> with a smarter way of producing descs to be consumed by the destructor :<

No problem. Before getting to that batch feature, I'm dealing with the
AF_PACKET implementation first, which will probably take more time than
it should.

Thanks,
Jason
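P.S. For the record, Magnus's two-destructor idea above would amount to
something like the sketch below. The pool->shared bit and the
xsk_destruct_skb_legacy() name are purely illustrative - nothing like them
exists in the patch yet - while xsk_destruct_skb() refers to the addr-array
destructor sketched earlier:

/* Pick the completion strategy at skb build time: pay for the addr
 * array only when the pool is shared (XDP_SHARED_UMEM), keep the
 * cheap destructor otherwise. Hypothetical sketch.
 */
static void xsk_set_destructor(struct sk_buff *skb, struct xdp_sock *xs)
{
	if (xs->pool->shared)
		/* shared pool: the kmem_cache-backed addr-array destructor */
		skb->destructor = xsk_destruct_skb;
	else
		/* normal case: the old per-skb accounting, no allocation */
		skb->destructor = xsk_destruct_skb_legacy;
}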