Re: [PATCH net-next v4 2/2] net/mlx5: Avoid copying payload to the skb's linear part

Christoph Paasch <cpaasch@xxxxxxxxxx> · Wed, 3 Sep 2025 20:58:22 -0700

On Wed, Sep 3, 2025 at 5:12 PM Amery Hung <ameryhung@xxxxxxxxx> wrote:
>
> On Wed, Sep 3, 2025 at 4:57 PM Christoph Paasch <cpaasch@xxxxxxxxxx> wrote:
> >
> > On Wed, Sep 3, 2025 at 4:39 PM Amery Hung <ameryhung@xxxxxxxxx> wrote:
> > >
> > >
> > >
> > > On 8/28/25 8:36 PM, Christoph Paasch via B4 Relay wrote:
> > > > From: Christoph Paasch <cpaasch@xxxxxxxxxx>
> > > >
> > > > mlx5e_skb_from_cqe_mpwrq_nonlinear() copies MLX5E_RX_MAX_HEAD (256)
> > > > bytes from the page-pool to the skb's linear part. Those 256 bytes
> > > > include part of the payload.
> > > >
> > > > When attempting to do GRO in skb_gro_receive, if headlen > data_offset
> > > > (and skb->head_frag is not set), we end up aggregating packets in the
> > > > frag_list.
> > > >
> > > > This is of course not good when we are CPU-limited. It also causes a
> > > > worse skb->len/truesize ratio.
> > > >
> > > > So, let's avoid copying parts of the payload to the linear part. We use
> > > > eth_get_headlen() to parse the headers and compute the length of the
> > > > protocol headers, which will be used to copy the relevant bits to the
> > > > skb's linear part.
> > > >
> > > > We still allocate MLX5E_RX_MAX_HEAD for the skb so that if the networking
> > > > stack needs to call pskb_may_pull() later on, we don't need to reallocate
> > > > memory.
> > > >
> > > > This gives a nice throughput increase (ARM Neoverse-V2 with CX-7 NIC and
> > > > LRO enabled):
> > > >
> > > > BEFORE:
> > > > =======
> > > > (netserver pinned to core receiving interrupts)
> > > > $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
> > > >   87380  16384 262144    60.01    32547.82
> > > >
> > > > (netserver pinned to adjacent core receiving interrupts)
> > > > $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
> > > >   87380  16384 262144    60.00    52531.67
> > > >
> > > > AFTER:
> > > > ======
> > > > (netserver pinned to core receiving interrupts)
> > > > $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
> > > >   87380  16384 262144    60.00    52896.06
> > > >
> > > > (netserver pinned to adjacent core receiving interrupts)
> > > > $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
> > > >   87380  16384 262144    60.00    85094.90
> > > >
> > > > Additional tests across a larger range of parameters w/ and w/o LRO, w/
> > > > and w/o IPv6-encapsulation, different MTUs (1500, 4096, 9000), different
> > > > TCP read/write-sizes as well as UDP benchmarks, all have shown equal or
> > > > better performance with this patch.
> > > >
> > > > Signed-off-by: Christoph Paasch <cpaasch@xxxxxxxxxx>
> > > > ---
> > > >   drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 5 +++++
> > > >   1 file changed, 5 insertions(+)
> > > >
> > > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > > > index 8bedbda522808cbabc8e62ae91a8c25d66725ebb..792bb647ba28668ad7789c328456e3609440455d 100644
> > > > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > > > @@ -2047,6 +2047,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> > > >               dma_sync_single_for_cpu(rq->pdev, addr + head_offset, headlen,
> > > >                                       rq->buff.map_dir);
> > > >
> > > > +             headlen = eth_get_headlen(skb->dev, head_addr, headlen);
> > > > +
> > >
> > > Hi,
> > >
> > > I am building on top of this patchset and got a kernel crash. It was
> > > triggered by attaching an xdp program.
> > >
> > > I think the problem is skb->dev is still NULL here. It will be set later by:
> > > mlx5e_complete_rx_cqe() -> mlx5e_build_rx_skb() -> eth_type_trans()
> >
> > Hmmm... Not sure what happened here...
> > I'm almost certain I tested with xdp as well...
> >
> > I will try again later/tomorrow.
> >
>
> Here is the command that triggers the panic:
>
> ip link set dev eth0 mtu 8000 xdp obj /root/ksft-net-drv/net/lib/xdp_native.bpf.o sec xdp.frags
>
> and I should have attached the log:
>
> [ 2851.287387] BUG: kernel NULL pointer dereference, address: 0000000000000100
> [ 2851.301329] #PF: supervisor read access in kernel mode
> [ 2851.311602] #PF: error_code(0x0000) - not-present page
> [ 2851.321879] PGD 0 P4D 0
> [ 2851.326944] Oops: Oops: 0000 [#1] SMP
> [ 2851.334272] CPU: 11 UID: 0 PID: 0 Comm: swapper/11 Kdump: loaded Tainted: G S          E       6.17.0-rc1-gcf50ef415525 #305 NONE
> [ 2851.357759] Tainted: [S]=CPU_OUT_OF_SPEC, [E]=UNSIGNED_MODULE
> [ 2851.369252] Hardware name: Wiwynn Delta Lake MP/Delta Lake-Class1, BIOS Y3DL401 09/04/2024
> [ 2851.385787] RIP: 0010:eth_get_headlen+0x16/0x90
> [ 2851.394850] Code: 5e 41 5f 5d c3 b8 f2 ff ff ff eb f0 cc cc cc cc cc cc cc cc 0f 1f 44 00 00 41 56 53 48 83 ec 10 89 d3 83 fa 0e 72 68 49 89 f6 <48> 8b bf 00 01 00 00 44 0f b7 4e 0c c7 44 24 08 00 00 00 00 48 c7
> [ 2851.432413] RSP: 0018:ffffc90000720cc8 EFLAGS: 00010212
> [ 2851.442864] RAX: 0000000000000000 RBX: 000000000000008a RCX: 00000000000000a0
> [ 2851.457141] RDX: 000000000000008a RSI: ffff8885a5aee100 RDI: 0000000000000000
> [ 2851.471417] RBP: ffff8883d01f3900 R08: ffff888204c7c000 R09: 0000000000000000
> [ 2851.485696] R10: ffff8883d01f3900 R11: ffff8885a5aee340 R12: ffff8885add00030
> [ 2851.499969] R13: ffff8885add00030 R14: ffff8885a5aee100 R15: 0000000000000000
> [ 2851.530433] FS:  0000000000000000(0000) GS:ffff8890b4427000(0000) knlGS:0000000000000000
> [ 2851.530433] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2851.541931] CR2: 0000000000000100 CR3: 000000107d412003 CR4: 00000000007726f0
> [ 2851.556208] PKRU: 55555554
> [ 2851.561623] Call Trace:
> [ 2851.566514]  <IRQ>
> [ 2851.570540]  mlx5e_skb_from_cqe_mpwrq_nonlinear+0x7af/0x8d0
> [ 2851.581689]  mlx5e_handle_rx_cqe_mpwrq+0xbc/0x180
> [ 2851.591096]  mlx5e_poll_rx_cq+0x2ef/0x780
> [ 2851.599114]  mlx5e_napi_poll+0x10c/0x710
> [ 2851.606959]  __napi_poll+0x28/0x160
> [ 2851.613934]  net_rx_action+0x1c0/0x350
> [ 2851.621434]  ? mlx5_eq_comp_int+0xdf/0x190
> [ 2851.629628]  ? sched_clock+0x5/0x10
> [ 2851.636603]  ? sched_clock_cpu+0xc/0x170
> [ 2851.644450]  handle_softirqs+0xd8/0x280
> [ 2851.652121]  __irq_exit_rcu.llvm.7416059615185659459+0x44/0xd0
> [ 2851.663788]  common_interrupt+0x85/0x90
> [ 2851.671457]  </IRQ>
> [ 2851.675653]  <TASK>
> [ 2851.679850]  asm_common_interrupt+0x22/0x40

Oh, I see why I didn't hit the bug when testing with xdp: I wasn't
using a multi-buffer xdp prog, so I had to reduce the MTU, and with
the smaller MTU I ended up not going through the
mlx5e_skb_from_cqe_mpwrq_nonlinear() code-path.

I can reproduce the panic and will fix it.
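
For reference, here is a toy user-space model of the ordering problem.
It is not driver code, and every toy_* name is made up for
illustration. It assumes what Amery described above: skb->dev is still
NULL until eth_type_trans() runs, while the RQ already holds a valid
netdev pointer (rq->netdev in mlx5, as far as I can tell). The parser
only understands Ethernet + IPv4 + TCP/UDP and returns -1 where the
kernel would oops. Whether the real fix ends up passing the RQ's
netdev is still open.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_ETH_HLEN     14
#define TOY_ETH_P_IP     0x0800
#define TOY_IPPROTO_TCP  6
#define TOY_IPPROTO_UDP  17

/* Hypothetical, heavily simplified stand-ins for the kernel structures. */
struct toy_netdev { int ifindex; };
struct toy_skb { struct toy_netdev *dev; };    /* NULL until "eth_type_trans" */
struct toy_rq  { struct toy_netdev *netdev; }; /* valid while the RQ exists */

/*
 * Toy counterpart of eth_get_headlen(): returns the length of the
 * Ethernet/IPv4/TCP-or-UDP headers found in data[0..len), or -1 when
 * dev is NULL (where the real function would dereference it and oops).
 */
static int toy_get_headlen(const struct toy_netdev *dev,
			   const uint8_t *data, size_t len)
{
	size_t off = TOY_ETH_HLEN;
	size_t ihl, doff;
	uint16_t proto;
	uint8_t l4;

	if (len < TOY_ETH_HLEN)
		return (int)len;
	if (!dev)
		return -1;

	proto = (uint16_t)((data[12] << 8) | data[13]);
	if (proto != TOY_ETH_P_IP || len < off + 20)
		return (int)off;

	ihl = (size_t)(data[off] & 0x0f) * 4;	/* IPv4 header length */
	if (ihl < 20 || len < off + ihl)
		return (int)off;
	l4 = data[off + 9];			/* IPv4 protocol field */
	off += ihl;

	if (l4 == TOY_IPPROTO_TCP) {
		if (len < off + 20)
			return (int)off;
		doff = (size_t)(data[off + 12] >> 4) * 4; /* TCP data offset */
		if (doff < 20 || len < off + doff)
			return (int)off;
		off += doff;
	} else if (l4 == TOY_IPPROTO_UDP) {
		if (len < off + 8)
			return (int)off;
		off += 8;
	}
	return (int)off;
}

/* Fill buf with a zeroed Ethernet/IPv4/TCP frame without options. */
static void toy_build_tcp_frame(uint8_t *buf, size_t len)
{
	memset(buf, 0, len);
	if (len < TOY_ETH_HLEN + 20 + 20)
		return;
	buf[12] = TOY_ETH_P_IP >> 8;
	buf[13] = TOY_ETH_P_IP & 0xff;
	buf[TOY_ETH_HLEN] = 0x45;			/* version 4, IHL 5 */
	buf[TOY_ETH_HLEN + 9] = TOY_IPPROTO_TCP;
	buf[TOY_ETH_HLEN + 20 + 12] = 5 << 4;		/* data offset 5 */
}
```

With a 256-byte frame from toy_build_tcp_frame(), passing the NULL
skb.dev gives -1, while passing rq.netdev gives 54 (14 + 20 + 20), so
only the protocol headers would land in the linear part.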

Christoph

>
> Thanks for taking a look!
> Amery
>
> > Thanks!
> > Christoph
> >
> > >
> > >
> > > >               frag_offset += headlen;
> > > >               byte_cnt -= headlen;
> > > >               linear_hr = skb_headroom(skb);
> > > > @@ -2123,6 +2125,9 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> > > >                               pagep->frags++;
> > > >                       while (++pagep < frag_page);
> > > >               }
> > > > +
> > > > +             headlen = eth_get_headlen(skb->dev, mxbuf->xdp.data, headlen);
> > > > +
> > > >               __pskb_pull_tail(skb, headlen);
> > > >       } else {
> > > >               if (xdp_buff_has_frags(&mxbuf->xdp)) {
> > > >
> > >