On Wed, Sep 3, 2025 at 5:12 PM Amery Hung <ameryhung@xxxxxxxxx> wrote:
>
> On Wed, Sep 3, 2025 at 4:57 PM Christoph Paasch <cpaasch@xxxxxxxxxx> wrote:
> >
> > On Wed, Sep 3, 2025 at 4:39 PM Amery Hung <ameryhung@xxxxxxxxx> wrote:
> > >
> > > On 8/28/25 8:36 PM, Christoph Paasch via B4 Relay wrote:
> > > > From: Christoph Paasch <cpaasch@xxxxxxxxxx>
> > > >
> > > > mlx5e_skb_from_cqe_mpwrq_nonlinear() copies MLX5E_RX_MAX_HEAD (256)
> > > > bytes from the page-pool to the skb's linear part. Those 256 bytes
> > > > include part of the payload.
> > > >
> > > > When attempting to do GRO in skb_gro_receive, if headlen > data_offset
> > > > (and skb->head_frag is not set), we end up aggregating packets in the
> > > > frag_list.
> > > >
> > > > This is of course not good when we are CPU-limited. It also causes a
> > > > worse skb->len/truesize ratio,...
> > > >
> > > > So, let's avoid copying parts of the payload to the linear part. We use
> > > > eth_get_headlen() to parse the headers and compute the length of the
> > > > protocol headers, which will be used to copy the relevant bits to the
> > > > skb's linear part.
> > > >
> > > > We still allocate MLX5E_RX_MAX_HEAD for the skb so that if the networking
> > > > stack needs to call pskb_may_pull() later on, we don't need to reallocate
> > > > memory.
> > > >
> > > > This gives a nice throughput increase (ARM Neoverse-V2 with CX-7 NIC and
> > > > LRO enabled):
> > > >
> > > > BEFORE:
> > > > =======
> > > > (netserver pinned to core receiving interrupts)
> > > > $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
> > > > 87380 16384 262144 60.01 32547.82
> > > >
> > > > (netserver pinned to adjacent core receiving interrupts)
> > > > $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
> > > > 87380 16384 262144 60.00 52531.67
> > > >
> > > > AFTER:
> > > > ======
> > > > (netserver pinned to core receiving interrupts)
> > > > $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
> > > > 87380 16384 262144 60.00 52896.06
> > > >
> > > > (netserver pinned to adjacent core receiving interrupts)
> > > > $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
> > > > 87380 16384 262144 60.00 85094.90
> > > >
> > > > Additional tests across a larger range of parameters w/ and w/o LRO, w/
> > > > and w/o IPv6-encapsulation, different MTUs (1500, 4096, 9000), different
> > > > TCP read/write-sizes as well as UDP benchmarks, all have shown equal or
> > > > better performance with this patch.
> > > >
> > > > Signed-off-by: Christoph Paasch <cpaasch@xxxxxxxxxx>
> > > > ---
> > > >  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 5 +++++
> > > >  1 file changed, 5 insertions(+)
> > > >
> > > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > > > index 8bedbda522808cbabc8e62ae91a8c25d66725ebb..792bb647ba28668ad7789c328456e3609440455d 100644
> > > > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > > > @@ -2047,6 +2047,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> > > >  	dma_sync_single_for_cpu(rq->pdev, addr + head_offset, headlen,
> > > >  				rq->buff.map_dir);
> > > >
> > > > +	headlen = eth_get_headlen(skb->dev, head_addr, headlen);
> > > > +
> > >
> > > Hi,
> > >
> > > I am building on top of this patchset and got a kernel crash. It was
> > > triggered by attaching an xdp program.
> > >
> > > I think the problem is skb->dev is still NULL here. It will be set later by:
> > > mlx5e_complete_rx_cqe() -> mlx5e_build_rx_skb() -> eth_type_trans()
> >
> > Hmmm... Not sure what happened here...
> > I'm almost certain I tested with xdp as well...
> >
> > I will try again later/tomorrow.
> >
>
> Here is the command that triggers the panic:
>
> ip link set dev eth0 mtu 8000 xdp obj /root/ksft-net-drv/net/lib/xdp_native.bpf.o sec xdp.frags
>
> and I should have attached the log:
>
> [ 2851.287387] BUG: kernel NULL pointer dereference, address: 0000000000000100
> [ 2851.301329] #PF: supervisor read access in kernel mode
> [ 2851.311602] #PF: error_code(0x0000) - not-present page
> [ 2851.321879] PGD 0 P4D 0
> [ 2851.326944] Oops: Oops: 0000 [#1] SMP
> [ 2851.334272] CPU: 11 UID: 0 PID: 0 Comm: swapper/11 Kdump: loaded Tainted: G S E 6.17.0-rc1-gcf50ef415525 #305 NONE
> [ 2851.357759] Tainted: [S]=CPU_OUT_OF_SPEC, [E]=UNSIGNED_MODULE
> [ 2851.369252] Hardware name: Wiwynn Delta Lake MP/Delta Lake-Class1, BIOS Y3DL401 09/04/2024
> [ 2851.385787] RIP: 0010:eth_get_headlen+0x16/0x90
> [ 2851.394850] Code: 5e 41 5f 5d c3 b8 f2 ff ff ff eb f0 cc cc cc cc cc cc cc cc 0f 1f 44 00 00 41 56 53 48 83 ec 10 89 d3 83 fa 0e 72 68 49 89 f6 <48> 8b bf 00 01 00 00 44 0f b7 4e 0c c7 44 24 08 00 00 00 00 48 c7
> [ 2851.432413] RSP: 0018:ffffc90000720cc8 EFLAGS: 00010212
> [ 2851.442864] RAX: 0000000000000000 RBX: 000000000000008a RCX: 00000000000000a0
> [ 2851.457141] RDX: 000000000000008a RSI: ffff8885a5aee100 RDI: 0000000000000000
> [ 2851.471417] RBP: ffff8883d01f3900 R08: ffff888204c7c000 R09: 0000000000000000
> [ 2851.485696] R10: ffff8883d01f3900 R11: ffff8885a5aee340 R12: ffff8885add00030
> [ 2851.499969] R13: ffff8885add00030 R14: ffff8885a5aee100 R15: 0000000000000000
> [ 2851.514245] FS: 0000000000000000(0000) GS:ffff8890b4427000(0000) knlGS:0000000000000000
> [ 2851.530433] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2851.541931] CR2: 0000000000000100 CR3: 000000107d412003 CR4: 00000000007726f0
> [ 2851.556208] PKRU: 55555554
> [ 2851.561623] Call Trace:
> [ 2851.566514] <IRQ>
> [ 2851.570540] mlx5e_skb_from_cqe_mpwrq_nonlinear+0x7af/0x8d0
> [ 2851.581689] mlx5e_handle_rx_cqe_mpwrq+0xbc/0x180
> [ 2851.591096] mlx5e_poll_rx_cq+0x2ef/0x780
> [ 2851.599114] mlx5e_napi_poll+0x10c/0x710
> [ 2851.606959] __napi_poll+0x28/0x160
> [ 2851.613934] net_rx_action+0x1c0/0x350
> [ 2851.621434] ? mlx5_eq_comp_int+0xdf/0x190
> [ 2851.629628] ? sched_clock+0x5/0x10
> [ 2851.636603] ? sched_clock_cpu+0xc/0x170
> [ 2851.644450] handle_softirqs+0xd8/0x280
> [ 2851.652121] __irq_exit_rcu.llvm.7416059615185659459+0x44/0xd0
> [ 2851.663788] common_interrupt+0x85/0x90
> [ 2851.671457] </IRQ>
> [ 2851.675653] <TASK>
> [ 2851.679850] asm_common_interrupt+0x22/0x40

Oh, I see why I didn't hit the bug when testing with XDP: I wasn't using a
multi-buffer XDP program, so I had to reduce the MTU, and as a result I never
exercised the mlx5e_skb_from_cqe_mpwrq_nonlinear() code path.

I can reproduce the panic and will fix it.

Christoph

>
> Thanks for taking a look!
> Amery
>
> > Thanks!
> > Christoph
> >
> > >
> > > >  	frag_offset += headlen;
> > > >  	byte_cnt -= headlen;
> > > >  	linear_hr = skb_headroom(skb);
> > > > @@ -2123,6 +2125,9 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> > > >  				pagep->frags++;
> > > >  			while (++pagep < frag_page);
> > > >  		}
> > > > +
> > > > +		headlen = eth_get_headlen(skb->dev, mxbuf->xdp.data, headlen);
> > > > +
> > > >  		__pskb_pull_tail(skb, headlen);
> > > >  	} else {
> > > >  		if (xdp_buff_has_frags(&mxbuf->xdp)) {