On Fri, Jul 04, 2025 at 08:14:20PM +0000, Dragos Tatulea wrote:
> On Fri, Jul 04, 2025 at 12:37:36PM +0000, Dragos Tatulea wrote:
> > On Thu, Jul 03, 2025 at 10:49:20AM -0500, Chris Arges wrote:
> > > When running iperf through a set of XDP programs we were able to crash
> > > machines with NICs using the mlx5_core driver. We were able to confirm
> > > that other NICs/drivers did not exhibit the same problem, and suspect
> > > this could be a memory management issue in the driver code.
> > > Specifically we found a WARNING at include/net/page_pool/helpers.h:277
> > > in mlx5e_page_release_fragmented.isra. We are able to demonstrate this
> > > issue in production using hardware, but cannot easily bisect because
> > > we don’t have a simple reproducer.
> >
> > Thanks for the report! We will investigate.
> >
> > > I wanted to share stack traces in
> > > order to help us further debug and understand if anyone else has run
> > > into this issue. We are currently working on getting more crashdumps
> > > and doing further analysis.
> > >
> > > The test setup looks like the following:
> > > ┌─────┐
> > > │mlx5 │
> > > │NIC  │
> > > └──┬──┘
> > >    │xdp ebpf program (does encap and XDP_TX)
> > >    │
> > >    ▼
> > > ┌──────────────────────┐
> > > │xdp.frags             │
> > > │                      │
> > > └──┬───────────────────┘
> > >    │tailcall
> > >    │BPF_REDIRECT_MAP (using CPUMAP bpf type)
> > >    ▼
> > > ┌──────────────────────┐
> > > │xdp.frags/cpumap      │
> > > │                      │
> > > └──┬───────────────────┘
> > >    │BPF_REDIRECT to veth (*potential trigger for issue)
> > >    │
> > >    ▼
> > > ┌──────┐
> > > │veth  │
> > > │      │
> > > └──┬───┘
> > >    │
> > >    │
> > >    ▼
> > >
> > > Here an mlx5 NIC has an xdp.frags program attached which tailcalls via
> > > BPF_REDIRECT_MAP into an xdp.frags/cpumap program. For our reproducer we
> > > can choose a random valid CPU to reproduce the issue. Once that packet
> > > reaches the xdp.frags/cpumap program we then do another BPF_REDIRECT
> > > to a veth device which has an XDP program that redirects to an
> > > XSKMAP. It wasn’t until we added the additional BPF_REDIRECT to the
> > > veth device that we noticed this issue.
> >
> > Would it be possible to try to use a single program that redirects to
> > the XSKMAP and check that the issue reproduces?
> >
> I forgot to ask: what is the MTU size?
> Also, are you setting any other special config on the device?
>
> Thanks,
> Dragos

Dragos,

The device has the following settings:

2: ext0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1600 xdp qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 1c:34:da:48:7f:e8 brd ff:ff:ff:ff:ff:ff promiscuity 0 allmulti 0
    minmtu 68 maxmtu 9978 addrgenmode eui64 numtxqueues 520 numrxqueues 65
    gso_max_size 65536 gso_max_segs 65535 tso_max_size 524280 tso_max_segs 65535
    gro_max_size 65536 gso_ipv4_max_size 65536 gro_ipv4_max_size 65536
    portname p0 switchid e87f480003da341c parentbus pci parentdev 0000:c1:00.0
    prog/xdp id 173

To help narrow down the problem, we tested the following packet paths:
1) Fails: XDP (mlx5 nic) -> CPU MAP -> DEV MAP (to veth) -> XSK
2) Works: XDP (mlx5 nic) -> CPU MAP -> Linux routing (to veth) -> XSK
3) Works: XDP (mlx5 nic) -> Linux routing (to veth) -> XSK

Given those cases, I would expect a single program that redirects just to
the XSKMAP to also work fine.

Thanks,
--chris
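
P.S. In case it helps with putting together a reproducer, the failing path (1)
boils down to two small XDP programs along the lines of the sketch below. This
is only an illustration of the redirect chain, not our production code: the map
sizes, the hard-coded CPU index, and the assumption that dev_map[0] holds the
veth ifindex are placeholders, and the encap/XDP_TX/tailcall parts of the real
xdp.frags program are omitted.

/* Sketch of the mlx5-side programs: the main xdp.frags program sends
 * every packet to a CPUMAP entry, and the cpumap program then redirects
 * it to the veth device via a DEVMAP. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_CPUMAP);
	__uint(max_entries, 64);
	__type(key, __u32);
	__type(value, struct bpf_cpumap_val);
} cpu_map SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_DEVMAP);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u32);
} dev_map SEC(".maps");

SEC("xdp.frags")
int xdp_rx(struct xdp_md *ctx)
{
	__u32 cpu = 2;	/* placeholder: any valid CPU reproduces it for us */

	return bpf_redirect_map(&cpu_map, cpu, XDP_PASS);
}

SEC("xdp.frags/cpumap")
int xdp_cpumap(struct xdp_md *ctx)
{
	__u32 key = 0;	/* userspace stores the veth ifindex at dev_map[0] */

	return bpf_redirect_map(&dev_map, key, XDP_PASS);
}

char _license[] SEC("license") = "GPL";

Userspace loads the cpumap program separately and plugs its fd into
bpf_cpumap_val.bpf_prog.fd when populating cpu_map.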
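On the veth side the program is the usual XSKMAP redirect, again only a sketch,
keyed on the rx queue index:

/* Sketch of the program attached to the veth device: deliver the packet
 * to the AF_XDP socket registered for the receiving queue, and pass it
 * up the stack if no socket is bound to that queue. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_XSKMAP);
	__uint(max_entries, 64);
	__type(key, __u32);
	__type(value, __u32);
} xsk_map SEC(".maps");

SEC("xdp")
int xdp_veth(struct xdp_md *ctx)
{
	return bpf_redirect_map(&xsk_map, ctx->rx_queue_index, XDP_PASS);
}

char _license[] SEC("license") = "GPL";

For the single-program test you suggested, attaching just this last program
directly to the mlx5 device should be enough to check whether the XSKMAP
redirect alone triggers the warning.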