On Fri, Jul 04, 2025 at 12:37:36PM +0000, Dragos Tatulea wrote:
> On Thu, Jul 03, 2025 at 10:49:20AM -0500, Chris Arges wrote:
> > When running iperf through a set of XDP programs we were able to crash
> > machines with NICs using the mlx5_core driver. We were able to confirm
> > that other NICs/drivers did not exhibit the same problem, and suspect
> > this could be a memory management issue in the driver code.
> > Specifically we found a WARNING at include/net/page_pool/helpers.h:277
> > mlx5e_page_release_fragmented.isra. We are able to demonstrate this
> > issue in production using hardware, but cannot easily bisect because
> > we don’t have a simple reproducer.
> >
> Thanks for the report! We will investigate.
>
> > I wanted to share stack traces in
> > order to help us further debug and understand if anyone else has run
> > into this issue. We are currently working on getting more crashdumps
> > and doing further analysis.
> >
> >
> > The test setup looks like the following:
> >   ┌─────┐
> >   │mlx5 │
> >   │NIC  │
> >   └──┬──┘
> >      │xdp ebpf program (does encap and XDP_TX)
> >      │
> >      ▼
> >   ┌──────────────────────┐
> >   │xdp.frags             │
> >   │                      │
> >   └──┬───────────────────┘
> >      │tailcall
> >      │BPF_REDIRECT_MAP (using CPUMAP bpf type)
> >      ▼
> >   ┌──────────────────────┐
> >   │xdp.frags/cpumap      │
> >   │                      │
> >   └──┬───────────────────┘
> >      │BPF_REDIRECT to veth (*potential trigger for issue)
> >      │
> >      ▼
> >   ┌──────┐
> >   │veth  │
> >   │      │
> >   └──┬───┘
> >      │
> >      │
> >      ▼
> >
> > Here an mlx5 NIC has an xdp.frags program attached which tailcalls via
> > BPF_REDIRECT_MAP into an xdp.frags/cpumap. For our reproducer we can
> > choose a random valid CPU to reproduce the issue. Once that packet
> > reaches the xdp.frags/cpumap program we then do another BPF_REDIRECT
> > to a veth device which has an XDP program which redirects to an
> > XSKMAP. It wasn’t until we added the additional BPF_REDIRECT to the
> > veth device that we noticed this issue.
> >
> Would it be possible to try to use a single program that redirects to
> the XSKMAP and check that the issue reproduces?
>
I forgot to ask: what is the MTU size?
Also, are you setting any other special config on the device?

Thanks,
Dragos
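
For anyone gathering the MTU asked about above, one possible way is to read it from sysfs. This is only a minimal sketch, assuming a Linux host that exposes /sys/class/net/<iface>/mtu. The helper name `read_mtu`, its `sysfs_root` parameter, and the interface name `eth0` are hypothetical and not taken from the report.

```python
from pathlib import Path


def read_mtu(iface: str, sysfs_root: str = "/sys/class/net") -> int:
    """Return the MTU in bytes that the kernel reports for `iface`.

    `sysfs_root` is a hypothetical parameter added only so the helper can be
    pointed at a different directory (for example in a test).
    """
    # Each network device exposes its MTU as a decimal string in sysfs.
    mtu_file = Path(sysfs_root) / iface / "mtu"
    return int(mtu_file.read_text().strip())
```

Calling `read_mtu("eth0")`, with `eth0` replaced by the actual mlx5 netdev name, would give the value to report back. It raises `FileNotFoundError` if no such interface exists.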