On Thu, Jul 03, 2025 at 10:49:20AM -0500, Chris Arges wrote:
> When running iperf through a set of XDP programs we were able to crash
> machines with NICs using the mlx5_core driver. We were able to confirm
> that other NICs/drivers did not exhibit the same problem, and suspect
> this could be a memory management issue in the driver code.
> Specifically we found a WARNING at include/net/page_pool/helpers.h:277
> mlx5e_page_release_fragmented.isra. We are able to demonstrate this
> issue in production using hardware, but cannot easily bisect because
> we don’t have a simple reproducer.
>
Thanks for the report! We will investigate.

> I wanted to share stack traces in
> order to help us further debug and understand if anyone else has run
> into this issue. We are currently working on getting more crashdumps
> and doing further analysis.
>
>
> The test setup looks like the following:
> ┌─────┐
> │mlx5 │
> │NIC  │
> └──┬──┘
>    │xdp ebpf program (does encap and XDP_TX)
>    │
>    ▼
> ┌──────────────────────┐
> │xdp.frags             │
> │                      │
> └──┬───────────────────┘
>    │tailcall
>    │BPF_REDIRECT_MAP (using CPUMAP bpf type)
>    ▼
> ┌──────────────────────┐
> │xdp.frags/cpumap      │
> │                      │
> └──┬───────────────────┘
>    │BPF_REDIRECT to veth (*potential trigger for issue)
>    │
>    ▼
> ┌──────┐
> │veth  │
> │      │
> └──┬───┘
>    │
>    │
>    ▼
>
> Here an mlx5 NIC has an xdp.frags program attached which tailcalls via
> BPF_REDIRECT_MAP into an xdp.frags/cpumap. For our reproducer we can
> choose a random valid CPU to reproduce the issue. Once that packet
> reaches the xdp.frags/cpumap program we then do another BPF_REDIRECT
> to a veth device which has an XDP program which redirects to an
> XSKMAP. It wasn’t until we added the additional BPF_REDIRECT to the
> veth device that we noticed this issue.
>
Would it be possible to try to use a single program that redirects to the
XSKMAP and check that the issue reproduces?

> When running with 6.12.30 to 6.12.32 kernels we are able to see the
> following KASAN use-after-free WARNINGs followed by a page fault which
> crashes the machine. We have not been able to test earlier or later
> kernels. I’ve tried to map symbols to lines of code for clarity.
>
Thanks for the KASAN reports, they are very useful. Keep us posted if you have
other updates. A first quick look didn't reveal anything obvious from our side
but we will keep looking.

Thanks,
Dragos