On 2025-07-24 17:01:16, Dragos Tatulea wrote:
> On Wed, Jul 23, 2025 at 01:48:07PM -0500, Chris Arges wrote:
> >
> > Ok, we can reproduce this problem!
> >
> > I tried to simplify this reproducer, but it seems like what's needed is:
> > - xdp program attached to mlx5 NIC
> > - cpumap redirect
> > - device redirect (map or just bpf_redirect)
> > - frame gets turned into an skb
> > Then from another machine send many flows of UDP traffic to trigger the problem.
> >
> > I've put together a program that reproduces the issue here:
> > - https://github.com/arges/xdp-redirector
> >
> Much appreciated! I fumbled around initially, not managing to get
> traffic to the xdp_devmap stage. But further debugging revealed that GRO
> needs to be enabled on the veth devices for XDP redir to work to the
> xdp_devmap. After that I managed to reproduce your issue.
>
> Now I can start looking into it.
>

Dragos,

There was a similar reference counting issue identified in:
https://lore.kernel.org/all/20250801170754.2439577-1-kuba@xxxxxxxxxx/

Part of the commit message mentioned:
> Unfortunately for fbnic since commit f7dc3248dcfb ("skbuff: Optimization
> of SKB coalescing for page pool") core _may_ actually take two extra
> pp refcounts, if one of them is returned before driver gives up the bias
> the ret < 0 check in page_pool_unref_netmem() will trigger.

In order to help debug the mlx5 issue caused by xdp redirection, I built a
kernel with commit f7dc3248dcfb reverted, but unfortunately I was still able
to reproduce the issue.

I am happy to try some other experiments, or if there are other ideas you have.

Thanks,
--chris