Hi Chris,

Sorry for the late reply, I was on holiday.

On Thu, Aug 07, 2025 at 11:45:40AM -0500, Chris Arges wrote:
> On 2025-07-24 17:01:16, Dragos Tatulea wrote:
> > On Wed, Jul 23, 2025 at 01:48:07PM -0500, Chris Arges wrote:
> > >
> > > Ok, we can reproduce this problem!
> > >
> > > I tried to simplify this reproducer, but it seems like what's needed is:
> > > - xdp program attached to mlx5 NIC
> > > - cpumap redirect
> > > - device redirect (map or just bpf_redirect)
> > > - frame gets turned into an skb
> > > Then from another machine send many flows of UDP traffic to trigger the problem.
> > >
> > > I've put together a program that reproduces the issue here:
> > > - https://github.com/arges/xdp-redirector
> > >
> > Much appreciated! I fumbled around initially, not managing to get
> > traffic to the xdp_devmap stage. But further debugging revealed that GRO
> > needs to be enabled on the veth devices for XDP redir to work to the
> > xdp_devmap. After that I managed to reproduce your issue.
> >
> > Now I can start looking into it.
> >
>
> Dragos,
>
> There was a similar reference counting issue identified in:
> https://lore.kernel.org/all/20250801170754.2439577-1-kuba@xxxxxxxxxx/
>
> Part of the commit message mentioned:
> > Unfortunately for fbnic since commit f7dc3248dcfb ("skbuff: Optimization
> > of SKB coalescing for page pool") core _may_ actually take two extra
> > pp refcounts, if one of them is returned before driver gives up the bias
> > the ret < 0 check in page_pool_unref_netmem() will trigger.
>
> In order to help debug the mlx5 issue caused by xdp redirection, I built a
> kernel with commit f7dc3248dcfb reverted, but unfortunately I was still able
> to reproduce the issue.
Thanks for trying this.

>
> I am happy to try some other experiments, or if there are other ideas you have.
>
I am actively debugging the issue but progress is slow as it is not an easy
one. So far I have been able to trace it back to the fact that the page_pool
is returning the same page twice on allocation without having a release in
between. As this is quite weird, I think I still have to trace it back a few
more steps to find the actual issue.

Thanks,
Dragos