On Mon, May 05, 2025 at 05:23:02PM +0200, Tobias Böhm wrote:
> On 24.04.25 at 12:19, Tobias Böhm wrote:
> > On 23.04.25 at 20:39, Maciej Fijalkowski wrote:
> > > On Wed, Apr 23, 2025 at 04:20:07PM +0200, Marcus Wichelmann wrote:
> > > > On 17.04.25 at 16:47, Maciej Fijalkowski wrote:
> > > > > On Fri, Apr 11, 2025 at 10:14:57AM +0200, Michal Kubiak wrote:
> > > > > > On Thu, Apr 10, 2025 at 04:54:35PM +0200, Marcus Wichelmann wrote:
> > > > > > > On 10.04.25 at 16:30, Michal Kubiak wrote:
> > > > > > > > On Wed, Apr 09, 2025 at 05:17:49PM +0200, Marcus Wichelmann wrote:
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > in a setup where I use native XDP to redirect packets to a
> > > > > > > > > bonding interface that's backed by two ixgbe slaves, I noticed
> > > > > > > > > that the ixgbe driver constantly resets the NIC with the
> > > > > > > > > following kernel output:
> > > > > > > > >
> > > > > > > > > ixgbe 0000:01:00.1 ixgbe-x520-2: Detected Tx Unit Hang (XDP)
> > > > > > > > >   Tx Queue        <4>
> > > > > > > > >   TDH, TDT        <17e>, <17e>
> > > > > > > > >   next_to_use     <181>
> > > > > > > > >   next_to_clean   <17e>
> > > > > > > > > tx_buffer_info[next_to_clean]
> > > > > > > > >   time_stamp      <0>
> > > > > > > > >   jiffies         <10025c380>
> > > > > > > > > ixgbe 0000:01:00.1 ixgbe-x520-2: tx hang 19 detected on queue 4, resetting adapter
> > > > > > > > > ixgbe 0000:01:00.1 ixgbe-x520-2: initiating reset due to tx timeout
> > > > > > > > > ixgbe 0000:01:00.1 ixgbe-x520-2: Reset adapter
> > > > > > > > >
> > > > > > > > > This only occurs in combination with a bonding interface and
> > > > > > > > > XDP, so I don't know if this is an issue with ixgbe or the
> > > > > > > > > bonding driver. I first discovered this with Linux 6.8.0-57,
> > > > > > > > > but kernels 6.14.0 and 6.15.0-rc1 show the same issue.
> > > > > > > > >
> > > > > > > > > I managed to reproduce this bug in a lab environment. Here are
> > > > > > > > > some details about my setup and the steps to reproduce the bug:
> > > > > > > > >
> > > > > > > > > [...]
> > > > > > > > >
> > > > > > > > > Do you have any ideas what may be causing this issue or what I
> > > > > > > > > can do to diagnose this further?
> > > > > > > > >
> > > > > > > > > Please let me know if I should provide any more information.
> > > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > > > Marcus
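For context on the setup: the redirect step in a reproducer like this is typically a small XDP program that sends every frame to the bond's ifindex. A minimal sketch, assuming a libbpf toolchain and an illustrative ifindex (this is not Marcus' actual reproducer program):

    /* xdp_redirect_bond.c - illustrative sketch only.
     * Redirects every received frame to another interface, e.g. the
     * bonding device on top of the two ixgbe slaves.
     */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* Assumed ifindex of the bond; look it up with "ip link" and set
     * it before loading. */
    volatile const __u32 bond_ifindex = 4;

    SEC("xdp")
    int xdp_redirect_bond(struct xdp_md *ctx)
    {
            return bpf_redirect(bond_ifindex, 0);
    }

    char _license[] SEC("license") = "GPL";

Such a program would be attached in native mode on an ingress port with, for example, "ip link set dev <port> xdpdrv obj xdp_redirect_bond.o sec xdp".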
> > > > > > > [...]
> > > > > >
> > > > > > Hi Marcus,
> > > > > >
> > > > > > > thank you for looking into it. And not even 24 hours after my
> > > > > > > report, I'm very impressed! ;)
> > > > > >
> > > > > > Thanks! :-)
> > > > > >
> > > > > > > Interesting. I just tried again but had no luck yet with
> > > > > > > reproducing it without a bonding interface. May I ask what your
> > > > > > > setup looks like?
> > > > > >
> > > > > > For now, I've just grabbed the first available system with the HW
> > > > > > controlled by the "ixgbe" driver. In my case it was:
> > > > > >
> > > > > > Ethernet controller: Intel Corporation Ethernet Controller X550
> > > > > >
> > > > > > Also, for my first attempt, I didn't use the upstream kernel - I
> > > > > > just tried the kernel installed on that system. It was the Fedora
> > > > > > kernel:
> > > > > >
> > > > > > 6.12.8-200.fc41.x86_64
> > > > > >
> > > > > > I think that may be the "beauty" of timing issues - sometimes you
> > > > > > can change just one piece in your system and get a completely
> > > > > > different replication ratio. Anyway, the higher the repro
> > > > > > probability, the easier it is to debug the timing problem. :-)
> > > > >
> > > > > Hi Marcus, to break the silence, could you try to apply the diff
> > > > > below on your side?
> > > >
> > > > Hi, thank you for the patch. We've tried it, and with your changes we
> > > > can no longer trigger the error and the NIC is no longer being reset.
> > > >
> > > > > We see several issues around XDP queues in ixgbe, but before we
> > > > > proceed, let's try this small change on your side.
> > > >
> > > > How confident are you that this patch is sufficient to make things
> > > > stable enough for production use? Was it just the Tx hang detection
> > > > that was misbehaving for the XDP case, or is there an underlying
> > > > issue with the XDP queues that is not solved by disabling the
> > > > detection for it?
> > >
> > > I believe the correct way to approach this is to move the Tx hang
> > > detection into ixgbe_tx_timeout(), as that is where this logic belongs.
> > > By doing so, I suppose we would kill two birds with one stone, as the
> > > mentioned ndo is called by the netdev watchdog, which does not cover
> > > XDP Tx queues.
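For readers following along, the check in question lives in ixgbe's Tx descriptor cleaning path. A rough sketch of the idea under discussion, using current ixgbe identifiers (this is an illustration, not the exact diff that was shared):

    /* In ixgbe_clean_tx_irq() (drivers/net/ethernet/intel/ixgbe/
     * ixgbe_main.c), hang detection runs as part of descriptor cleanup.
     * One way to express "don't do this for XDP rings" - they are fed by
     * ndo_xdp_xmit()/XDP_TX and are not watched by the netdev Tx
     * watchdog - is an extra gate:
     */
    if (check_for_tx_hang(tx_ring) && !ring_is_xdp(tx_ring) &&
        ixgbe_check_tx_hang(tx_ring)) {
            /* dump TDH/TDT, next_to_use/next_to_clean, then schedule an
             * adapter reset, as seen in the log at the top of the thread */
            ixgbe_tx_timeout_reset(adapter);
    }

Moving the detection into the ndo_tx_timeout callback instead, as proposed above, would make the watchdog the single authority for stack-owned queues and leave XDP rings out of it by construction.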
> > > > With our current setup we cannot accurately verify that we have no
> > > > packet loss or stuck queues. We can do additional tests to verify
> > > > that.
> >
> > Hi Maciej,
> >
> > I'm a colleague of Marcus and involved in the testing as well.
> >
> > > > > Additional question: do you have pause frames enabled on your setup?
> > > >
> > > > Pause frames were enabled, but we can also reproduce it after
> > > > disabling them, without your patch.
> > >
> > > Please give your setup a go with pause frames enabled and the patch
> > > that I shared previously applied, and let us see the results. As said
> > > above, I do not think it is correct to check for hung queues in the Tx
> > > descriptor cleaning routine. This is a job for the ndo_tx_timeout
> > > callback.
> >
> > We have tested with pause frames enabled and the patch applied, and we
> > cannot trigger the error anymore in our lab setup.
> >
> > > > Thanks!
> > >
> > > Thanks for the feedback and testing. I'll provide a proper fix
> > > tomorrow and CC you so you can take it for a spin.
> >
> > That sounds great. We'd be happy to test the proper fix in our original
> > setup.
>
> Hi,
>
> During further testing with this patch applied, we noticed new warnings
> showing up. We've also tested with the new patch sent ("[PATCH iwl-net]
> ixgbe: fix ndo_xdp_xmit() workloads") and see the same warnings.
>
> I'm sending this observation to this thread because I'm not sure if it
> is related to those patches or if it was already present but hidden by
> the resets of the original issue reported by Marcus.
>
> After processing test traffic (~10 million packets, as described in
> Marcus' reproducer setup) and idling for a minute, the following
> warnings keep being logged as long as the NIC idles:
>
> page_pool_release_retry() stalled pool shutdown: id 968, 2 inflight 60 sec
> page_pool_release_retry() stalled pool shutdown: id 963, 2 inflight 60 sec
> page_pool_release_retry() stalled pool shutdown: id 968, 2 inflight 120 sec
> page_pool_release_retry() stalled pool shutdown: id 963, 2 inflight 120 sec
> page_pool_release_retry() stalled pool shutdown: id 968, 2 inflight 181 sec
> page_pool_release_retry() stalled pool shutdown: id 963, 2 inflight 181 sec
> page_pool_release_retry() stalled pool shutdown: id 968, 2 inflight 241 sec
> page_pool_release_retry() stalled pool shutdown: id 963, 2 inflight 241 sec
>
> Just sending a single packet makes the warnings stop being logged.
>
> After sending heavy test traffic again, new warnings start to be logged
> after a minute of idling:
>
> page_pool_release_retry() stalled pool shutdown: id 987, 2 inflight 60 sec
> page_pool_release_retry() stalled pool shutdown: id 979, 2 inflight 60 sec
> page_pool_release_retry() stalled pool shutdown: id 987, 2 inflight 120 sec
> page_pool_release_retry() stalled pool shutdown: id 979, 2 inflight 120 sec
>
> Detaching the XDP program stops the warnings as well.
>
> As before, pause frames were enabled.
>
> Just like with the original issue, we were not always able to reproduce
> those warnings. With more traffic, the chances of triggering it seem to
> be higher.
>
> Please let me know if I should provide any further information.

I can't reproduce this on my system, but FWIW these warnings come from the
page pool created by xdp-trafficgen; my bet is that the ixgbe Tx cleaning
routine misses two entries for some reason. What are your ring sizes? If
you insist, I can provide a patch that optimizes the Tx cleaning
processing, and we can see if that silences the warnings on your side.

>
> Thanks,
> Tobias
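A note for anyone hitting the same message: the warning fires from page_pool's deferred teardown. When a pool is destroyed while pages are still outstanding (for example, parked in a Tx ring that has not been cleaned yet), a delayed worker keeps rechecking and warns roughly once a minute, which matches the 60/120/181/241 sec cadence above; the first new packet triggers a Tx clean, returns the stragglers, and the pool finally goes away. A paraphrased sketch of that retry logic (not the verbatim net/core/page_pool.c source):

    /* Paraphrased from net/core/page_pool.c for illustration. */
    static void page_pool_release_retry(struct work_struct *wq)
    {
            struct delayed_work *dwq = to_delayed_work(wq);
            struct page_pool *pool = container_of(dwq, typeof(*pool),
                                                  release_dw);
            int inflight;

            /* page_pool_release() returns how many pages are still out;
             * the pool cannot be freed until this drops to zero. */
            inflight = page_pool_release(pool);
            if (!inflight)
                    return;

            /* Upstream rate-limits this to about once a minute. */
            pr_warn("%s() stalled pool shutdown: id %u, %d inflight %d sec\n",
                    __func__, pool->user.id, inflight,
                    (int)((jiffies - pool->defer_start) / HZ));

            /* Still not ready to be disconnected - retry later. */
            schedule_delayed_work(&pool->release_dw, DEFER_TIME);
    }

As for the ring-size question, "ethtool -g <interface>" reports the configured Rx/Tx ring sizes on the ixgbe ports.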