On Mon, May 05, 2025 at 05:23:02PM +0200, Tobias Böhm wrote:
> On 24.04.25 at 12:19, Tobias Böhm wrote:
> > On 23.04.25 at 20:39, Maciej Fijalkowski wrote:
> > > On Wed, Apr 23, 2025 at 04:20:07PM +0200, Marcus Wichelmann wrote:
> > > > On 17.04.25 at 16:47, Maciej Fijalkowski wrote:
> > > > > On Fri, Apr 11, 2025 at 10:14:57AM +0200, Michal Kubiak wrote:
> > > > > > On Thu, Apr 10, 2025 at 04:54:35PM +0200, Marcus Wichelmann wrote:
> > > > > > > On 10.04.25 at 16:30, Michal Kubiak wrote:
> > > > > > > > On Wed, Apr 09, 2025 at 05:17:49PM +0200, Marcus Wichelmann wrote:
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > in a setup where I use native XDP to redirect packets to a
> > > > > > > > > bonding interface that's backed by two ixgbe slaves, I noticed
> > > > > > > > > that the ixgbe driver constantly resets the NIC with the
> > > > > > > > > following kernel output:
> > > > > > > > >
> > > > > > > > > ixgbe 0000:01:00.1 ixgbe-x520-2: Detected Tx Unit Hang (XDP)
> > > > > > > > >   Tx Queue        <4>
> > > > > > > > >   TDH, TDT        <17e>, <17e>
> > > > > > > > >   next_to_use     <181>
> > > > > > > > >   next_to_clean   <17e>
> > > > > > > > > tx_buffer_info[next_to_clean]
> > > > > > > > >   time_stamp      <0>
> > > > > > > > >   jiffies         <10025c380>
> > > > > > > > > ixgbe 0000:01:00.1 ixgbe-x520-2: tx hang 19 detected on queue 4, resetting adapter
> > > > > > > > > ixgbe 0000:01:00.1 ixgbe-x520-2: initiating reset due to tx timeout
> > > > > > > > > ixgbe 0000:01:00.1 ixgbe-x520-2: Reset adapter
> > > > > > > > >
> > > > > > > > > This only occurs in combination with a bonding interface and
> > > > > > > > > XDP, so I don't know if this is an issue with ixgbe or the
> > > > > > > > > bonding driver. I first discovered this with Linux 6.8.0-57,
> > > > > > > > > but kernels 6.14.0 and 6.15.0-rc1 show the same issue.
> > > > > > > > >
> > > > > > > > > I managed to reproduce this bug in a lab environment. Here are
> > > > > > > > > some details about my setup and the steps to reproduce the bug:
> > > > > > > > >
> > > > > > > > > [...]
> > > > > > > > >
> > > > > > > > > Do you have any ideas what may be causing this issue or what I
> > > > > > > > > can do to diagnose this further?
> > > > > > > > >
> > > > > > > > > Please let me know if I should provide any more information.
> > > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > > > Marcus
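For context on the setup: the redirect step in a reproducer like this is typically a small XDP program that sends every frame to the bond's ifindex. A minimal sketch, assuming a libbpf toolchain and an illustrative ifindex (this is not Marcus' actual reproducer program):

    /* xdp_redirect_bond.c - illustrative sketch only.
     * Redirects every received frame to another interface, e.g. the
     * bonding device on top of the two ixgbe slaves.
     */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* Assumed ifindex of the bond; look it up with "ip link" and set
     * it before loading. */
    volatile const __u32 bond_ifindex = 4;

    SEC("xdp")
    int xdp_redirect_bond(struct xdp_md *ctx)
    {
            return bpf_redirect(bond_ifindex, 0);
    }

    char _license[] SEC("license") = "GPL";

Such a program would be attached in native mode on an ingress port with, for example, "ip link set dev <port> xdpdrv obj xdp_redirect_bond.o sec xdp".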
> > > > > > > [...]
> > > > > >
> > > > > > Hi Marcus,
> > > > > >
> > > > > > > thank you for looking into it. And not even 24 hours after my
> > > > > > > report, I'm very impressed! ;)
> > > > > >
> > > > > > Thanks! :-)
> > > > > >
> > > > > > > Interesting. I just tried again but had no luck yet with
> > > > > > > reproducing it without a bonding interface. May I ask what your
> > > > > > > setup looks like?
> > > > > >
> > > > > > For now, I've just grabbed the first available system with the HW
> > > > > > controlled by the "ixgbe" driver. In my case it was:
> > > > > >
> > > > > > Ethernet controller: Intel Corporation Ethernet Controller X550
> > > > > >
> > > > > > Also, for my first attempt, I didn't use the upstream kernel - I
> > > > > > just tried the kernel installed on that system. It was the Fedora
> > > > > > kernel:
> > > > > >
> > > > > > 6.12.8-200.fc41.x86_64
> > > > > >
> > > > > > I think that may be the "beauty" of timing issues - sometimes you
> > > > > > can change just one piece in your system and get a completely
> > > > > > different replication ratio. Anyway, the higher the repro
> > > > > > probability, the easier it is to debug the timing problem. :-)
> > > > >
> > > > > Hi Marcus, to break the silence, could you try to apply the diff
> > > > > below on your side?
> > > >
> > > > Hi, thank you for the patch. We've tried it, and with your changes we
> > > > can no longer trigger the error and the NIC is no longer being reset.
> > > >
> > > > > We see several issues around XDP queues in ixgbe, but before we
> > > > > proceed, let's try this small change on your side.
> > > >
> > > > How confident are you that this patch is sufficient to make things
> > > > stable enough for production use? Was it just the Tx hang detection
> > > > that was misbehaving for the XDP case, or is there an underlying
> > > > issue with the XDP queues that is not solved by disabling the
> > > > detection for it?
> > >
> > > I believe the correct way to approach this is to move the Tx hang
> > > detection into ixgbe_tx_timeout(), as that is where this logic belongs.
> > > By doing so, I suppose we would kill two birds with one stone, as the
> > > mentioned ndo is called by the netdev watchdog, which does not cover
> > > XDP Tx queues.
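For readers following along, the check in question lives in ixgbe's Tx descriptor cleaning path. A rough sketch of the idea under discussion, using current ixgbe identifiers (this is an illustration, not the exact diff that was shared):

    /* In ixgbe_clean_tx_irq() (drivers/net/ethernet/intel/ixgbe/
     * ixgbe_main.c), hang detection runs as part of descriptor cleanup.
     * One way to express "don't do this for XDP rings" - they are fed by
     * ndo_xdp_xmit()/XDP_TX and are not watched by the netdev Tx
     * watchdog - is an extra gate:
     */
    if (check_for_tx_hang(tx_ring) && !ring_is_xdp(tx_ring) &&
        ixgbe_check_tx_hang(tx_ring)) {
            /* dump TDH/TDT, next_to_use/next_to_clean, then schedule an
             * adapter reset, as seen in the log at the top of the thread */
            ixgbe_tx_timeout_reset(adapter);
    }

Moving the detection into the ndo_tx_timeout callback instead, as proposed above, would make the watchdog the single authority for stack-owned queues and leave XDP rings out of it by construction.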
> > > > With our current setup we cannot accurately verify that we have no
> > > > packet loss or stuck queues. We can do additional tests to verify
> > > > that.
> >
> > Hi Maciej,
> >
> > I'm a colleague of Marcus and involved in the testing as well.
> >
> > > > > Additional question: do you have pause frames enabled on your setup?
> > > >
> > > > Pause frames were enabled, but we can also reproduce it after
> > > > disabling them, without your patch.
> > >
> > > Please give your setup a go with pause frames enabled and the patch
> > > that I shared previously applied, and let us see the results. As said
> > > above, I do not think it is correct to check for hung queues in the Tx
> > > descriptor cleaning routine. This is a job for the ndo_tx_timeout
> > > callback.
> >
> > We have tested with pause frames enabled and the patch applied, and we
> > cannot trigger the error anymore in our lab setup.
> >
> > > > Thanks!
> > >
> > > Thanks for the feedback and testing. I'll provide a proper fix
> > > tomorrow and CC you so you can take it for a spin.
> >
> > That sounds great. We'd be happy to test the proper fix in our original
> > setup.
>
> Hi,
>
> During further testing with this patch applied, we noticed new warnings
> showing up. We've also tested with the new patch sent ("[PATCH iwl-net]
> ixgbe: fix ndo_xdp_xmit() workloads") and see the same warnings.
>
> I'm sending this observation to this thread because I'm not sure if it
> is related to those patches or if it was already present but hidden by
> the resets of the original issue reported by Marcus.
>
> After processing test traffic (~10 million packets, as described in
> Marcus' reproducer setup) and idling for a minute, the following
> warnings keep being logged as long as the NIC idles:
>
> page_pool_release_retry() stalled pool shutdown: id 968, 2 inflight 60 sec
> page_pool_release_retry() stalled pool shutdown: id 963, 2 inflight 60 sec
> page_pool_release_retry() stalled pool shutdown: id 968, 2 inflight 120 sec
> page_pool_release_retry() stalled pool shutdown: id 963, 2 inflight 120 sec
> page_pool_release_retry() stalled pool shutdown: id 968, 2 inflight 181 sec
> page_pool_release_retry() stalled pool shutdown: id 963, 2 inflight 181 sec
> page_pool_release_retry() stalled pool shutdown: id 968, 2 inflight 241 sec
> page_pool_release_retry() stalled pool shutdown: id 963, 2 inflight 241 sec
>
> Just sending a single packet makes the warnings stop being logged.
>
> After sending heavy test traffic again, new warnings start to be logged
> after a minute of idling:
>
> page_pool_release_retry() stalled pool shutdown: id 987, 2 inflight 60 sec
> page_pool_release_retry() stalled pool shutdown: id 979, 2 inflight 60 sec
> page_pool_release_retry() stalled pool shutdown: id 987, 2 inflight 120 sec
> page_pool_release_retry() stalled pool shutdown: id 979, 2 inflight 120 sec
>
> Detaching the XDP program stops the warnings as well.
>
> As before, pause frames were enabled.
>
> Just like with the original issue, we were not always able to reproduce
> those warnings. With more traffic, the chances of triggering it seem to
> be higher.
>
> Please let me know if I should provide any further information.

I can't reproduce this on my system, but FWIW these warnings come from the
page pool created by xdp-trafficgen; my bet is that the ixgbe Tx cleaning
routine misses two entries for some reason. What are your ring sizes? If
you insist, I can provide a patch that optimizes the Tx cleaning
processing, and we can see if that silences the warnings on your side.

>
> Thanks,
> Tobias
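A note for anyone hitting the same message: the warning fires from page_pool's deferred teardown. When a pool is destroyed while pages are still outstanding (for example, parked in a Tx ring that has not been cleaned yet), a delayed worker keeps rechecking and warns roughly once a minute, which matches the 60/120/181/241 sec cadence above; the first new packet triggers a Tx clean, returns the stragglers, and the pool finally goes away. A paraphrased sketch of that retry logic (not the verbatim net/core/page_pool.c source):

    /* Paraphrased from net/core/page_pool.c for illustration. */
    static void page_pool_release_retry(struct work_struct *wq)
    {
            struct delayed_work *dwq = to_delayed_work(wq);
            struct page_pool *pool = container_of(dwq, typeof(*pool),
                                                  release_dw);
            int inflight;

            /* page_pool_release() returns how many pages are still out;
             * the pool cannot be freed until this drops to zero. */
            inflight = page_pool_release(pool);
            if (!inflight)
                    return;

            /* Upstream rate-limits this to about once a minute. */
            pr_warn("%s() stalled pool shutdown: id %u, %d inflight %d sec\n",
                    __func__, pool->user.id, inflight,
                    (int)((jiffies - pool->defer_start) / HZ));

            /* Still not ready to be disconnected - retry later. */
            schedule_delayed_work(&pool->release_dw, DEFER_TIME);
    }

As for the ring-size question, "ethtool -g <interface>" reports the configured Rx/Tx ring sizes on the ixgbe ports.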