[Bug 220069] [6.13.9] regression USB controller dies

bugzilla-daemon@xxxxxxxxxx · Thu, 01 May 2025 07:29:48 +0000

https://bugzilla.kernel.org/show_bug.cgi?id=220069

--- Comment #12 from Michał Pecio (michal.pecio@xxxxxxxxx) ---
(In reply to Claudio Wunder from comment #10)
> > I think you said you have more of those logs, is the above always appearing
> > a few seconds before "hc died"? It seems related to the 8-3 device, a VIA
> > USB 3.0 hub.
> 
> For the sample of two items I have so far, it appears that these are showing
> up. Note that on both the original regression and the current "apparent" one
> (if we can even call it a regression?), these errors above are happening. I
> will need to wait to see the next crash also happens to have said logs;
Wait, this is important. If you were seeing "Abort failed to stop command ring:
-110" instead of "xHCI host not responding to stop endpoint command" before
6.13.7 then it is at least possible, if not likely, that you were already
running into a different problem than the one fixed in 6.13.7. And it gets
doubly suspicious if you also saw "ERROR unknown event type <some number>" a
few seconds before "HC died". Do you still have those logs by any chance?
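
If it helps with digging through saved logs, here is a rough sketch of a filter
that reports what preceded each "HC died" line. It is only an illustration: it
assumes default dmesg-style lines with a "[seconds.micros]" timestamp prefix,
matches the messages quoted above as case-insensitive substrings, and the
10 second window is an arbitrary choice. The names (scan, classify, PATTERNS)
are made up for this example.

```python
import re
import sys

# Substrings of the messages quoted in this thread, matched case-insensitively.
PATTERNS = {
    "abort_failed": "abort failed to stop command ring",
    "stop_ep_timeout": "not responding to stop endpoint command",
    "unknown_event": "error unknown event type",
    "hc_died": "hc died",
}

# Assumption: default dmesg output, e.g. "[  123.456789] message".
TS_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")


def classify(message):
    """Return the PATTERNS key whose substring occurs in message, else None."""
    lower = message.lower()
    for name, needle in PATTERNS.items():
        if needle in lower:
            return name
    return None


def scan(lines, window=10.0):
    """Return a list of (died_ts, preceding), one entry per "HC died" line.

    preceding lists the (timestamp, kind) pairs of the other tracked
    messages seen at most `window` seconds before that line.
    """
    recent = []   # tracked messages seen so far, as (timestamp, kind)
    reports = []
    for line in lines:
        m = TS_RE.match(line)
        if not m:
            continue  # no timestamp prefix, skip the line
        ts = float(m.group(1))
        kind = classify(m.group(2))
        if kind is None:
            continue
        if kind == "hc_died":
            # The lower bound guards against timestamps restarting after a reboot.
            preceding = [(t, k) for t, k in recent if 0 <= ts - t <= window]
            reports.append((ts, preceding))
        else:
            recent.append((ts, kind))
    return reports


if __name__ == "__main__":
    for died_ts, preceding in scan(sys.stdin):
        print(f"HC died at {died_ts:.3f}")
        for t, kind in preceding:
            print(f"  {kind} {died_ts - t:.3f}s earlier")
```

Saved under any name, it can be fed a log on stdin, for example the output of
dmesg piped into it. A "HC died" entry with unknown_event and abort_failed
listed under it would be the pattern I am asking about.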

As Mathias Nyman explained, the known 6.13 issue was a simple driver bug:
commands were written incorrectly, chips correctly ignored them, and the driver
then incorrectly pronounced the chips dead.

Mathias further suggests that this or a similar bug may still somehow exist in
your kernel and that command abort fails because the chip believes there are no
pending commands. That is possible, but unlikely, because command abort is not
supposed to fail like that. So if you ever see a command abort timeout, either
the abort code is buggy (and it looks like no one has touched that part in ages)
or the chip is buggy in one way or another.

It would be sad if this turns out to be a regression due to the commits
initially suspected back in February:
https://bugzilla.kernel.org/show_bug.cgi?id=219824#c5

These are present in all 6.12 and higher releases from this year, so the only
supported kernels without them are the old LTS series. Do you have the means
to test one of those for a few weeks on the same HW, userspace and workload?

I could also suggest some stress tests which exercise this code (and the USB
controller). I found webcams and USB serial dongles to be particularly
suitable. Do you have any such devices at hand?
