Re: Problematic Set TR Deq error handling series in xhci-next

Michał Pecio <michal.pecio@xxxxxxxxx> · Tue, 13 May 2025 12:29:52 +0200

On Mon, 12 May 2025 16:02:06 +0300, Neronin, Niklas wrote:
> The purpose of this series is to add some error handling and improve
> warning messages.

Adding more information to those messages is great and IMO it should
have been done long ago instead of an xhci_dbg() that nobody enables.
These errors are sporadically seen on bugzilla/linux-usb and it would
be nice to know why they are happening.

I have done it on my system and it helped me understand some cases.
I found nothing terribly concerning, by the way.

(That being said, I don't like multi-line messages: they reduce log
rotation period and mix with others. I added info to existing logs.)

> Fixing the root cause is another task altogether.

But if you don't know why something happens, can you really fix it?

In normal operation these errors just don't happen, and in abnormal
operation there is no guarantee that normal recovery will work. The
driver already tried to avoid error by normal means and it failed.

A code with no known trigger can't be tested, so it may have bugs.
And it adds maintainance burden because people fear the unknown.

> Before, the only difference between a Set TR Deq command failure
> and success was that, in the case of failure, a warning was printed,
> and the xhci driver ring dequeue pointer was not moved. Otherwise,
> a Set TR Deq command failure behaved as if it were successful.

This is adequate for two most common cases that I know. Fristly, the
mysterious command TRB errors on ASMedia. The endpoint is stuck (but
not halted) until UAS times out and resets the whole device. Same
stuck endpoint behavior, but no TRB error, is seen on other HCs.

Secondly, xhci_handle_halted_endpoint() doesn't queue Reset Endpoint
when a SuperSpeed device is unplugged from the root hub. The endpoint
stays halted forever, so Set TR Dequeue is unnecessary and it fails.

Simply ignoring the failure is not a bug. Waiting for the endpoint to
reset (which never happens) turns it into a bug.

(This series generally fails to check for errors of command queuing
functions, which are not guaranteed to succeed.)

> In my opinion, instead of ignoring the Set TR Deq errors it would be
> better to retry the command if the error is due to something easily
> fixed, such as wrong EP state or Slot state.

There is nothing easy about stopping an endpoint, last year we spent
a few weeks with Mathias making it work.

See commits fd9d55d190c0 and 42b758137601, in this order ;)

We won't call xhci_invalidate_cancelled_tds() unless we are strongly
convinced that the endpoint is in the Stopped state, or the HC seems
to be completely disfunctional. (And I just facepalmed writing this,
because it looks like 28a76fcc4c85 does more harm than good, I think
will need to revert it.)

There is one case which cannot be detected by handle_cmd_stop_ep():
Stop Endpoint completes sucessfully, and later the endpoint changes
state without a doorbell ring. ASMedia HCs actually do such things
when they manage to corrupt their internal endpoint state. There
are some cases when stopping and restarting a (periodic?) endpoint
appears to corrupt the control endpoint; it generates no events or
Transaction Errors for no reason, changes state together with other
endpoints, etc. Resetting the xHC is the only solution I know.

Michal