On Mon, 12 May 2025 16:02:06 +0300, Neronin, Niklas wrote: > The purpose of this series is to add some error handling and improve > warning messages. Adding more information to those messages is great and IMO it should have been done long ago instead of an xhci_dbg() that nobody enables. These errors are sporadically seen on bugzilla/linux-usb and it would be nice to know why they are happening. I have done it on my system and it helped me understand some cases. I found nothing terribly concerning, by the way. (That being said, I don't like multi-line messages: they reduce log rotation period and mix with others. I added info to existing logs.) > Fixing the root cause is another task altogether. But if you don't know why something happens, can you really fix it? In normal operation these errors just don't happen, and in abnormal operation there is no guarantee that normal recovery will work. The driver already tried to avoid error by normal means and it failed. A code with no known trigger can't be tested, so it may have bugs. And it adds maintainance burden because people fear the unknown. > Before, the only difference between a Set TR Deq command failure > and success was that, in the case of failure, a warning was printed, > and the xhci driver ring dequeue pointer was not moved. Otherwise, > a Set TR Deq command failure behaved as if it were successful. This is adequate for two most common cases that I know. Fristly, the mysterious command TRB errors on ASMedia. The endpoint is stuck (but not halted) until UAS times out and resets the whole device. Same stuck endpoint behavior, but no TRB error, is seen on other HCs. Secondly, xhci_handle_halted_endpoint() doesn't queue Reset Endpoint when a SuperSpeed device is unplugged from the root hub. The endpoint stays halted forever, so Set TR Dequeue is unnecessary and it fails. Simply ignoring the failure is not a bug. Waiting for the endpoint to reset (which never happens) turns it into a bug. (This series generally fails to check for errors of command queuing functions, which are not guaranteed to succeed.) > In my opinion, instead of ignoring the Set TR Deq errors it would be > better to retry the command if the error is due to something easily > fixed, such as wrong EP state or Slot state. There is nothing easy about stopping an endpoint, last year we spent a few weeks with Mathias making it work. See commits fd9d55d190c0 and 42b758137601, in this order ;) We won't call xhci_invalidate_cancelled_tds() unless we are strongly convinced that the endpoint is in the Stopped state, or the HC seems to be completely disfunctional. (And I just facepalmed writing this, because it looks like 28a76fcc4c85 does more harm than good, I think will need to revert it.) There is one case which cannot be detected by handle_cmd_stop_ep(): Stop Endpoint completes sucessfully, and later the endpoint changes state without a doorbell ring. ASMedia HCs actually do such things when they manage to corrupt their internal endpoint state. There are some cases when stopping and restarting a (periodic?) endpoint appears to corrupt the control endpoint; it generates no events or Transaction Errors for no reason, changes state together with other endpoints, etc. Resetting the xHC is the only solution I know. Michal