On 6/3/25 11:57 PM, James Bottomley wrote:
> On Tue, 2025-06-03 at 07:41 -0700, Christoph Hellwig wrote:
>> [taking this private to discuss the mpt drivers]
>>
>>> Hmmm... DID_SOFT_ERROR... Normally, this is an immediate retry as
>>> this normally is used to indicate that a command is a collateral
>>> abort due to an NCQ error, and per ATA spec, that command should be
>>> retried. However, the *BAD* thing about Broadcom HBAs using this is
>>> that it increments the command retry counter, so if a command ends
>>> up being retried more than 5 times due to other commands failing,
>>> the command runs out of retries and is failed like this. The
>>> command retry counter should *not* be incremented for NCQ
>>> collateral aborts. I tried to fix this, but it is impossible as we
>>> actually do not know if this is a collateral abort or something
>>> else. The HBA events used to handle completion do not allow
>>> differentiation. Waiting on Broadcom to do something about this
>>> (the mpi3mr HBA driver has the same nasty issue).
>>
>> Maybe we should just change the mpt3 sas/mr drivers to use
>> DID_SOFT_ERROR less? In fact there's not really a whole lot of
>> DID_SOFT_ERROR users otherwise, and there's probably better status
>> codes whatever they are doing can be translated to that do not
>> increment the retry counter.
>
> The status code that does that (retry without incrementing the counter)
> is DID_IMM_RETRY. The driver has to be a bit careful about using this
> because we can get into infinite retry loops.

James,

Thank you for the information. Will have a try again at changing the driver
to use this.

-- 
Damien Le Moal
Western Digital Research