[PATCH v2 0/1] PCI: pcie_failed_link_retrain() return if dev is not ASM2824

Matthew W Carlis <mattc@xxxxxxxxxxxxxxx> · Thu, 3 Jul 2025 17:53:13 -0600

On Thu, 3 Jul 2025, Ilpo Järvinen wrote:
> Is this mainly related to some artificial test that rapidly fires event 
> after another (which is known to confuse the quirk)? ...I mean, you say 
> "extremely likely".

I wouldn't describe the test as "rapidly firing" events, because we leave
conservative delays between injections (we wait for DLLA and for IO to the
nvme block device to succeed before potentially injecting again). In any case,
the test results are clearly worse when moving from a kernel without the
quirk to a kernel with it, which is a regression in my mind.

> I suppose when the problem occurs and the bridge remains at 2.5GT/s, is it 
> possible to restore the higher speed using the pcie_cooling device 
> associated with the bridge / bwctrl? You can find the correct cooling 
> device with this:

Yes, the problem is that a device is forced to 2.5GT/s when it should not have
been. I did not test with the patches for CONFIG_PCIE_THERMAL because our drives
would not need thermal management by the kernel, but if I use "setpci" to
restore TLS & then write the link retrain bit, the link arrives at its
maximum speed (Gen3/Gen4/Gen5, depending on the link).
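
For reference, the manual recovery amounts to two read-modify-write steps. The
sketch below is only an illustration of the bit arithmetic, not the exact
commands I ran: the helper names are made up for this example, the offsets are
relative to the PCI Express capability, and it assumes the Max Link Speed field
of Link Capabilities is the speed worth restoring.

```python
# Offsets are relative to the PCI Express capability structure.
PCI_EXP_LNKCAP = 0x0C         # Link Capabilities (32 bit)
PCI_EXP_LNKCTL = 0x10         # Link Control (16 bit)
PCI_EXP_LNKCTL2 = 0x30        # Link Control 2 (16 bit)

PCI_EXP_LNKCAP_SLS = 0x000F   # Max Link Speed, bits 3:0
PCI_EXP_LNKCTL_RL = 0x0020    # Retrain Link, bit 5
PCI_EXP_LNKCTL2_TLS = 0x000F  # Target Link Speed, bits 3:0


def restore_tls(lnkcap: int, lnkctl2: int) -> int:
    """Hypothetical helper: new LNKCTL2 value with TLS set to Max Link Speed."""
    max_speed = lnkcap & PCI_EXP_LNKCAP_SLS
    # Clear the old TLS field, keep every other bit, then insert the new speed.
    return (lnkctl2 & ~PCI_EXP_LNKCTL2_TLS & 0xFFFF) | max_speed


def request_retrain(lnkctl: int) -> int:
    """Hypothetical helper: new LNKCTL value with the Retrain Link bit set."""
    return (lnkctl | PCI_EXP_LNKCTL_RL) & 0xFFFF


if __name__ == "__main__":
    # Example values only: a port whose Max Link Speed encodes 16GT/s (4)
    # and whose TLS was left at 2.5GT/s (1).
    lnkcap, lnkctl2, lnkctl = 0x00400044, 0x0041, 0x0040
    print(f"LNKCTL2 <- {restore_tls(lnkcap, lnkctl2):#06x}")
    print(f"LNKCTL  <- {request_retrain(lnkctl):#06x}")
```

The first value goes to Link Control 2 and the second to Link Control of the
downstream port above the drive, in that order.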

I have other vendor drives as well, but we design and build our own drives
with our own firmware & therefore are able to determine from firmware logging
in the drive when the link was most likely guided to 2.5GT/s by TLS. We are
also able to see the 2.5GT/s value in the TLS register when it happens. I have
less visibility into drives from other vendors in terms of LTSSM transitions
without hooking up an analyzer.