I have reached out several times about issues caused by the pcie_failed_link_retrain() quirk. Initially we made some additional changes to try to reduce the occurrences, but I have continued to observe hot-plug slots ending up at Gen1 speed when they should not, or the quirk being invoked when the link is not actually training at all. Realistically speaking, this quirk is a large regression to hot-plug functionality in the kernel, and I am therefore submitting this patch to restrict the quirk to the ASMedia device where the LTSSM problem was actually observed in the first place.

The comment above the quirk states that the bad behavior was observed when the ASMedia ASM2824 switch was upstream of some other device, and further asserts that the issue could in theory happen if the ASM2824 were the downstream device. I believe this is why it was concluded that the quirk should be invoked on any device: if the ASM2824 is the downstream device, its device ID cannot be used to trigger the quirk action. This reasoning is flawed in the sense that the downstream case was never actually observed and may not even exist as a real configuration in any device. It may well be that the ASM2824 cannot have this issue as the downstream device; as far as I can tell the issue was never root-caused, and there is no analyzer trace.

The author had a noble goal, but it seems quite difficult to capture the correct trigger and sequence of actions for every device, considering we are trying to address an issue beyond compliance with the spec, as far as I know. In my testing I have encountered alarming rates of the quirk being invoked when it should not be, frequently degrading a link that would otherwise have trained to full speed/width. The impact to hot-plug reliability is observed to be extreme, and we therefore believe broad application of the quirk cannot be justified.
In the case of hot-insert, the rate of links being incorrectly forced to Gen1 has been observed to be as high as 15% with some U.2 NVMe drives, across several different system configurations and several different U.2 NVMe drives from different vendors. All of the systems on which we have reproduced this issue comply with PCI Express® Base Specification Revision 6.0, Appendix I, "Async Hot-Plug Reference Model for OS controlled DPC", or are very near to it. None of the systems I have tested implement the Power Controller capability. The largest occurrence of this issue has been observed on systems with out-of-band presence detect (OOB PD), but it has also been observed on systems without OOB PD (using in-band PD or DLL State Changed). Actions likely to trigger the condition where the quirk forces the link to Gen1 include physical hot-insert, slot power cycle, slot power on, and toggling of fundamental reset.

By observation I believe there are several timing hazards with DPC, especially when using EDR. DPC's expectation that the link should recover before returning from DPC handling work is additionally questionable when the port is hot-plug capable. Further, it appears the quirk can be invoked twice by the DPC driver. With HotPlug- (DPC without hot-plug), invocation of the quirk appears even more likely due to different handling around Surprise Down Errors (SDES). In my mind this makes a very complicated set of interactions even more complicated.

In the case of hot-insert, the power-up sequencing of drives and their boot times directly contribute to the quirk being invoked and the link being forced to Gen1. For example, the presence interrupt comes quickly (presence pins are first-to-mate in U.2), while the power pins are last-to-mate (ground pins second-to-mate). Presence can therefore be seen before power-up sequencing in the drive is complete.
If the drive powers up, boots, and the link becomes active just after the quirk has written Target Link Speed (TLS) to Gen1, then the drive is forced to Gen1. If the sequence takes even longer, the log shows "broken device" and "retraining failed", but DLLSC later initiates the pciehp device-add sequence again, which creates extreme confusion for most readers.

In the case of power cycling the slot (without Power Controller capability), behavior differs between OOB-PD systems and systems using DLL State Change interrupts. With OOB-PD the kernel will declare "Link Down", set the ctrl state to OFF_STATE, and remove the device, but then immediately declare "Card present" and run down the pciehp_enable_slot() path, where it runs into the quirk: slot power is off, so it does not see the link train before timing out. Disabling OOB-PD and using the recently deprecated in-band PD avoids the trap more often, since presence is synthesized by the LTSSM and only asserted when the link is active; however, link degradation was still observed in PCIe resilience torture testing. Unfortunately I don't have a meaningful characterization of the in-band PD reproductions.

With and without HotPlug capability, the quirk becomes harder to hit after pulling in the pci/pcie/bwctrl.c changes, but it is still observable in several circumstances, mainly around the handling of DPC with and without EDR. The bottom line, from my perspective, is that even with bwctrl.c we still observe a significant regression in hot-plug reliability in terms of arriving at the correct speed. In my experience, the link issue observed by the author of the quirk is most likely an incompatibility between specific devices, as opposed to something that could result from degraded link integrity or device aging, and the quirk should therefore be restricted to the particular device where it was observed.
Matthew W Carlis (1):
  PCI: pcie_failed_link_retrain() return if dev is not ASM2824

 drivers/pci/quirks.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

-- 
2.46.0