On Tue, 8 Jul 2025, Matthew W Carlis wrote:
> On Fri, 4 Jul 2025, Ilpo Järvinen wrote:
> > The other question still stands though, why is LBMS not reset? Perhaps
> > DPC should clear LBMS in some places (that is, call pcie_reset_lbms()).
> > Have you considered that?
>
> Initially we started to observe this when physically removing and
> reinserting devices in a kernel version with the quirk, but without the
> bandwidth controller driver. I think there is a problem with any place
> where the link would be expected to go down (dpc, hpc, etc) & then
> carrying forward LBMS into the next time the link comes up.

Are you saying there's still a problem in hpc? Since the introduction of
bwctrl, remove_board() in pciehp has had pcie_reset_lbms() (or its
equivalent).

As I already mentioned, for DPC I agree, it likely should reset LBMS
somewhere.

We also clear LBMS after retraining so that it is not retained beyond the
completion of the retraining.

What other things are included in that "etc"?

> Should it not matter how long ago LBMS
> was asserted before we invoke a TLS modification?

To some extent, yes, which is why we call pcie_reset_lbms() in a few
places.

> It also looks like card
> presence is enough for the kernel to believe the link should train & enter
> the quirk function without ever having seen LNKSTA_DLLLA or LNKSTA_LT.

Without LBMS, the quirk won't do anything (except try to raise the Link
Speed if it's the particular device on the whitelist).

> I wonder if it shouldn't have to see some kind of actual link activity
> as a prereq to entering the quirk.

How would you observe that "link activity"? Doesn't LBMS itself imply
that "link activity" occurred? Any good suggestions on how to realize that
check more precisely, to differentiate whether there was some link
activity or not?
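
To make the stale-LBMS point concrete, here is a toy userspace model. It is
only a sketch, not kernel code: struct toy_port and the toy_*() helpers are
made-up names, whether a port latches LBMS on link down is treated as a
parameter, and the quirk's entry check is reduced to "LBMS set while DLLLA
is clear". Only the PCI_EXP_LNKSTA_* bit values are taken from pci_regs.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Link Status bit values as defined in include/uapi/linux/pci_regs.h. */
#define PCI_EXP_LNKSTA_LT	0x0800	/* Link Training */
#define PCI_EXP_LNKSTA_DLLLA	0x2000	/* Data Link Layer Link Active */
#define PCI_EXP_LNKSTA_LBMS	0x4000	/* Link Bandwidth Management Status */

/* Hypothetical toy model of a downstream port; not a kernel structure. */
struct toy_port {
	uint16_t lnksta;
};

/* LBMS is sticky until software clears it; model that clear here. */
static void toy_reset_lbms(struct toy_port *p)
{
	p->lnksta &= ~PCI_EXP_LNKSTA_LBMS;
}

/*
 * Link goes down (hot-remove, DPC, ...). Assumption: some ports latch
 * LBMS at this point and others do not, hence the latches_lbms flag.
 * reset_lbms models the down path clearing the stale bit.
 */
static void toy_link_down(struct toy_port *p, bool latches_lbms,
			  bool reset_lbms)
{
	p->lnksta &= ~(PCI_EXP_LNKSTA_DLLLA | PCI_EXP_LNKSTA_LT);
	if (latches_lbms)
		p->lnksta |= PCI_EXP_LNKSTA_LBMS;
	if (reset_lbms)
		toy_reset_lbms(p);
}

/* Simplified stand-in for the quirk's entry condition. */
static bool toy_quirk_would_force_2_5gt(const struct toy_port *p)
{
	return (p->lnksta & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_DLLLA)) ==
	       PCI_EXP_LNKSTA_LBMS;
}
```

In this model, a port that latches LBMS on link down and never has it
cleared satisfies the quirk's condition the next time it is checked, while
the same port with the reset in the down path does not.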

> > (It sounds to me you're having this occur in multiple scenarios and I've
> > some trouble figuring out from your long descriptions what those
> > exactly are, so it's a bit challenging for me to suggest where it should
> > be done, but the surprise down certainly seems like a case where LBMS
> > information must have become stale so it should be reset, which would
> > prevent the quirk from setting 2.5GT/s)
>
> Something I found recently that was interesting - when I power off
> a slot (triggering DPC via SDES) the LBMS becomes set on Intel Root Ports,
> but in another server with a PCIe switch LBMS does not become set on the
> switch DSP if I perform the same action. I don't have any explanation for
> this difference other than "vendor specific" behavior.

If you tried this on different generations of Intel RP, you'd likely see
variations there too; that's my experience from testing bwctrl. E.g., on
some platforms, I see LBMS asserted twice from a single retraining (after
a TLS change): once while LT=1 and again after LT=0. (I don't have an
explanation for that behavior.)

> One thing that honestly doesn't make any sense to me is the ID list in the
> quirk. If the link comes up after forcing to Gen1 then it would only
> restore TLS if the device is the ASMedia switch, but also ignoring what
> device is detected downstream. If we allow ASMedia to restore the speed
> for any downstream device when we only saw the initial issue with the
> Pericom switch then why do we exclude Intel Root Ports or AMD Root Ports
> or any other bridge from the list which did not have any issues reported.

I think it's because the restore has been tested on that device
(whitelist).

Your reasoning is based on the assumption that the TLS quirk setting Link
Speed to 2.5GT/s is part of "normal" operation. My view is that those
triggerings are caused by not clearing stale LBMS in the right places. If
LBMS is not wrongly kept, the quirk is a no-op on all but that ID-listed
device.

-- 
 i.