Re: [PATCH v2 0/1] PCI: pcie_failed_link_retrain() return if dev is not ASM2824

On Wed, 9 Jul 2025, Ilpo Järvinen wrote:

> > I wonder if it shouldn't have to see some kind of actual link activity 
> > as a prereq to entering the quirk.
> 
> How would you observe that "link activity"? Doesn't LBMS itself imply 
> "link activity" occurred?

 It does, although in this case it shouldn't have been set in the first 
place, because after reset the link never comes up (i.e. never goes into 
the Link Active state) and only keeps flipping between training and not 
training, as indicated by the LT bit.  FAOD, with the affected link the 
LBMS bit never retriggers once cleared while the link is in its broken 
state.

 Once the speed has been clamped and the link retrained, it comes up right 
away (i.e. goes into the Link Active state) and stays up steadily, also 
once the speed has been unclamped.

 I made a test once and left the system up for half a year or so.  The 
LBMS bit was set once, a couple of days after system reset.  I cleared it 
by hand and it never retriggered for the rest of the experiment, so this 
single occasion must have been a glitch and not a link quality issue.

 During that half a year the system and the link in question were both 
used heavily in remote GNU toolchain verification over a network interface 
placed downstream of the problematic link.  Traffic included NFS and SSH.  
No issues ever triggered, so I must conclude the link training issue is 
specific to speed negotiation, likely at the protocol level, rather than 
at the physical layer.

 Last year I tried to make an alternative setup using a PCIe switch option 
card using the same ASMedia device.  The card has turned out not to work 
at all (the switch shows up in configuration space, but all its 
downstream ports stay permanently down) owing to the host leaving the Vaux 
line disconnected in the slot, which is a conforming configuration.  I was 
told by the option card manufacturer this is an erratum in the ASMedia 
switch device and the workaround is to drive Vaux.  I think this just 
tells you what the quality of these devices is.  Sigh.

 Anyway, I chose to rework the card and tracked down a suitable miniature 
SMD switch to mount onto the PCB so as to let me select whether to drive 
the ASMedia device's Vaux input from the Vaux or a regular 3.3V slot 
position, but owing to other commitments I've never got round to 
completing this effort, as it requires a couple of hours of precise manual 
work at the workshop.  I'll get back to it sometime and report the results.

> Any good suggestions how to realize that check more precisely to 
> differentiate if there was some link activity or not?

 The LT bit is an obvious candidate and also how I wrote a corresponding 
quirk in U-Boot.  A problem, however, is that while in U-Boot it's fine to 
poll the LT bit busy-looping for a second or so, it's absolutely not in 
Linux, where we have the rest of the OS running.  Sampling at random 
intervals isn't going to help as we could well miss the active state.
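
 For illustration, an LT-based check of this kind could look roughly like 
the sketch below.  It is not the code of the U-Boot quirk: the accessor 
type, the replay helper and the function name are all made up for the 
example, and only the two Link Status bit values are taken from the PCIe 
specification.  It assumes a link counts as flapping if LT has been seen 
both set and clear within the polling budget without Link Active ever 
being reported.

```c
#include <stdbool.h>
#include <stdint.h>

/* Link Status register bits, per the PCIe specification. */
#define LNKSTA_LT    0x0800u	/* Link Training */
#define LNKSTA_DLLLA 0x2000u	/* Data Link Layer Link Active */

/* Hypothetical accessor: returns the current Link Status value. */
typedef uint16_t (*read_lnksta_fn)(void *ctx);

/*
 * Busy-poll Link Status up to max_polls times.  Return true if LT was
 * observed both set and clear while Link Active was never reported,
 * i.e. the link keeps flipping between training and not training.
 */
static bool link_training_flaps(read_lnksta_fn read_lnksta, void *ctx,
				unsigned int max_polls)
{
	bool seen_lt_set = false, seen_lt_clear = false;

	for (unsigned int i = 0; i < max_polls; i++) {
		uint16_t sta = read_lnksta(ctx);

		if (sta & LNKSTA_DLLLA)
			return false;	/* link came up, not the broken state */
		if (sta & LNKSTA_LT)
			seen_lt_set = true;
		else
			seen_lt_clear = true;
		if (seen_lt_set && seen_lt_clear)
			return true;
	}
	return false;
}

/* Stand-in for hardware: replays canned samples, repeating the last. */
struct lnksta_replay {
	const uint16_t *vals;
	unsigned int n, pos;	/* n must be at least 1 */
};

static uint16_t lnksta_replay_read(void *ctx)
{
	struct lnksta_replay *r = ctx;
	uint16_t v = r->vals[r->pos];

	if (r->pos + 1 < r->n)
		r->pos++;
	return v;
}
```

 In a boot loader the accessor would read the Link Status register of the 
port in question and the budget would correspond to about a second of 
polling; the replay helper merely stands in for hardware here.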

 FWIW it's all documented with the description of the quirk.

> > One thing that honestly doesn't make any sense to me is the ID list in the
> > quirk. If the link comes up after forcing to Gen1 then it would only restore
> > TLS if the device is the ASMedia switch, but also ignoring what device is
> > detected downstream. If we allow ASMedia to restore the speed for any downstream
> > device when we only saw the initial issue with the Pericom switch then why
> > do we exclude Intel Root Ports or AMD Root Ports or any other bridge from the
> > list which did not have any issues reported.
> 
> I think it's because the restore has been tested on that device 
> (whitelist).

 Correct, the idea has been to err on the side of caution.  The ASMedia 
device seems to cope well with this unclamping, so it's been listed, and 
so should any other device that has been confirmed to work.
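
 As a standalone illustration of that allowlist idea (a sketch only, not 
the kernel code; the struct, table and function names are invented for the 
example, and the vendor:device pair 1b21:2824 is assumed to identify the 
ASM2824), the restore decision reduces to a table lookup on the upstream 
device:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ID pair for a bridge on the upstream side of the link. */
struct bridge_id {
	uint16_t vendor;
	uint16_t device;
};

/* Bridges confirmed to cope with lifting the speed clamp. */
static const struct bridge_id unclamp_ok[] = {
	{ 0x1b21, 0x2824 },	/* assumed: ASMedia ASM2824 */
};

/* Return true only for devices that have been listed as tested. */
static bool may_unclamp(uint16_t vendor, uint16_t device)
{
	for (size_t i = 0; i < sizeof(unclamp_ok) / sizeof(unclamp_ok[0]); i++)
		if (unclamp_ok[i].vendor == vendor &&
		    unclamp_ok[i].device == device)
			return true;
	return false;
}
```

 Any other bridge confirmed to work would simply get another entry in the 
table; everything unlisted keeps the clamped speed.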

 Matching both the downstream and the upstream device at the same time 
instead, once this quirk has triggered and succeeded, seems to make no 
sense: if the device downstream turns out affected, then it matches the 
behaviour observed, so it should be enough to have the upstream device 
checked.  I did want to run it at full speed anyway.

 OTOH matching the downstream device likely makes sense if the quirk has 
been bypassed, such as when the link speed has already been clamped by the 
firmware.  In this case we do not really know if the clamping has been 
triggered by this erratum or something else, so such a check would be 
justified.  I don't think it's going to matter for the problems discussed 
though.

 Apologies for the irregular replies, lots on my head right now and I had 
to write this all down properly.

  Maciej



