On Wed, 9 Jul 2025, Matthew W Carlis wrote:
> On Wed, 9 Jul 2025, Ilpo Järvinen wrote:
> > Are you saying there's still a problem in hpc? Since the introduction of
> > bwctrl, remove_board() in pciehp has had pcie_reset_lbms() (or its
> > equivalent).
>
> I think my concern with hpc or the current mechanism in general is that the
> condition is basically binary. Across a large fleet I expect to see momentary
> issues. For example a device might start to link up, have an issue & then
> try to link up again and from there be working correctly. However if that
> were to trigger an LBMS it might result in the quirk forcing the link to Gen1.
>
> For example if the quirk first guided the link to Gen1 & then if the device
> linked up at Gen1 it tried to guide it to Gen2 & then if it linked up at Gen2
> it continued towards the maximum speed, falling back down when it found the
> device not able to achieve a certain higher speed that would be more ideal.
> Or perhaps starting at the second highest speed & working its way down.
> It's quite a large fall in performance for a device to go from Gen4/5 to Gen1,
> whereas the ASMedia/Pericom combination was only capable of Gen2 as a pair.
> If the SI is marginal for Gen4/5 I would tend to think the device has a fairly
> high chance of being able to run at the next lower speed, for example.

This is possible but it also comes at a non-trivial latency cost; Link
Retraining is not very cheap.

Here you seem to be suggesting the TLS quirk might be useful for other
devices too, besides the one on the ID list. Is that the case? (I'm asking
because it contradicts the patch you're submitting.)

I don't know if other speeds are generally useful; intuition tells me they
might be. However, I've no way to gather numbers as I don't have the luxury
of a large fleet of machines with PCIe devices to observe/measure. Perhaps
you have some insight into this beyond just hypothesizing?

> Actually I also wonder in the case of the ASMedia & Pericom combination,
> would we just see an LBMS interrupt every time the device loops between
> speeds? Maybe the quirk should have been invoked by bwctrl.c when a certain
> rate of LBMS assertions is detected instead? Is it better to give a device
> a few chances or to catch it right away on the first issue? (some value
> judgements here)

I investigated this as it came up while bwctrl was under review and found
various issues and challenges related to the quirk. The main problem is
that bwctrl being a portdrv service means it probes quite late, so it's
not available very early, and the quirk runs mainly during that time. It
might be possible to delay bringing up a faulty device to work around
that; however, that's not the end of the challenges. One would need to
build a state machine to make such decisions as we don't want to keep
repeating it if the link is just broken. I lacked a way to test this in a
meaningful way so I just gave up and left it as future work. But yes, it
might be a workable solution that nobody has just written yet.

If you want to implement this, I'm certainly not against it. (I might even
consider writing one myself but that certainly isn't going to be a high
priority item for me, and the current level of detail is not concrete
enough to be realized on the code level.)
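To give a rough idea of the shape I mean, a step-down walk could look
something like the below. This is an entirely untested sketch, the function
itself is hypothetical, and I'm assuming bwctrl's pcie_set_target_speed()
(with its port/speed/use_lt parameters) and pcie_get_speed_cap() as the
underlying helpers:

/*
 * Illustrative only: instead of dropping straight to 2.5GT/s, walk
 * downward from the Port's supported maximum and settle on the first
 * speed at which the Link comes up (relies on the PCIE_SPEED_* enum
 * values being consecutive, which holds today).
 */
static int pcie_step_down_link_speed(struct pci_dev *port)
{
	enum pci_bus_speed speed;
	u16 lnksta;

	for (speed = pcie_get_speed_cap(port); speed > PCIE_SPEED_2_5GT;
	     speed--) {
		if (pcie_set_target_speed(port, speed, true))
			continue;	/* Retrain failed, try lower */

		pcie_capability_read_word(port, PCI_EXP_LNKSTA, &lnksta);
		if (lnksta & PCI_EXP_LNKSTA_DLLLA)
			return 0;	/* Link trained at this speed */
	}

	/* Last resort, same as what the quirk does today */
	return pcie_set_target_speed(port, PCIE_SPEED_2_5GT, true);
}

On top of that, the state machine would still need per-port bookkeeping of
how many times this has been attempted, so that a genuinely broken link
doesn't keep us retraining forever.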
> > As I already mentioned, for DPC I agree, it likely should reset LBMS
> > somewhere.
> ...
> > If you'd try this on different generations of Intel RP, you'd likely see
> > variations there too, that's my experience when testing bwctrl.
>
> Yes, agree about DPC, especially given that there are likely vendor/device
> specific variations in assertions of the bit. There is another patch that
> came into the DPC driver which suppresses surprise down error reporting
> which I would like to challenge/remove. My feeling is that the DPC driver
> should clear LBMS in all cases before clearing DPC status.

I suggest you make a patch to that effect.
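Something along these lines is what I'd imagine, right before the status
write in dpc_reset_link(). Untested, and the exact placement within that
function is a guess on my part:

	/*
	 * Containment may have left LBMS set by Link transitions that
	 * predate the error; forget them before clearing the DPC status
	 * so that a later hot-add doesn't trip the Target Speed quirk
	 * on stale state.
	 */
	pcie_reset_lbms(pdev);
	pci_write_config_word(pdev, cap + PCI_EXP_DPC_STATUS,
			      PCI_EXP_DPC_STATUS_TRIGGER);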
> >> Should it not matter how long ago LBMS
> >> was asserted before we invoke a TLS modification?
> >
> > To some extent, yes, which is why we call pcie_reset_lbms() in a few
> > places.
>
> Maybe there should even be a config or sysfs file to disable the quirk
> because it kind of takes control away from users in some ways, i.e. it
> doesn't obviously interact well with callers of setpci etc.

There's PCI_QUIRKS but that's probably not fine-grained enough to be
useful in practice as it takes away all quirks.

> >> I wonder if it shouldn't have to see some kind of actual link activity
> >> as a prereq to entering the quirk.
> >
> > How would you observe that "link activity"? Doesn't LBMS itself imply
> > "link activity" occurred?
>
> I was thinking literally not entering the quirk function unless the kernel
> had witnessed LNKSTA_DLLLA or LNKSTA_LT in the last second.

How can we track that condition? There's nothing that tracks DLLLA nor LT,
and we can't get an interrupt out of them either (AFAIK). So while it is
perhaps nice on a conceptual level, it would require polling those bits,
which doesn't look reasonable from an implementation point of view. Also,
I'm not convinced it would help your cases where you have short-term,
intermittent failures during bring up.

> Does this preclude us from declaring a device as "broken" as done by the
> quirk without having seen DLLLA within 1s after a DLLSC Event?
> * PCI Express Base Revision - 6.7.3.3 Data Link Layer State Changed Events
> "Software must allow 1 second after the Data Link Layer Link Active bit
> reads 1b before it is permitted to determine that a hot plugged device
> which fails to return a Successful Completion for a Valid Configuration
> Request is a broken device."

If you think there is a problem related to spec compliance here (there
well might be), patches are definitely welcome. I'm not sure from where
the quirk is called in this scenario and where/how the quirk logic
invocation can be delayed (unfortunately I won't have time to look at it
any time soon either).

> > > One thing that honestly doesn't make any sense to me is the ID list
> > > in the quirk. If the link comes up after forcing to Gen1 then it would
> > > only restore TLS if the device is the ASMedia switch, but also ignoring
> > > what device is detected downstream. If we allow ASMedia to restore the
> > > speed for any downstream device when we only saw the initial issue with
> > > the Pericom switch, then why do we exclude Intel Root Ports or AMD Root
> > > Ports or any other bridge from the list which did not have any issues
> > > reported?
> >
> > I think it's because the restore has been tested on that device
> > (whitelist).
> >
> > Your reasoning is based on the assumption that the TLS quirk setting Link
> > Speed to 2.5GT/s is part of "normal" operation. My view is that those
> > triggerings are caused by not clearing stale LBMS in the right places. If
> > LBMS is not wrongly kept, the quirk is a no-op on all but that ID listed
> > device.
>
> I'm making a slightly different assumption which is "something is working
> until proven otherwise". We only know that the restore works on the ASMedia
> when the downstream device is the Pericom switch. In fact we only know it
> works for a very specific layout & configuration of these two devices. It
> seems wrong in my mind to be more restrictive on devices that don't have a
> reported issue, but then be less restrictive on the devices that had an out
> of spec interaction in the first place. Until reported, we don't know how
> many devices might see LBMS get set during the course of linking up but
> then still arrive at the maximum speed.

I wonder if you could give my bwctrl tracing patch (below) a spin in some
case where such a problem shows up, as it could show what DLLLA/LT are when
the LNKSTA register is read in bwctrl's irq handler. I'm planning to submit
it eventually, but the placement of the tracing code has not been agreed
yet with the other person submitting hotplug tracing.
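FWIW, once the patch is applied the event appears under the pci group in
tracefs, so (assuming tracefs is mounted at the usual place) collecting
the data should just be:

  echo 1 > /sys/kernel/tracing/events/pci/pci_link_event/enable
  cat /sys/kernel/tracing/trace_pipe

Each BW notification then logs the device name, the current Link Speed and
Width, and which of LT/DLLLA/LBMS/LABS were set when LNKSTA was read.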
--
From e5d7bc850028a82823c2cbb822c3ba5edaa623c1 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Ilpo=20J=C3=A4rvinen?= <ilpo.jarvinen@xxxxxxxxxxxxxxx>
Date: Mon, 9 Jun 2025 20:29:29 +0300
Subject: [PATCH 1/1] PCI/bwctrl: Add trace event to BW notifications
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Frequent changes in the Link Speed or Width may indicate a PCIe
link-layer problem. The PCIe BW controller listens to BW notifications,
i.e., whenever LBMS (Link Bandwidth Management Status) and/or LABS
(Link Autonomous Bandwidth Status) is asserted to indicate the Link
Speed and/or Width was changed (PCIe spec. r6.2, sec. 7.5.3.7 &
7.5.3.8).

To help troubleshooting link related problems, add a trace event for
LBMS and LABS assertions.

I was (privately) asked to expose the LBMS count for this purpose while
bwctrl was under review. Lukas Wunner suggested, however, to use a trace
event instead to expose finer-grained details of the LBMS assertions
(namely, the timing of the assertions and link status details).

Suggested-by: Lukas Wunner <lukas@xxxxxxxxx>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@xxxxxxxxxxxxxxx>
---
 drivers/pci/Makefile       |  3 ++-
 drivers/pci/pci-trace.c    |  9 +++++++
 drivers/pci/pci.h          |  1 -
 drivers/pci/pcie/bwctrl.c  | 13 ++++++++++
 include/linux/pci.h        |  1 +
 include/trace/events/pci.h | 49 ++++++++++++++++++++++++++++++++++++++
 6 files changed, 74 insertions(+), 2 deletions(-)
 create mode 100644 drivers/pci/pci-trace.c
 create mode 100644 include/trace/events/pci.h

diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 67647f1880fb..49bd51b995cd 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -5,7 +5,8 @@
 obj-$(CONFIG_PCI)	+= access.o bus.o probe.o host-bridge.o \
 			   remove.o pci.o pci-driver.o search.o \
 			   rom.o setup-res.o irq.o vpd.o \
-			   setup-bus.o vc.o mmap.o devres.o
+			   setup-bus.o vc.o mmap.o devres.o \
+			   pci-trace.o
 
 obj-$(CONFIG_PCI) += msi/
 obj-$(CONFIG_PCI) += pcie/
diff --git a/drivers/pci/pci-trace.c b/drivers/pci/pci-trace.c
new file mode 100644
index 000000000000..99af6466447f
--- /dev/null
+++ b/drivers/pci/pci-trace.c
@@ -0,0 +1,9 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * PCI trace functions
+ *
+ * Copyright (C) 2025 Intel Corporation
+ */
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/pci.h>
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 12215ee72afb..8f1fffcda364 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -452,7 +452,6 @@ static inline int pcie_dev_speed_mbps(enum pci_bus_speed speed)
 }
 
 u8 pcie_get_supported_speeds(struct pci_dev *dev);
-const char *pci_speed_string(enum pci_bus_speed speed);
 void __pcie_print_link_status(struct pci_dev *dev, bool verbose);
 void pcie_report_downtraining(struct pci_dev *dev);
 
diff --git a/drivers/pci/pcie/bwctrl.c b/drivers/pci/pcie/bwctrl.c
index 36f939f23d34..7fb4e00f1e7a 100644
--- a/drivers/pci/pcie/bwctrl.c
+++ b/drivers/pci/pcie/bwctrl.c
@@ -20,6 +20,7 @@
 #define dev_fmt(fmt) "bwctrl: " fmt
 
 #include <linux/atomic.h>
+#include <linux/bitfield.h>
 #include <linux/bitops.h>
 #include <linux/bits.h>
 #include <linux/cleanup.h>
@@ -32,6 +33,8 @@
 #include <linux/slab.h>
 #include <linux/types.h>
 
+#include <trace/events/pci.h>
+
 #include "../pci.h"
 #include "portdrv.h"
 
@@ -208,6 +211,11 @@ static void pcie_bwnotif_disable(struct pci_dev *port)
 			   PCI_EXP_LNKCTL_LBMIE | PCI_EXP_LNKCTL_LABIE);
 }
 
+#define PCI_EXP_LNKSTA_LINK_STATUS_MASK	(PCI_EXP_LNKSTA_LBMS | \
+					 PCI_EXP_LNKSTA_LABS | \
+					 PCI_EXP_LNKSTA_LT | \
+					 PCI_EXP_LNKSTA_DLLLA)
+
 static irqreturn_t pcie_bwnotif_irq(int irq, void *context)
 {
 	struct pcie_device *srv = context;
@@ -236,6 +244,11 @@ static irqreturn_t pcie_bwnotif_irq(int irq, void *context)
 	 */
 	pcie_update_link_speed(port->subordinate);
 
+	trace_pci_link_event(port,
+			     link_status & PCI_EXP_LNKSTA_LINK_STATUS_MASK,
+			     pcie_link_speed[link_status & PCI_EXP_LNKSTA_CLS],
+			     FIELD_GET(PCI_EXP_LNKSTA_NLW, link_status));
+
 	return IRQ_HANDLED;
 }
 
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 05e68f35f392..8346121c035d 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -305,6 +305,7 @@ enum pci_bus_speed {
 	PCI_SPEED_UNKNOWN		= 0xff,
 };
 
+const char *pci_speed_string(enum pci_bus_speed speed);
 enum pci_bus_speed pcie_get_speed_cap(struct pci_dev *dev);
 enum pcie_link_width pcie_get_width_cap(struct pci_dev *dev);
 
diff --git a/include/trace/events/pci.h b/include/trace/events/pci.h
new file mode 100644
index 000000000000..c7187022cba5
--- /dev/null
+++ b/include/trace/events/pci.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2025 Intel Corporation
+ */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM pci
+
+#if !defined(_TRACE_PCI_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_PCI_H
+
+#include <linux/pci.h>
+#include <linux/tracepoint.h>
+
+#define LNKSTA_FLAGS \
+	{ PCI_EXP_LNKSTA_LT,	"LT"}, \
+	{ PCI_EXP_LNKSTA_DLLLA,	"DLLLA"}, \
+	{ PCI_EXP_LNKSTA_LBMS,	"LBMS"}, \
+	{ PCI_EXP_LNKSTA_LABS,	"LABS"}
+
+TRACE_EVENT(pci_link_event,
+	TP_PROTO(struct pci_dev *dev, u16 link_status,
+		 enum pci_bus_speed link_speed, u8 link_width),
+	TP_ARGS(dev, link_status, link_speed, link_width),
+
+	TP_STRUCT__entry(
+		__string(	name,		pci_name(dev))
+		__field(	u16,		link_status)
+		__field(	enum pci_bus_speed, link_speed)
+		__field(	u8,		link_width)
+	),
+
+	TP_fast_assign(
+		__assign_str(name);
+		__entry->link_status = link_status;
+		__entry->link_speed = link_speed;
+		__entry->link_width = link_width;
+	),
+
+	TP_printk("%s %s x%u st=%s",
+		  __get_str(name), pci_speed_string(__entry->link_speed),
+		  __entry->link_width,
+		  __print_flags((unsigned long)__entry->link_status, "|",
+				LNKSTA_FLAGS))
+);
+
+#endif /* _TRACE_PCI_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>

base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494
--
2.39.5