On Thu, 10 Apr 2025, Lukas Wunner wrote: > When a Secondary Bus Reset is issued at a hotplug port, it causes a Data > Link Layer State Changed event as a side effect. On hotplug ports using > in-band presence detect, it additionally causes a Presence Detect Changed > event. > > These spurious events should not result in teardown and re-enumeration of > the device in the slot. Hence commit 2e35afaefe64 ("PCI: pciehp: Add > reset_slot() method") masked the Presence Detect Changed Enable bit in the > Slot Control register during a Secondary Bus Reset. Commit 06a8d89af551 > ("PCI: pciehp: Disable link notification across slot reset") additionally > masked the Data Link Layer State Changed Enable bit. > > However masking those bits only disables interrupt generation (PCIe r6.2 > sec 6.7.3.1). The events are still visible in the Slot Status register > and picked up by the IRQ handler if it runs during a Secondary Bus Reset. > This can happen if the interrupt is shared or if an unmasked hotplug event > occurs, e.g. Attention Button Pressed or Power Fault Detected. > > The likelihood of this happening used to be small, so it wasn't much of a > problem in practice. That has changed with the recent introduction of > bandwidth control in v6.13-rc1 with commit 665745f27487 ("PCI/bwctrl: > Re-add BW notification portdrv as PCIe BW controller"): > > Bandwidth control shares the interrupt with PCIe hotplug. A Secondary Bus > Reset causes a Link Bandwidth Notification, so the hotplug IRQ handler > runs, picks up the masked events and tears down the device in the slot. > > As a result, Joel reports VFIO passthrough failure of a GPU, which Ilpo > root-caused to the incorrect handling of masked hotplug events. > > Clearly, a more reliable way is needed to ignore spurious hotplug events. > > For Downstream Port Containment, a new ignore mechanism was introduced by > commit a97396c6eb13 ("PCI: pciehp: Ignore Link Down/Up caused by DPC"). > It has been working reliably for the past four years. > > Adapt it for Secondary Bus Resets. > > Introduce two helpers to annotate code sections which cause spurious link > changes: pci_hp_ignore_link_change() and pci_hp_unignore_link_change() > Use those helpers in lieu of masking interrupts in the Slot Control > register. > > Introduce a helper to check whether such a code section is executing > concurrently and if so, await it: pci_hp_spurious_link_change() > Invoke the helper in the hotplug IRQ thread pciehp_ist(). Re-use the > IRQ thread's existing code which ignores DPC-induced link changes unless > the link is unexpectedly down after reset recovery or the device was > replaced during the bus reset. > > That code block in pciehp_ist() was previously only executed if a Data > Link Layer State Changed event has occurred. Additionally execute it for > Presence Detect Changed events. That's necessary for compatibility with > PCIe r1.0 hotplug ports because Data Link Layer State Changed didn't exist > before PCIe r1.1. DPC was added with PCIe r3.1 and thus DPC-capable > hotplug ports always support Data Link Layer State Changed events. > But the same cannot be assumed for Secondary Bus Reset, which already > existed in PCIe r1.0. > > Secondary Bus Reset is only one of many causes of spurious link changes. > Others include runtime suspend to D3cold, firmware updates or FPGA > reconfiguration. The new pci_hp_{,un}ignore_link_change() helpers may be > used by all kinds of drivers to annotate such code sections, hence their > declarations are publicly visible in <linux/pci.h>. A case in point is > the Mellanox Ethernet driver which disables a firmware reset feature if > the Ethernet card is attached to a hotplug port, see commit 3d7a3f2612d7 > ("net/mlx5: Nack sync reset request when HotPlug is enabled"). Going > forward, PCIe hotplug will be able to cope gracefully with all such use > cases once the code sections are properly annotated. > > The new helpers internally use two bits in struct pci_dev's priv_flags as > well as a wait_queue. This mirrors what was done for DPC by commit > a97396c6eb13 ("PCI: pciehp: Ignore Link Down/Up caused by DPC"). That may > be insufficient if spurious link changes are caused by multiple sources > simultaneously. An example might be a Secondary Bus Reset issued by AER > during FPGA reconfiguration. If this turns out to happen in real life, > support for it can easily be added by replacing the PCI_LINK_CHANGING flag > with an atomic_t counter incremented by pci_hp_ignore_link_change() and > decremented by pci_hp_unignore_link_change(). Instead of awaiting a zero > PCI_LINK_CHANGING flag, the pci_hp_spurious_link_change() helper would > then simply await a zero counter. > > Fixes: 665745f27487 ("PCI/bwctrl: Re-add BW notification portdrv as PCIe BW controller") > Reported-by: Joel Mathew Thomas <proxy0@xxxxxxxxxxxx> > Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219765 > Tested-by: Joel Mathew Thomas <proxy0@xxxxxxxxxxxx> > Signed-off-by: Lukas Wunner <lukas@xxxxxxxxx> > Cc: Ilpo Järvinen <ilpo.jarvinen@xxxxxxxxxxxxxxx> > --- > drivers/pci/hotplug/pci_hotplug_core.c | 69 ++++++++++++++++++++++++++++++++++ > drivers/pci/hotplug/pciehp_hpc.c | 35 ++++++----------- > drivers/pci/pci.h | 3 ++ > include/linux/pci.h | 8 ++++ > 4 files changed, 92 insertions(+), 23 deletions(-) > > diff --git a/drivers/pci/hotplug/pci_hotplug_core.c b/drivers/pci/hotplug/pci_hotplug_core.c > index d30f131..d8c5856 100644 > --- a/drivers/pci/hotplug/pci_hotplug_core.c > +++ b/drivers/pci/hotplug/pci_hotplug_core.c > @@ -492,6 +492,75 @@ void pci_hp_destroy(struct hotplug_slot *slot) > } > EXPORT_SYMBOL_GPL(pci_hp_destroy); > > +static DECLARE_WAIT_QUEUE_HEAD(pci_hp_link_change_wq); > + > +/** > + * pci_hp_ignore_link_change - begin code section causing spurious link changes > + * @pdev: PCI hotplug bridge > + * > + * Mark the beginning of a code section causing spurious link changes on the > + * Secondary Bus of @pdev, e.g. as a side effect of a Secondary Bus Reset, > + * D3cold transition, firmware update or FPGA reconfiguration. > + * > + * Hotplug drivers can thus check whether such a code section is executing > + * concurrently, await it with pci_hp_spurious_link_change() and ignore the > + * resulting link change events. > + * > + * Must be paired with pci_hp_unignore_link_change(). May be called both > + * from the PCI core and from Endpoint drivers. May be called for bridges > + * which are not hotplug-capable, in which case it has no effect because > + * no hotplug driver is bound to the bridge. > + */ > +void pci_hp_ignore_link_change(struct pci_dev *pdev) > +{ > + set_bit(PCI_LINK_CHANGING, &pdev->priv_flags); > + smp_mb__after_atomic(); /* pairs with implied barrier of wait_event() */ > +} > + > +/** > + * pci_hp_unignore_link_change - end code section causing spurious link changes > + * @pdev: PCI hotplug bridge > + * > + * Mark the end of a code section causing spurious link changes on the > + * Secondary Bus of @pdev. Must be paired with pci_hp_ignore_link_change(). > + */ > +void pci_hp_unignore_link_change(struct pci_dev *pdev) > +{ > + set_bit(PCI_LINK_CHANGED, &pdev->priv_flags); > + mb(); /* ensure pci_hp_spurious_link_change() sees either bit set */ > + clear_bit(PCI_LINK_CHANGING, &pdev->priv_flags); > + wake_up_all(&pci_hp_link_change_wq); This change should have added these: #include <linux/bitops.h> #include <linux/wait.h> #include <asm/barrier.h> -- i.