On Mon, Aug 04, 2025 at 06:10:48PM GMT, Sean Anderson wrote:
> On 8/4/25 16:57, Bjorn Helgaas wrote:
> > [+cc more folks who might be interested in AER with non-standard
> > interrupts]
> >
> > On Fri, Aug 01, 2025 at 01:43:19PM -0400, Sean Anderson wrote:
> >> Hi,
> >>
> >> AER correctable errors are pretty rare. I only saw one once before and
> >> came up with commit 78457cae24cb ("PCI: xilinx-nwl: Rate-limit misc
> >> interrupt messages") in response. I saw another today and,
> >> unfortunately, clearing the correctable AER bit in MSGF_MISC_STATUS is
> >> not sufficient to handle the IRQ. It gets immediately re-raised,
> >> preventing the system from making any other progress. I suspect that it
> >> needs to be cleared in PCI_ERR_ROOT_STATUS. But since the AER IRQ never
> >> gets delivered to aer_irq, those registers never get tickled.
> >>
> >> The underlying problem is that pcieport thinks that the IRQ is going to
> >> be one of the MSIs or a legacy interrupt, but it's actually a native
> >> interrupt:
> >>
> >>            CPU0       CPU1       CPU2       CPU3
> >>  42:          0          0          0          0  GICv2            150 Level  nwl_pcie:misc
> >>  45:          0          0          0          0  nwl_pcie:legacy    0 Level  PCIe PME, aerdrv
> >>  46:         25          0          0          0  nwl_pcie:msi  524288 Edge   nvme0q0
> >>  47:          0          0          0          0  nwl_pcie:msi  524289 Edge   nvme0q1
> >>  48:          0          0          0          0  nwl_pcie:msi  524290 Edge   nvme0q2
> >>  49:         46          0          0          0  nwl_pcie:msi  524291 Edge   nvme0q3
> >>  50:          0          0          0          0  nwl_pcie:msi  524292 Edge   nvme0q4
> >>
> >> In the above example, AER errors will trigger interrupt 42, not 45.
> >> Actually, there are a bunch of different interrupts in MSGF_MISC_STATUS,
> >> so maybe nwl_pcie_misc_handler should be an interrupt controller
> >> instead? But even then pcie_port_enable_irq_vec() won't figure out the
> >> correct IRQ. Any ideas on how to fix this?
> >>
> >> Additionally, any tips on actually triggering AER/PME stuff in a
> >> consistent way? Are there any off-the-shelf cards for sending weird PCIe
> >> stuff over a link for testing? Right now all I have
> >
> > This is definitely a problem. We have had some discussion about this
> > in the past, but haven't quite achieved critical mass to solve this in
> > a generic way. Here are some links:
> >
> >   https://lore.kernel.org/linux-pci/20250702223841.GA1905230@bhelgaas/t/#u
> >   https://lore.kernel.org/linux-pci/1464242406-20203-1-git-send-email-po.liu@xxxxxxx/
>
> Thanks for the links. Toggling PERST does seem to reliably cause
> correctable errors (however "correctable" they may actually be in
> practice). With the patch I posted on the other branch of this chain I
> now get
>
> [   43.041610] pcieport 0000:00:00.0: AER: Multiple Corrected error message received from 0000:00:00.0
> [   43.050693] pcieport 0000:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
> [   43.061477] pcieport 0000:00:00.0:   device [10ee:d011] error status/mask=00000001/0000e000
> [   43.069842] pcieport 0000:00:00.0:    [ 0] RxErr
>
> Whether or not that's the right fix, at least I can test things :)

Could you please check if INTX is working for AER? You can just pass the
cmdline parameter, "pcie_pme=nomsi" and observe if the IRQ is getting
triggered. We have a desire to add platform IRQs for AER, but before doing
that we need to make sure that the platform doesn't support both MSI and
INTx.

- Mani

-- 
மணிவண்ணன் சதாசிவம்