On 8/1/25 13:43, Sean Anderson wrote:
> Hi,
>
> AER correctable errors are pretty rare. I only saw one once before and
> came up with commit 78457cae24cb ("PCI: xilinx-nwl: Rate-limit misc
> interrupt messages") in response. I saw another today and,
> unfortunately, clearing the correctable AER bit in MSGF_MISC_STATUS is
> not sufficient to handle the IRQ. It gets immediately re-raised,
> preventing the system from making any other progress. I suspect that it
> needs to be cleared in PCI_ERR_ROOT_STATUS. But since the AER IRQ never
> gets delivered to aer_irq, those registers never get tickled.
>
> The underlying problem is that pcieport thinks that the IRQ is going to
> be one of the MSIs or a legacy interrupt, but it's actually a native
> interrupt:
>
>            CPU0       CPU1       CPU2       CPU3
>  42:          0          0          0          0     GICv2 150 Level     nwl_pcie:misc
>  45:          0          0          0          0  nwl_pcie:legacy   0 Level     PCIe PME, aerdrv
>  46:         25          0          0          0  nwl_pcie:msi 524288 Edge      nvme0q0
>  47:          0          0          0          0  nwl_pcie:msi 524289 Edge      nvme0q1
>  48:          0          0          0          0  nwl_pcie:msi 524290 Edge      nvme0q2
>  49:         46          0          0          0  nwl_pcie:msi 524291 Edge      nvme0q3
>  50:          0          0          0          0  nwl_pcie:msi 524292 Edge      nvme0q4
>
> In the above example, AER errors will trigger interrupt 42, not 45.
> Actually, there are a bunch of different interrupts in MSGF_MISC_STATUS,
> so maybe nwl_pcie_misc_handler should be an interrupt controller
> instead? But even then pcie_port_enable_irq_vec() won't figure out the
> correct IRQ. Any ideas on how to fix this?

OK, so as a first pass, maybe something like

	if (misc_stat & (MSGF_MISC_SR_FATAL_AER |
			 MSGF_MISC_SR_NON_FATAL_AER |
			 MSGF_MISC_SR_CORR_AER))
		generic_handle_domain_irq(pcie->legacy_irq_domain, 0);

to simulate the correct IRQ. I have no idea whether it's safe to call
generic_handle_domain_irq in this context. It wasn't OK for AER (see
commit 9ae052253785 ("PCI/AER: Fix the broken interrupt injection")),
but maybe it's OK for us since the legacy irqchip doesn't support
affinity? I CC'd Thomas and maybe he can comment.
Otherwise, maybe the best thing is to just add an API to manually
trigger AER.

> Additionally, any tips on actually triggering AER/PME stuff in a
> consistent way? Are there any off-the-shelf cards for sending weird PCIe
> stuff over a link for testing? Right now all I have

But I still don't know how to test this. I can inject a misc interrupt
since the GIC supports irq_set_irqchip_state, but that won't really
simulate an AER interrupt since MSGF_MISC_STATUS won't have the right
bit set. Maybe I can wiggle a card around in its slot? Maybe PME or link
bandwidth notification could trigger this as well?

--Sean
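
To make the testing problem concrete, here is a rough userspace model of
the status check proposed above. It is only a sketch: the bit positions
are assumptions and should be checked against the driver's definitions,
and misc_stat_has_aer() is a hypothetical helper, not something in the
driver. It only models the decision of whether the misc handler would
forward the interrupt to the legacy domain.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions; verify against the driver before relying on them. */
#define MSGF_MISC_SR_FATAL_AER		(UINT32_C(1) << 16)
#define MSGF_MISC_SR_NON_FATAL_AER	(UINT32_C(1) << 17)
#define MSGF_MISC_SR_CORR_AER		(UINT32_C(1) << 18)

#define MSGF_MISC_SR_AER_MASK	(MSGF_MISC_SR_FATAL_AER | \
				 MSGF_MISC_SR_NON_FATAL_AER | \
				 MSGF_MISC_SR_CORR_AER)

/*
 * Hypothetical helper: returns true if the given MSGF_MISC_STATUS value
 * has any AER bit set, i.e. if the proposed check would forward the
 * interrupt to the legacy domain.
 */
static bool misc_stat_has_aer(uint32_t misc_stat)
{
	return (misc_stat & MSGF_MISC_SR_AER_MASK) != 0;
}
```

With a status of 0, which is what the handler would see if the IRQ were
only made pending at the GIC, misc_stat_has_aer() returns false, so the
forwarding path would never run. It returns true only once one of the
three AER bits is set in the status value.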