Hello Mani, On Fri, Aug 29, 2025 at 09:44:08PM +0530, Manivannan Sadhasivam wrote: > On Fri, Aug 15, 2025 at 11:07:42AM GMT, Niklas Cassel wrote: (snip) > > > > > ## On EP side: > > > > > # echo 0 > /sys/kernel/config/pci_ep/controllers/a40000000.pcie-ep/start && \ > > > > > sleep 0.1 && echo 1 > /sys/kernel/config/pci_ep/controllers/a40000000.pcie-ep/start > > > > > > > > > > Basically all tests timeout > > > > > # FAILED: 1 / 16 tests passed. > > > > > > > > > > Which is the same as before this patch series. > > > > > > This is kind of expected since the pci_endpoint_test driver doesn't have the AER > > > err_handlers defined. > > > > I see. > > Would be nice if we could add them then, so that we can verify that this > > series is working as intended. (snip) > Ok, thanks for the logs. I guess what is happening here is that we are not > saving/restoring the config space of the devices under the Root Port if linkdown > is happens. TBH, we cannot do that from the PCI core since once linkdown > happens, we cannot access any devices underneath the Root Port. But if > err_handlers are available for drivers for all devices, they could do something > smart like below: > > diff --git a/drivers/misc/pci_endpoint_test.c b/drivers/misc/pci_endpoint_test.c > index c4e5e2c977be..9aabf1fe902e 100644 > --- a/drivers/misc/pci_endpoint_test.c > +++ b/drivers/misc/pci_endpoint_test.c > @@ -989,6 +989,8 @@ static int pci_endpoint_test_probe(struct pci_dev *pdev, > > pci_set_drvdata(pdev, test); > > + pci_save_state(pdev); > + > id = ida_alloc(&pci_endpoint_test_ida, GFP_KERNEL); > if (id < 0) { > ret = id; > @@ -1140,12 +1142,31 @@ static const struct pci_device_id pci_endpoint_test_tbl[] = { > }; > MODULE_DEVICE_TABLE(pci, pci_endpoint_test_tbl); > > +static pci_ers_result_t pci_endpoint_test_error_detected(struct pci_dev *pdev, > + pci_channel_state_t state) > +{ > + return PCI_ERS_RESULT_NEED_RESET; > +} > + > +static pci_ers_result_t pci_endpoint_test_slot_reset(struct pci_dev *pdev) > +{ > + pci_restore_state(pdev); > + > + return PCI_ERS_RESULT_RECOVERED; > +} > + > +static const struct pci_error_handlers pci_endpoint_test_err_handler = { > + .error_detected = pci_endpoint_test_error_detected, > + .slot_reset = pci_endpoint_test_slot_reset, > +}; > + > static struct pci_driver pci_endpoint_test_driver = { > .name = DRV_MODULE_NAME, > .id_table = pci_endpoint_test_tbl, > .probe = pci_endpoint_test_probe, > .remove = pci_endpoint_test_remove, > .sriov_configure = pci_sriov_configure_simple, > + .err_handler = &pci_endpoint_test_err_handler, > }; > module_pci_driver(pci_endpoint_test_driver); > > This essentially saves the good known config space during probe and restores it > during the slot_reset callback. Ofc, the state would've been overwritten if > suspend/resume happens in-between, but the point I'm making is that unless all > device drivers restore their known config space, devices cannot be resumed > properly post linkdown recovery. > > I can add a patch based on the above diff in next revision if that helps. Right > now, I do not have access to my endpoint test setup. So can't test anything. I tested your patch series + your suggested change above, and after a: ## On EP side: # echo 0 > /sys/kernel/config/pci_ep/controllers/a40000000.pcie-ep/start && \ sleep 0.1 && echo 1 > /sys/kernel/config/pci_ep/controllers/a40000000.pcie-ep/start Instead of: # FAILED: 1 / 16 tests passed. I now get: # FAILED: 7 / 16 tests passed. Test cases 1-7 now passes (the test cases related to BARs), all other test cases still fail: # /pcitest TAP version 13 1..16 # Starting 16 tests from 9 test cases. # RUN pci_ep_bar.BAR0.BAR_TEST ... # OK pci_ep_bar.BAR0.BAR_TEST ok 1 pci_ep_bar.BAR0.BAR_TEST # RUN pci_ep_bar.BAR1.BAR_TEST ... # OK pci_ep_bar.BAR1.BAR_TEST ok 2 pci_ep_bar.BAR1.BAR_TEST # RUN pci_ep_bar.BAR2.BAR_TEST ... # OK pci_ep_bar.BAR2.BAR_TEST ok 3 pci_ep_bar.BAR2.BAR_TEST # RUN pci_ep_bar.BAR3.BAR_TEST ... # OK pci_ep_bar.BAR3.BAR_TEST ok 4 pci_ep_bar.BAR3.BAR_TEST # RUN pci_ep_bar.BAR4.BAR_TEST ... # SKIP BAR is disabled # OK pci_ep_bar.BAR4.BAR_TEST ok 5 pci_ep_bar.BAR4.BAR_TEST # SKIP BAR is disabled # RUN pci_ep_bar.BAR5.BAR_TEST ... # OK pci_ep_bar.BAR5.BAR_TEST ok 6 pci_ep_bar.BAR5.BAR_TEST # RUN pci_ep_basic.CONSECUTIVE_BAR_TEST ... # OK pci_ep_basic.CONSECUTIVE_BAR_TEST ok 7 pci_ep_basic.CONSECUTIVE_BAR_TEST # RUN pci_ep_basic.LEGACY_IRQ_TEST ... # pci_endpoint_test.c:106:LEGACY_IRQ_TEST:Expected 0 (0) == ret (-110) # pci_endpoint_test.c:106:LEGACY_IRQ_TEST:Test failed for Legacy IRQ # LEGACY_IRQ_TEST: Test failed # FAIL pci_ep_basic.LEGACY_IRQ_TEST not ok 8 pci_ep_basic.LEGACY_IRQ_TEST # RUN pci_ep_basic.MSI_TEST ... # pci_endpoint_test.c:118:MSI_TEST:Expected 0 (0) == ret (-110) # pci_endpoint_test.c:118:MSI_TEST:Test failed for MSI1 # pci_endpoint_test.c:118:MSI_TEST:Expected 0 (0) == ret (-110) # pci_endpoint_test.c:118:MSI_TEST:Test failed for MSI2 # pci_endpoint_test.c:118:MSI_TEST:Expected 0 (0) == ret (-110) # pci_endpoint_test.c:118:MSI_TEST:Test failed for MSI3 ... I think I know the reason.. you save the state before the IRQs have been allocated. Perhaps we need to save the state after enabling IRQs? I tried this patch on top of your patch: --- a/drivers/misc/pci_endpoint_test.c +++ b/drivers/misc/pci_endpoint_test.c @@ -851,6 +851,8 @@ static int pci_endpoint_test_set_irq(struct pci_endpoint_test *test, return ret; } + pci_save_state(pdev); + return 0; } But still: # FAILED: 7 / 16 tests passed. So... apparently that did not help... I tried with the following change as well (on top of my patch above): +static pci_ers_result_t pci_endpoint_test_slot_reset(struct pci_dev *pdev) +{ + struct pci_endpoint_test *test = pci_get_drvdata(pdev); + int irq_type = test->irq_type; + + pci_restore_state(pdev); + + if (irq_type != PCITEST_IRQ_TYPE_UNDEFINED) { + pci_endpoint_test_clear_irq(test); + pci_endpoint_test_set_irq(test, irq_type); + } + + return PCI_ERS_RESULT_RECOVERED; +} But still only: # FAILED: 7 / 16 tests passed. Do you have any suggestions? Kind regards, Niklas