On Fri, Jul 11, 2025 at 06:13:01AM +0200, Lukas Wunner wrote: > On Fri, Jul 11, 2025 at 11:20:15AM +0800, Hongbo Yao wrote: > > 2025/7/8 1:04, Sathyanarayanan Kuppuswamy: > > > On 7/7/25 3:30 AM, Andy Xu wrote: > > > > Setting timeout to 7s covers both devices with safety margin. > > > > > > Instead of updating the recovery time, can you check why your device > > > recovery takes > > > such a long time and how to fix it from the device end? > > > > I fully agree that ideally the root cause should be addressed on the > > device side to reduce the DPC recovery latency, and that waiting longer > > in the kernel is not a perfect solution. > > > > However, the current 4 seconds timeout in pci_dpc_recovered() is indeed > > an empirical value rather than a hard requirement from the PCIe > > specification. In real-world scenarios, like with Mellanox ConnectX-5/7 > > adapters, we've observed that full DPC recovery can take more than 5-6 > > seconds, which leads to premature hotplug processing and device removal. > > I think Sathya's point was: Have you made an effort to talk to the > vendor and ask them to root-cause and fix the issue e.g. with a firmware > update. Would definitely be great, but unless we have a number in the spec to point to, they might just shrug and ask what the requirement is. > > To improve robustness and maintain flexibility, I???m considering > > introducing a module parameter to allow tuning the DPC recovery timeout > > dynamically. Would you like me to prepare and submit such a patch for > > review? > > We try to avoid adding new module parameters. Things should just work > out of the box without the user having to adjust the kernel command > line for their system. > > So the solution is indeed to either adjust the delay for everyone > (as you've done) or introduce an unsigned int to struct pci_dev > which can be assigned the delay after reset for the device to be > responsive. > > For comparison, we're allowing up to 60 sec for devices to become > available after a Fundamental Reset or Conventional Reset > (PCIE_RESET_READY_POLL_MS). That's how long we're waiting in > dpc_reset_link() -> pci_bridge_wait_for_secondary_bus() and > we're not consistent with that when we wait only 4 sec in > pci_dpc_recovered(). > > I think the reason is that we weren't really sure whether this approach > to synchronize hotplug with DPC works well and how to choose delays. > But we've had this for a few years now and it seems to have worked nicely > for people. I think this is the first report where it's not been > working out of the box. Why would we wait less than PCIE_RESET_READY_POLL_MS? DPC disables the link, so that's basically a reset for the device. Seems like we should allow as much time as we do for any other kind of reset. Bjorn