Re: [PATCH] PCI/DPC: Extend DPC recovery timeout

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Jul 11, 2025 at 06:13:01AM +0200, Lukas Wunner wrote:
> On Fri, Jul 11, 2025 at 11:20:15AM +0800, Hongbo Yao wrote:
> > 2025/7/8 1:04, Sathyanarayanan Kuppuswamy:
> > > On 7/7/25 3:30 AM, Andy Xu wrote:
> > > > Setting timeout to 7s covers both devices with safety margin.
> > > 
> > > Instead of updating the recovery time, can you check why your device
> > > recovery takes
> > > such a long time and how to fix it from the device end?
> > 
> > I fully agree that ideally the root cause should be addressed on the
> > device side to reduce the DPC recovery latency, and that waiting longer
> > in the kernel is not a perfect solution.
> > 
> > However, the current 4 seconds timeout in pci_dpc_recovered() is indeed
> > an empirical value rather than a hard requirement from the PCIe
> > specification. In real-world scenarios, like with Mellanox ConnectX-5/7
> > adapters, we've observed that full DPC recovery can take more than 5-6
> > seconds, which leads to premature hotplug processing and device removal.
> 
> I think Sathya's point was:  Have you made an effort to talk to the
> vendor and ask them to root-cause and fix the issue e.g. with a firmware
> update.

Would definitely be great, but unless we have a number in the spec to
point to, they might just shrug and ask what the requirement is.

> > To improve robustness and maintain flexibility, I???m considering
> > introducing a module parameter to allow tuning the DPC recovery timeout
> > dynamically. Would you like me to prepare and submit such a patch for
> > review?
> 
> We try to avoid adding new module parameters.  Things should just work
> out of the box without the user having to adjust the kernel command
> line for their system.
> 
> So the solution is indeed to either adjust the delay for everyone
> (as you've done) or introduce an unsigned int to struct pci_dev
> which can be assigned the delay after reset for the device to be
> responsive.
> 
> For comparison, we're allowing up to 60 sec for devices to become
> available after a Fundamental Reset or Conventional Reset
> (PCIE_RESET_READY_POLL_MS).  That's how long we're waiting in
> dpc_reset_link() -> pci_bridge_wait_for_secondary_bus() and
> we're not consistent with that when we wait only 4 sec in
> pci_dpc_recovered().
> 
> I think the reason is that we weren't really sure whether this approach
> to synchronize hotplug with DPC works well and how to choose delays.
> But we've had this for a few years now and it seems to have worked nicely
> for people.  I think this is the first report where it's not been
> working out of the box.

Why would we wait less than PCIE_RESET_READY_POLL_MS?  DPC disables
the link, so that's basically a reset for the device.  Seems like we
should allow as much time as we do for any other kind of reset.

Bjorn




[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux