Re: [PATCH net] net/mlx5: Avoid deadlock between PCI error recovery and health reporter

On 07/08/2025 16:11, Gerd Bayer wrote:

During error recovery testing, a pair of tasks was reported to be hung
due to a deadlock:

- mlx5_unload_one() trying to acquire devlink lock while the PCI error
   recovery code had acquired the pci_cfg_access_lock().

Could you please add traces here?
I looked at the code and didn't see where pci_cfg_access_lock() is
taken...

Thanks!

- mlx5_crdump_collect() trying to acquire the pci_cfg_access_lock()
   while devlink_health_report() had acquired the devlink lock.

Move the check for pci_channel_offline prior to acquiring the
pci_cfg_access_lock in mlx5_vsc_gw_lock since collecting the crdump will
fail anyhow while PCI error recovery is running.

Fixes: 33afbfcc105a ("net/mlx5: Stop waiting for PCI if pci channel is offline")
Signed-off-by: Gerd Bayer <gbayer@xxxxxxxxxxxxx>
---

Hi all,

while the initial hit was recorded during "random" testing, where PCI
error recovery and poll_health() tripped almost simultaneously, I found
a way to reproduce a very similar hang at will on s390:

Inject a PCI error recovery event on a Physical Function <BDF> with
   zpcictl --reset-fw <BDF>

then request a crdump with
   devlink health dump show pci/<BDF> reporter fw_fatal

With the patch applied I didn't get the hang but kernel logs showed:
[  792.885743] mlx5_core 000a:00:00.0: mlx5_crdump_collect:51:(pid 1415): crdump: failed to lock vsc gw err -13

and the crdump request ended with:
kernel answers: Permission denied
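
For readers following the reordering in the diff below, here is a minimal userspace sketch of the "check before lock" pattern. It is not the mlx5 code: `struct fake_dev`, `gw_lock_check_first()` and `gw_unlock()` are hypothetical names, a pthread mutex stands in for pci_cfg_access_lock(), and a plain flag stands in for pci_channel_offline(). It only shows that the caller returns -EACCES (the -13 in the log above) without ever touching the lock once the channel is marked offline. It does not model a channel that goes offline after the check.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel state; not the mlx5 or PCI core API. */
struct fake_dev {
	pthread_mutex_t cfg_access;	/* models pci_cfg_access_lock() */
	bool channel_offline;		/* models pci_channel_offline() */
};

/*
 * Check the channel state first, as the patch does, so that a caller
 * never blocks on cfg_access while recovery holds it.
 * Returns 0 with cfg_access held, or -EACCES with nothing held.
 */
static int gw_lock_check_first(struct fake_dev *dev)
{
	if (dev->channel_offline)
		return -EACCES;

	pthread_mutex_lock(&dev->cfg_access);
	return 0;
}

/* Release cfg_access after a successful gw_lock_check_first(). */
static void gw_unlock(struct fake_dev *dev)
{
	int rc = pthread_mutex_unlock(&dev->cfg_access);

	assert(rc == 0);
	(void)rc;
}
```

In this model, a caller that sees -EACCES gives up immediately, which corresponds to the "Permission denied" answer above rather than a hang.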

Thanks, Gerd
---
  drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c | 7 +++----
  1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
index 432c98f2626d..d2d3b57a57d5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
@@ -73,16 +73,15 @@ int mlx5_vsc_gw_lock(struct mlx5_core_dev *dev)
         u32 lock_val;
         int ret;

+       if (pci_channel_offline(dev->pdev))
+               return -EACCES;
+
         pci_cfg_access_lock(dev->pdev);
         do {
                 if (retries > VSC_MAX_RETRIES) {
                         ret = -EBUSY;
                         goto pci_unlock;
                 }
-               if (pci_channel_offline(dev->pdev)) {
-                       ret = -EACCES;
-                       goto pci_unlock;
-               }

                 /* Check if semaphore is already locked */
                 ret = vsc_read(dev, VSC_SEMAPHORE_OFFSET, &lock_val);
--
2.48.1





