Search Linux Wireless

Re: [PATCH] ath11k: pci: avoid unsafe register access during shutdown

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Aug 12, 2025 at 4:21 AM Baochen Qiang
<baochen.qiang@xxxxxxxxxxxxxxxx> wrote:
>
>
>
> On 8/11/2025 8:26 PM, Balsam Chihi wrote:
> > On Mon, Aug 11, 2025 at 1:00 PM Baochen Qiang
> > <baochen.qiang@xxxxxxxxxxxxxxxx> wrote:
> >>
> >>
> >>
> >> On 8/11/2025 5:08 PM, Balsam Chihi wrote:
> >>> On Fri, Jul 25, 2025 at 5:25 AM Baochen Qiang
> >>> <baochen.qiang@xxxxxxxxxxxxxxxx> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 6/12/2025 12:40 AM, balsam.chihi@xxxxxxxxxxx wrote:
> >>>>> From: Balsam CHIHI <balsam.chihi@xxxxxxxxxxx>
> >>>>>
> >>>
> >>> Hello, sorry for the delay. I was OoO.
> >>>
> >>>>> During system reboot or module removal (rmmod), a crash was observed
> >>>>> due to a synchronous external abort caused by invalid register access
> >>>>> in ath11k_pci_clear_dbg_registers(). This happens when the device is
> >>>>> already powered off and the driver attempts to read debug registers.
> >>>>
> >>>> What does 'powered off' mean exactly here? is WLAN powered by an internal/external supply
> >>>> that will be removed during rmmod/reboot?
> >>>
> >>> The "powered off" means that the wlan card is rmmod or reboot.
> >>> It is not powered by external supply.
> >>
> >> OK.
> >>
> >>>
> >>>>
> >>>>>
> >>>>> The crash log shows:
> >>>>> root@OpenWrt:~# reboot
> >>>>> [  343.663492] Internal error: synchronous external abort:
> >>>>> 0000000096000210 [#1] SMP
> >>>>> [  343.670992] Modules linked in: [...]
> >>>>> [  343.842432] CPU: 7 PID: 9435 Comm: procd Tainted: G O 6.6.86 #0
> >>>>> [  343.849746] Hardware name: LS1088A
> >>>>> [  343.856969] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> >>>>> [  343.863933] pc : ath11k_pci_get_msi_irq+0x18a0/0x1900 [ath11k_pci]
> >>>>
> >>>> I don't see any relationship between ath11k_pci_clear_dbg_registers() and this PC pointer.
> >>>> So how did you locate the problematic register access then?
> >>>
> >>> I tyied to look for the problem based the PC pointer but I was not
> >>> abled to find it.
> >>> So, I added prints in ath11k driver to find out where it stucks,
> >>> classical debug method.
> >>
> >> Yeah, classical and always work.
> >>
> >>> Then I found out that any "ath11k_pcic_read32()" call fails on rmmod
> >>> ath11k_pci or reboot,
> >>
> >> I worry PCIe link is not stable on your platform when rmmod/reboot. Can you try with ASPM
> >> disabled?
> >
> > ASPM is indeed disabled in the Kernel config.
> > I will also try with your patch and get back to you with the results.
> > And in the meanwhile, I would like to let you know the following information :
> > Other PCIe devices are working well even with ASPM enabled in the same slot,
> > and the issue is present in all PCIe slots.
>
> Thanks, the info helps.
>
> > I have many board variants of the same platform that showed the same output.
> > Maybe it is only related to the Qualcomm QCN9074 WiFi6E chipset?
>
> I think so.
>
> > I tried 2 wifi cards from two manufacturers of the same wifi chip and
> > the issue is the same.
>
> Do you have a WCN6855 WiFi chip to try?

I don't have any other WiFi card right now.
Only QCN9074 from different manufacturers behaved identically.

>
> > Is it related to the SoC Freescale Layerscape LS1088A?
>
> I never heard of such issues before. For now I am suspecting that your SoC PCIe RC behaves
> differently from other platforms, like X86.
>  > It seems like the kernel is accessing a freed resource or no longer
> > existing memory address because the wifi card is turned off.
>
> WiFi chip might enter a state where access to some specific registers could not be
> completed, hence the 'abort' error.
>
> >
> >
> >>
> >> --- a/drivers/net/wireless/ath/ath11k/pci.c
> >> +++ b/drivers/net/wireless/ath/ath11k/pci.c
> >> @@ -605,11 +605,13 @@ static void ath11k_pci_aspm_disable(struct ath11k_pci *ab_pci)
> >>
> >>  static void ath11k_pci_aspm_restore(struct ath11k_pci *ab_pci)
> >>  {
> >> +#if 0
> >>         if (test_and_clear_bit(ATH11K_PCI_ASPM_RESTORE, &ab_pci->flags))
> >>                 pcie_capability_clear_and_set_word(ab_pci->pdev, PCI_EXP_LNKCTL,
> >>                                                    PCI_EXP_LNKCTL_ASPMC,
> >>                                                    ab_pci->link_ctl &
> >>                                                    PCI_EXP_LNKCTL_ASPMC);
> >> +#endif
> >>  }
> >>
> >>  #ifdef CONFIG_DEV_COREDUMP
> >>
>
> Anyway, please try with above change. In addition you can manually confirm ASPM status in
> LinkCtrl register with lspci tool, on my machine it is like:
>
> $ sudo lspci -s 02:00.0 -vv
> 02:00.0 Network controller: Qualcomm Device 1107 (rev 01)
>         [...]
>         Capabilities: [70] Express (v2) Endpoint, MSI 00
>                 [...]
>                 LnkCtl: ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
>                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>                 [...]
>

# lspci -s 0000:01:00.0 -vv
0000:01:00.0 Network controller: Qualcomm Technologies, Inc
QCN6024/9024/9074 Wireless Network Adapter (rev 01)
Subsystem: Qualcomm Technologies, Inc QCN6024/9024/9074 Wireless Network Adapter
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
LnkCtl: ASPM Disabled; RCB 64 bytes, LnkDisable- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-

I removed my patch, applied yours, and the issue came back.

According to dmesg logs, ath11k_pci_aspm_disable() is called on probe
and on remove.
The crash happens after the ath11k_pci_aspm_disable() call when rmmod or reboot.

>
> >>
> >>> but they work fine when insmod/boot ath11k_pci.
> >>>
> >>>>
> >>>>> [  343.870125] lr : ath11k_pcic_init_msi_config+0x98/0xc4 [ath11k]
> >>>>> [...]
> >>>>> [  343.950800] Call trace:
> >>>>> [  343.953241]  ath11k_pci_get_msi_irq+0x18a0/0x1900 [ath11k_pci]
> >>>>> [  343.959080]  ath11k_pcic_init_msi_config+0x98/0xc4 [ath11k]
> >>>>> [  343.964671]  ath11k_pcic_read32+0x30/0xb4 [ath11k]
> >>>>> [  343.969477]  ath11k_pci_get_msi_irq+0x528/0x1900 [ath11k_pci]
> >>>>> [  343.975230]  ath11k_pci_get_msi_irq+0x1460/0x1900 [ath11k_pci]
> >>>>> [  343.981068]  ath11k_pci_get_msi_irq+0x1750/0x1900 [ath11k_pci]
> >>>>> [  343.986906]  pci_device_shutdown+0x34/0x44
> >>>>> [  343.991004]  device_shutdown+0x160/0x268
> >>>>> [  343.994928]  kernel_restart+0x40/0xc0
> >>>>> [  343.998594]  __do_sys_reboot+0x104/0x23c
> >>>>> [  344.002518]  __arm64_sys_reboot+0x24/0x30
> >>>>> [  344.006529]  do_el0_svc+0x6c/0x100
> >>>>> [  344.009931]  el0_svc+0x28/0x9c
> >>>>> [  344.012986]  el0t_64_sync_handler+0x120/0x12c
> >>>>> [  344.017344]  el0t_64_sync+0x178/0x17c
> >>>>> [  344.021009] Code: f94e0a80 92404a73 91420273 8b130013 (b9400273)
> >>>>> [  344.027102] ---[ end trace 0000000000000000 ]---
> >>>>>
> >>>>> This patch adds a `power_on` flag to conditionally skip debug register
> >>>>> reads when the device is not powered, preventing invalid memory access.
> >>>>
> >>>> why is only 'read' concerned here?
> >>>
> >>> Good question, I can't explain why only the reading fails but not the writing.
> >>>
> >>>>
> >>>> You are skipping several 'read' in this change, could you help narrow down which
> >>>> individual read instance causes this?
> >>>>
> >>>
> >>> All read calls fail on rmmod/reboot, when
> >>> ath11k_pci_clear_dbg_registers() is called.
> >>> But they all succeed when insmod/boot ath11k_pci.
> >>>
> >>>>>
> >>>>> With this change, the system no longer crashes on reboot or rmmod,
> >>>>> and the driver continues to function correctly when reloaded.
> >>>>>
> >>>>> Tested-on: QCN9074 hw2.0 PCIe on LS1088A
> >>>>> Tested-on: OpenWrt 24.10.1
> >>>>> Fixes: crash on reboot/rmmod due to invalid register access
> >>>>> Signed-off-by: Balsam CHIHI <balsam.chihi@xxxxxxxxxxx>
> >>>>> ---
> >>>>>  drivers/net/wireless/ath/ath11k/pci.c | 36 ++++++++++++++++-----------
> >>>>>  1 file changed, 21 insertions(+), 15 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/net/wireless/ath/ath11k/pci.c b/drivers/net/wireless/ath/ath11k/pci.c
> >>>>> index 78444f8ea153..0c1b55957557 100644
> >>>>> --- a/drivers/net/wireless/ath/ath11k/pci.c
> >>>>> +++ b/drivers/net/wireless/ath/ath11k/pci.c
> >>>>> @@ -203,16 +203,18 @@ static void ath11k_pci_soc_global_reset(struct ath11k_base *ab)
> >>>>>               ath11k_warn(ab, "link down error during global reset\n");
> >>>>>  }
> >>>>>
> >>>>> -static void ath11k_pci_clear_dbg_registers(struct ath11k_base *ab)
> >>>>> +static void ath11k_pci_clear_dbg_registers(struct ath11k_base *ab, bool power_on)
> >>>>>  {
> >>>>> -     u32 val;
> >>>>> +     if (power_on) {
> >>>>> +             u32 val;
> >>>>>
> >>>>> -     /* read cookie */
> >>>>> -     val = ath11k_pcic_read32(ab, PCIE_Q6_COOKIE_ADDR);
> >>>>> -     ath11k_dbg(ab, ATH11K_DBG_PCI, "pcie_q6_cookie_addr 0x%x\n", val);
> >>>>> +             /* read cookie */
> >>>>> +             val = ath11k_pcic_read32(ab, PCIE_Q6_COOKIE_ADDR);
> >>>>> +             ath11k_dbg(ab, ATH11K_DBG_PCI, "pcie_q6_cookie_addr 0x%x\n", val);
> >>>>>
> >>>>> -     val = ath11k_pcic_read32(ab, WLAON_WARM_SW_ENTRY);
> >>>>> -     ath11k_dbg(ab, ATH11K_DBG_PCI, "wlaon_warm_sw_entry 0x%x\n", val);
> >>>>> +             val = ath11k_pcic_read32(ab, WLAON_WARM_SW_ENTRY);
> >>>>> +             ath11k_dbg(ab, ATH11K_DBG_PCI, "wlaon_warm_sw_entry 0x%x\n", val);
> >>>>> +     }
> >>>>>
> >>>>>       /* TODO: exact time to sleep is uncertain */
> >>>>>       mdelay(10);
> >>>>> @@ -223,14 +225,18 @@ static void ath11k_pci_clear_dbg_registers(struct ath11k_base *ab)
> >>>>>       ath11k_pcic_write32(ab, WLAON_WARM_SW_ENTRY, 0);
> >>>>>       mdelay(10);
> >>>>>
> >>>>> -     val = ath11k_pcic_read32(ab, WLAON_WARM_SW_ENTRY);
> >>>>> -     ath11k_dbg(ab, ATH11K_DBG_PCI, "wlaon_warm_sw_entry 0x%x\n", val);
> >>>>> +     if (power_on) {
> >>>>> +             u32 val;
> >>>>>
> >>>>> -     /* A read clear register. clear the register to prevent
> >>>>> -      * Q6 from entering wrong code path.
> >>>>> -      */
> >>>>> -     val = ath11k_pcic_read32(ab, WLAON_SOC_RESET_CAUSE_REG);
> >>>>> -     ath11k_dbg(ab, ATH11K_DBG_PCI, "soc reset cause %d\n", val);
> >>>>> +             val = ath11k_pcic_read32(ab, WLAON_WARM_SW_ENTRY);
> >>>>> +             ath11k_dbg(ab, ATH11K_DBG_PCI, "wlaon_warm_sw_entry 0x%x\n", val);
> >>>>> +
> >>>>> +             /* A read clear register. clear the register to prevent
> >>>>> +             * Q6 from entering wrong code path.
> >>>>> +             */
> >>>>> +             val = ath11k_pcic_read32(ab, WLAON_SOC_RESET_CAUSE_REG);
> >>>>> +             ath11k_dbg(ab, ATH11K_DBG_PCI, "soc reset cause %d\n", val);
> >>>>> +     }
> >>>>>  }
> >>>>>
> >>>>>  static int ath11k_pci_set_link_reg(struct ath11k_base *ab,
> >>>>> @@ -368,7 +374,7 @@ static void ath11k_pci_sw_reset(struct ath11k_base *ab, bool power_on)
> >>>>>       }
> >>>>>
> >>>>>       ath11k_mhi_clear_vector(ab);
> >>>>> -     ath11k_pci_clear_dbg_registers(ab);
> >>>>> +     ath11k_pci_clear_dbg_registers(ab, power_on);
> >>>>>       ath11k_pci_soc_global_reset(ab);
> >>>>>       ath11k_mhi_set_mhictrl_reset(ab);
> >>>>>  }
> >>>>
> >>>
> >>
>





[Index of Archives]     [Linux Host AP]     [ATH6KL]     [Linux Wireless Personal Area Network]     [Linux Bluetooth]     [Wireless Regulations]     [Linux Netdev]     [Kernel Newbies]     [Linux Kernel]     [IDE]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite Hiking]     [MIPS Linux]     [ARM Linux]     [Linux RAID]

  Powered by Linux