On Tue, Aug 12, 2025 at 4:21 AM Baochen Qiang <baochen.qiang@xxxxxxxxxxxxxxxx> wrote: > > > > On 8/11/2025 8:26 PM, Balsam Chihi wrote: > > On Mon, Aug 11, 2025 at 1:00 PM Baochen Qiang > > <baochen.qiang@xxxxxxxxxxxxxxxx> wrote: > >> > >> > >> > >> On 8/11/2025 5:08 PM, Balsam Chihi wrote: > >>> On Fri, Jul 25, 2025 at 5:25 AM Baochen Qiang > >>> <baochen.qiang@xxxxxxxxxxxxxxxx> wrote: > >>>> > >>>> > >>>> > >>>> On 6/12/2025 12:40 AM, balsam.chihi@xxxxxxxxxxx wrote: > >>>>> From: Balsam CHIHI <balsam.chihi@xxxxxxxxxxx> > >>>>> > >>> > >>> Hello, sorry for the delay. I was OoO. > >>> > >>>>> During system reboot or module removal (rmmod), a crash was observed > >>>>> due to a synchronous external abort caused by invalid register access > >>>>> in ath11k_pci_clear_dbg_registers(). This happens when the device is > >>>>> already powered off and the driver attempts to read debug registers. > >>>> > >>>> What does 'powered off' mean exactly here? is WLAN powered by an internal/external supply > >>>> that will be removed during rmmod/reboot? > >>> > >>> The "powered off" means that the wlan card is rmmod or reboot. > >>> It is not powered by external supply. > >> > >> OK. > >> > >>> > >>>> > >>>>> > >>>>> The crash log shows: > >>>>> root@OpenWrt:~# reboot > >>>>> [ 343.663492] Internal error: synchronous external abort: > >>>>> 0000000096000210 [#1] SMP > >>>>> [ 343.670992] Modules linked in: [...] > >>>>> [ 343.842432] CPU: 7 PID: 9435 Comm: procd Tainted: G O 6.6.86 #0 > >>>>> [ 343.849746] Hardware name: LS1088A > >>>>> [ 343.856969] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) > >>>>> [ 343.863933] pc : ath11k_pci_get_msi_irq+0x18a0/0x1900 [ath11k_pci] > >>>> > >>>> I don't see any relationship between ath11k_pci_clear_dbg_registers() and this PC pointer. > >>>> So how did you locate the problematic register access then? > >>> > >>> I tyied to look for the problem based the PC pointer but I was not > >>> abled to find it. > >>> So, I added prints in ath11k driver to find out where it stucks, > >>> classical debug method. > >> > >> Yeah, classical and always work. > >> > >>> Then I found out that any "ath11k_pcic_read32()" call fails on rmmod > >>> ath11k_pci or reboot, > >> > >> I worry PCIe link is not stable on your platform when rmmod/reboot. Can you try with ASPM > >> disabled? > > > > ASPM is indeed disabled in the Kernel config. > > I will also try with your patch and get back to you with the results. > > And in the meanwhile, I would like to let you know the following information : > > Other PCIe devices are working well even with ASPM enabled in the same slot, > > and the issue is present in all PCIe slots. > > Thanks, the info helps. > > > I have many board variants of the same platform that showed the same output. > > Maybe it is only related to the Qualcomm QCN9074 WiFi6E chipset? > > I think so. > > > I tried 2 wifi cards from two manufacturers of the same wifi chip and > > the issue is the same. > > Do you have a WCN6855 WiFi chip to try? I don't have any other WiFi card right now. Only QCN9074 from different manufacturers behaved identically. > > > Is it related to the SoC Freescale Layerscape LS1088A? > > I never heard of such issues before. For now I am suspecting that your SoC PCIe RC behaves > differently from other platforms, like X86. > > It seems like the kernel is accessing a freed resource or no longer > > existing memory address because the wifi card is turned off. > > WiFi chip might enter a state where access to some specific registers could not be > completed, hence the 'abort' error. > > > > > > >> > >> --- a/drivers/net/wireless/ath/ath11k/pci.c > >> +++ b/drivers/net/wireless/ath/ath11k/pci.c > >> @@ -605,11 +605,13 @@ static void ath11k_pci_aspm_disable(struct ath11k_pci *ab_pci) > >> > >> static void ath11k_pci_aspm_restore(struct ath11k_pci *ab_pci) > >> { > >> +#if 0 > >> if (test_and_clear_bit(ATH11K_PCI_ASPM_RESTORE, &ab_pci->flags)) > >> pcie_capability_clear_and_set_word(ab_pci->pdev, PCI_EXP_LNKCTL, > >> PCI_EXP_LNKCTL_ASPMC, > >> ab_pci->link_ctl & > >> PCI_EXP_LNKCTL_ASPMC); > >> +#endif > >> } > >> > >> #ifdef CONFIG_DEV_COREDUMP > >> > > Anyway, please try with above change. In addition you can manually confirm ASPM status in > LinkCtrl register with lspci tool, on my machine it is like: > > $ sudo lspci -s 02:00.0 -vv > 02:00.0 Network controller: Qualcomm Device 1107 (rev 01) > [...] > Capabilities: [70] Express (v2) Endpoint, MSI 00 > [...] > LnkCtl: ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+ > ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- > [...] > # lspci -s 0000:01:00.0 -vv 0000:01:00.0 Network controller: Qualcomm Technologies, Inc QCN6024/9024/9074 Wireless Network Adapter (rev 01) Subsystem: Qualcomm Technologies, Inc QCN6024/9024/9074 Wireless Network Adapter Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- LnkCtl: ASPM Disabled; RCB 64 bytes, LnkDisable- CommClk- ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- I removed my patch, applied yours, and the issue came back. According to dmesg logs, ath11k_pci_aspm_disable() is called on probe and on remove. The crash happens after the ath11k_pci_aspm_disable() call when rmmod or reboot. > > >> > >>> but they work fine when insmod/boot ath11k_pci. > >>> > >>>> > >>>>> [ 343.870125] lr : ath11k_pcic_init_msi_config+0x98/0xc4 [ath11k] > >>>>> [...] > >>>>> [ 343.950800] Call trace: > >>>>> [ 343.953241] ath11k_pci_get_msi_irq+0x18a0/0x1900 [ath11k_pci] > >>>>> [ 343.959080] ath11k_pcic_init_msi_config+0x98/0xc4 [ath11k] > >>>>> [ 343.964671] ath11k_pcic_read32+0x30/0xb4 [ath11k] > >>>>> [ 343.969477] ath11k_pci_get_msi_irq+0x528/0x1900 [ath11k_pci] > >>>>> [ 343.975230] ath11k_pci_get_msi_irq+0x1460/0x1900 [ath11k_pci] > >>>>> [ 343.981068] ath11k_pci_get_msi_irq+0x1750/0x1900 [ath11k_pci] > >>>>> [ 343.986906] pci_device_shutdown+0x34/0x44 > >>>>> [ 343.991004] device_shutdown+0x160/0x268 > >>>>> [ 343.994928] kernel_restart+0x40/0xc0 > >>>>> [ 343.998594] __do_sys_reboot+0x104/0x23c > >>>>> [ 344.002518] __arm64_sys_reboot+0x24/0x30 > >>>>> [ 344.006529] do_el0_svc+0x6c/0x100 > >>>>> [ 344.009931] el0_svc+0x28/0x9c > >>>>> [ 344.012986] el0t_64_sync_handler+0x120/0x12c > >>>>> [ 344.017344] el0t_64_sync+0x178/0x17c > >>>>> [ 344.021009] Code: f94e0a80 92404a73 91420273 8b130013 (b9400273) > >>>>> [ 344.027102] ---[ end trace 0000000000000000 ]--- > >>>>> > >>>>> This patch adds a `power_on` flag to conditionally skip debug register > >>>>> reads when the device is not powered, preventing invalid memory access. > >>>> > >>>> why is only 'read' concerned here? > >>> > >>> Good question, I can't explain why only the reading fails but not the writing. > >>> > >>>> > >>>> You are skipping several 'read' in this change, could you help narrow down which > >>>> individual read instance causes this? > >>>> > >>> > >>> All read calls fail on rmmod/reboot, when > >>> ath11k_pci_clear_dbg_registers() is called. > >>> But they all succeed when insmod/boot ath11k_pci. > >>> > >>>>> > >>>>> With this change, the system no longer crashes on reboot or rmmod, > >>>>> and the driver continues to function correctly when reloaded. > >>>>> > >>>>> Tested-on: QCN9074 hw2.0 PCIe on LS1088A > >>>>> Tested-on: OpenWrt 24.10.1 > >>>>> Fixes: crash on reboot/rmmod due to invalid register access > >>>>> Signed-off-by: Balsam CHIHI <balsam.chihi@xxxxxxxxxxx> > >>>>> --- > >>>>> drivers/net/wireless/ath/ath11k/pci.c | 36 ++++++++++++++++----------- > >>>>> 1 file changed, 21 insertions(+), 15 deletions(-) > >>>>> > >>>>> diff --git a/drivers/net/wireless/ath/ath11k/pci.c b/drivers/net/wireless/ath/ath11k/pci.c > >>>>> index 78444f8ea153..0c1b55957557 100644 > >>>>> --- a/drivers/net/wireless/ath/ath11k/pci.c > >>>>> +++ b/drivers/net/wireless/ath/ath11k/pci.c > >>>>> @@ -203,16 +203,18 @@ static void ath11k_pci_soc_global_reset(struct ath11k_base *ab) > >>>>> ath11k_warn(ab, "link down error during global reset\n"); > >>>>> } > >>>>> > >>>>> -static void ath11k_pci_clear_dbg_registers(struct ath11k_base *ab) > >>>>> +static void ath11k_pci_clear_dbg_registers(struct ath11k_base *ab, bool power_on) > >>>>> { > >>>>> - u32 val; > >>>>> + if (power_on) { > >>>>> + u32 val; > >>>>> > >>>>> - /* read cookie */ > >>>>> - val = ath11k_pcic_read32(ab, PCIE_Q6_COOKIE_ADDR); > >>>>> - ath11k_dbg(ab, ATH11K_DBG_PCI, "pcie_q6_cookie_addr 0x%x\n", val); > >>>>> + /* read cookie */ > >>>>> + val = ath11k_pcic_read32(ab, PCIE_Q6_COOKIE_ADDR); > >>>>> + ath11k_dbg(ab, ATH11K_DBG_PCI, "pcie_q6_cookie_addr 0x%x\n", val); > >>>>> > >>>>> - val = ath11k_pcic_read32(ab, WLAON_WARM_SW_ENTRY); > >>>>> - ath11k_dbg(ab, ATH11K_DBG_PCI, "wlaon_warm_sw_entry 0x%x\n", val); > >>>>> + val = ath11k_pcic_read32(ab, WLAON_WARM_SW_ENTRY); > >>>>> + ath11k_dbg(ab, ATH11K_DBG_PCI, "wlaon_warm_sw_entry 0x%x\n", val); > >>>>> + } > >>>>> > >>>>> /* TODO: exact time to sleep is uncertain */ > >>>>> mdelay(10); > >>>>> @@ -223,14 +225,18 @@ static void ath11k_pci_clear_dbg_registers(struct ath11k_base *ab) > >>>>> ath11k_pcic_write32(ab, WLAON_WARM_SW_ENTRY, 0); > >>>>> mdelay(10); > >>>>> > >>>>> - val = ath11k_pcic_read32(ab, WLAON_WARM_SW_ENTRY); > >>>>> - ath11k_dbg(ab, ATH11K_DBG_PCI, "wlaon_warm_sw_entry 0x%x\n", val); > >>>>> + if (power_on) { > >>>>> + u32 val; > >>>>> > >>>>> - /* A read clear register. clear the register to prevent > >>>>> - * Q6 from entering wrong code path. > >>>>> - */ > >>>>> - val = ath11k_pcic_read32(ab, WLAON_SOC_RESET_CAUSE_REG); > >>>>> - ath11k_dbg(ab, ATH11K_DBG_PCI, "soc reset cause %d\n", val); > >>>>> + val = ath11k_pcic_read32(ab, WLAON_WARM_SW_ENTRY); > >>>>> + ath11k_dbg(ab, ATH11K_DBG_PCI, "wlaon_warm_sw_entry 0x%x\n", val); > >>>>> + > >>>>> + /* A read clear register. clear the register to prevent > >>>>> + * Q6 from entering wrong code path. > >>>>> + */ > >>>>> + val = ath11k_pcic_read32(ab, WLAON_SOC_RESET_CAUSE_REG); > >>>>> + ath11k_dbg(ab, ATH11K_DBG_PCI, "soc reset cause %d\n", val); > >>>>> + } > >>>>> } > >>>>> > >>>>> static int ath11k_pci_set_link_reg(struct ath11k_base *ab, > >>>>> @@ -368,7 +374,7 @@ static void ath11k_pci_sw_reset(struct ath11k_base *ab, bool power_on) > >>>>> } > >>>>> > >>>>> ath11k_mhi_clear_vector(ab); > >>>>> - ath11k_pci_clear_dbg_registers(ab); > >>>>> + ath11k_pci_clear_dbg_registers(ab, power_on); > >>>>> ath11k_pci_soc_global_reset(ab); > >>>>> ath11k_mhi_set_mhictrl_reset(ab); > >>>>> } > >>>> > >>> > >> >