On 2025/7/15 3:19, Bjorn Helgaas wrote: > [+cc Hui] > > On Sat, Jul 12, 2025 at 07:30:28PM +0800, Weili Qian wrote: >> Prior to commit d591f6804e7e ("PCI: Wait for device readiness with >> Configuration RRS"), pci_dev_wait() polls PCI_COMMAND register until >> its value is not ~0(i.e., PCI_ERROR_RESPONSE). After d591f6804e7e, >> if the Configuration Request Retry Status Software Visibility (RRS SV) >> is enabled, pci_dev_wait() polls PCI_VENDOR_ID register until its value >> is not the reserved Vendor ID value 0x0001. >> >> On Kunpeng accelerator devices, RRS SV is enabled. However, >> when the virtual function's FLR (Function Level Reset) is not >> ready, the pci_dev_wait() reads the PCI_VENDOR_ID register and gets >> the value 0xffff instead of 0x0001. It then incorrectly assumes this >> is a valid Vendor ID and concludes the device is ready, returning >> successfully. In reality, the function may not be fully ready, leading >> to the device becoming unavailable. >> >> A 100ms wait period is already implemented before calling pci_dev_wait(). >> In most cases, FLR completes within 100ms. However, to eliminate the >> risk of function being unavailable due to an incomplete FLR, a >> device-specific reset is added. After pcie_flr(), the function continues >> to poll PCI_COMMAND register until its value is no longer ~0. > > As far as I can tell, there's nothing specific to Kungpeng devices > here. We've seen a similar issue with Intel NVMe devices [1], and I > don't want a whole mess of quirks and device-specific reset methods. > > We need some sort of generic solution for this. My understanding was > that if devices are not ready 100ms after a reset, they are required > to respond with RRS. Maybe these devices are defective. Or maybe my > understanding is incorrect. Either way, I think we should at least > check for a PCIe error before assuming that 0xffff is a valid > response. A generic solution for this would be even better. I tested the patch in [1], but it did not resolve the issue. According to the PCIe specification(e.g., Section 7.5.1 PCI-Compatible Configuration Registers), the Vendor ID and Device ID for VFs are 0xffff. However, if VFs cannot respond using the RRS mechanism, the pci_dev_wait() function reads the PCI_VENDOR_ID register and gets the value 0xffffffff. This value is considered an error on PCIe, causing pci_dev_wait() to continuously poll the register until a timeout occurs. Thanks, Weili > > [1] https://lore.kernel.org/linux-pci/20250611101442.387378-1-hui.wang@xxxxxxxxxxxxx/ > >> Fixes: d591f6804e7e ("PCI: Wait for device readiness with Configuration RRS") >> Signed-off-by: Weili Qian <qianweili@xxxxxxxxxx> >> --- >> drivers/pci/quirks.c | 36 ++++++++++++++++++++++++++++++++++++ >> 1 file changed, 36 insertions(+) >> >> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c >> index d7f4ee634263..1df1756257d2 100644 >> --- a/drivers/pci/quirks.c >> +++ b/drivers/pci/quirks.c >> @@ -4205,6 +4205,36 @@ static int reset_hinic_vf_dev(struct pci_dev *pdev, bool probe) >> return 0; >> } >> >> +#define KUNPENG_OPERATION_WAIT_CNT 3000 >> +#define KUNPENG_RESET_WAIT_TIME 20 >> + >> +/* Device-specific reset method for Kunpeng accelerator virtual functions */ >> +static int reset_kunpeng_acc_vf_dev(struct pci_dev *pdev, bool probe) >> +{ >> + u32 wait_cnt = 0; >> + u32 cmd; >> + >> + if (probe) >> + return 0; >> + >> + pcie_flr(pdev); >> + >> + do { >> + pci_read_config_dword(pdev, PCI_COMMAND, &cmd); >> + if (!PCI_POSSIBLE_ERROR(cmd)) >> + break; >> + >> + if (++wait_cnt > KUNPENG_OPERATION_WAIT_CNT) { >> + pci_warn(pdev, "wait for FLR ready timeout; giving up\n"); >> + return -ENOTTY; >> + } >> + >> + msleep(KUNPENG_RESET_WAIT_TIME); >> + } while (true); >> + >> + return 0; >> +} >> + >> static const struct pci_dev_reset_methods pci_dev_reset_methods[] = { >> { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82599_SFP_VF, >> reset_intel_82599_sfp_virtfn }, >> @@ -4220,6 +4250,12 @@ static const struct pci_dev_reset_methods pci_dev_reset_methods[] = { >> reset_chelsio_generic_dev }, >> { PCI_VENDOR_ID_HUAWEI, PCI_DEVICE_ID_HINIC_VF, >> reset_hinic_vf_dev }, >> + { PCI_VENDOR_ID_HUAWEI, PCI_DEVICE_ID_HUAWEI_ZIP_VF, >> + reset_kunpeng_acc_vf_dev }, >> + { PCI_VENDOR_ID_HUAWEI, PCI_DEVICE_ID_HUAWEI_SEC_VF, >> + reset_kunpeng_acc_vf_dev }, >> + { PCI_VENDOR_ID_HUAWEI, PCI_DEVICE_ID_HUAWEI_HPRE_VF, >> + reset_kunpeng_acc_vf_dev }, >> { 0 } >> }; >> >> -- >> 2.33.0 >> > . >