Re: [PATCH] PCI: Disable RRS polling for Intel SSDPE2KX020T8 nvme

[+cc Chaitanya, Ville for possible Intel NVMe contacts]

On Mon, Jun 16, 2025 at 09:38:25PM +0800, Hui Wang wrote:
> On 6/16/25 19:55, Hui Wang wrote:
> > On 6/13/25 00:48, Bjorn Helgaas wrote:
> > > [+cc VMD folks]
> > > 
> > > On Wed, Jun 11, 2025 at 06:14:42PM +0800, Hui Wang wrote:
> > > > Prior to commit d591f6804e7e ("PCI: Wait for device readiness with
> > > > Configuration RRS"), this Intel nvme [8086:0a54] worked well. Since
> > > > that patch was merged into the kernel, this nvme has stopped working.
> > > > 
> > > > Through debugging, we found that the commit introduces RRS polling in
> > > > pci_dev_wait(). For this nvme, polling PCI_VENDOR_ID returns ~0 if the
> > > > config access is not ready yet, but the polling expects a return value
> > > > of 0x0001 or a valid vendor_id, so the RRS polling doesn't work for
> > > > this nvme.
> > >
> > > Sorry for breaking this, and thanks for all your work in debugging
> > > this!  Issues like this are really hard to track down.
> > > 
> > > I would think we would have heard about this earlier if the NVMe
> > > device were broken on all systems.  Maybe there's some connection with
> > > VMD?  From the non-working dmesg log in your bug report
> > > (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/+attachment/5879970/+files/dmesg-60.txt):
> > > 
> > > 
> > >    DMI: ASUSTeK COMPUTER INC. ESC8000 G4/Z11PG-D24 Series, BIOS 5501 04/17/2019
> > >    vmd 0000:d7:05.5: PCI host bridge to bus 10000:00
> > >    pci 10000:00:02.0: [8086:2032] type 01 class 0x060400 PCIe Root Port
> > >    pci 10000:00:02.0: PCI bridge to [bus 01]
> > >    pci 10000:00:02.0: bridge window [mem 0xf8000000-0xf81fffff]: assigned
> > >    pci 10000:01:00.0: [8086:0a54] type 00 class 0x010802 PCIe Endpoint
> > >    pci 10000:01:00.0: BAR 0 [mem 0x00000000-0x00003fff 64bit]
> > > 
> > >    <I think vmd_enable_domain() calls pci_reset_bus() here>
> > 
> > Yes, and pci_dev_wait() is called here. With the RRS polling, it gets
> > ~0 from PCI_VENDOR_ID and then 0xffffffff when configuring BAR 0
> > afterwards. With the original polling method, there is enough delay in
> > pci_dev_wait(), so the nvme works normally.
> > 
> > The line "[   10.193589] hhhhhhhhhhhhhhhhhhhhhhhhhhhh dev->device = 0a54
> > id = ffffffff" is output from pci_dev_wait(); please refer to
> > https://launchpadlibrarian.net/798708446/LP2111521-dmesg-test9.txt
> > 
> > >    pci 10000:01:00.0: BAR 0 [mem 0xf8010000-0xf8013fff 64bit]: assigned
> > >    pci 10000:01:00.0: BAR 0: error updating (high 0x00000000 != 0xffffffff)
> > >    pci 10000:01:00.0: BAR 0 [mem 0xf8010000-0xf8013fff 64bit]: assigned
> > >    pci 10000:01:00.0: BAR 0: error updating (0xf8010004 != 0xffffffff)
> > >    nvme nvme0: pci function 10000:01:00.0
> > >    nvme 10000:01:00.0: enabling device (0000 -> 0002)
> > > 
> > > Things I notice:
> > > 
> > >    - The 10000:01:00.0 NVMe device is behind a VMD bridge
> > > 
> > >    - We successfully read the Vendor & Device IDs (8086:0a54)
> > > 
> > >    - The NVMe device is uninitialized.  We successfully sized the BAR,
> > >      which included successful config reads and writes.  The BAR
> > >      wasn't assigned by BIOS, which is normal since it's behind VMD.
> > > 
> > >    - We allocated space for BAR 0 but the config writes to program the
> > >      BAR failed.  The read back from the BAR was 0xffffffff; probably a
> > >      PCIe error, e.g., the NVMe device didn't respond.
> > > 
> > >    - The device *did* respond when nvme_probe() enabled it: the
> > >      "enabling device (0000 -> 0002)" means pci_enable_resources() read
> > >      PCI_COMMAND and got 0x0000.
> > > 
> > >    - The dmesg from the working config doesn't include the "enabling
> > >      device" line, which suggests that pci_enable_resources() saw
> > >      PCI_COMMAND_MEMORY (0x0002) already set and didn't bother setting
> > >      it again.  I don't know why it would already be set.
> > >      d591f6804e7e really only changes pci_dev_wait(), which is used after
> > >      device resets.  I think vmd_enable_domain() resets the VMD Root Ports
> > >      after pci_scan_child_bus(), and maybe we're not waiting long enough
> > >      afterwards.
> > > 
> > > My guess is that we got the ~0 because we did a config read too soon
> > > after reset and the device didn't respond.  The Root Port would time
> > > out, log an error, and synthesize ~0 data to complete the CPU read
> > > (see PCIe r6.0, sec 2.3.2 implementation note).
> > > 
> > > It's *possible* that we waited long enough but the NVMe device is
> > > broken and didn't respond when it should have, but my money is on a
> > > software defect.
> > > 
> > > There are a few pci_dbg() calls about these delays; can you set
> > > CONFIG_DYNAMIC_DEBUG=y and boot with dyndbg="file drivers/pci/* +p" to
> > > collect that output?  Please also collect the "sudo lspci -vv" output
> > > from a working system.
> > 
> > I've already passed the testing request to the bug reporters and am
> > waiting for their feedback.
> > 
> > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/comments/55
> 
> This is the dmesg with dyndbg="file drivers/pci/* +p"
> 
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/comments/56

Thank you very much!  I'm stumped.

Both Ports here are capable of 8GT/s, and the Root Port is hot-plug
capable and supports Data Link Layer Link Active Reporting but not DRS:

  10000:00:02.0 Intel Sky Lake-E PCI Express Root Port C
    LnkCap: Port #11, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <16us
            ClockPM- Surprise+ LLActRep+ BwNot+ ASPMOptComp+
    LnkSta: Speed 8GT/s, Width x4
            TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
    SltCap: AttnBtn- PwrCtrl- MRL- AttnInd+ PwrInd+ HotPlug+ Surprise+
    LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-

  10000:01:00.0 Intel NVMe Datacenter SSD
    LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L0s, Exit Latency L0s <64ns
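
(The LLActRep+ above is why pcie_wait_for_link_delay() can poll DLLLA at
all: pdev->link_active_reporting gets set at enumeration time from the
DLLLARC bit in Link Capabilities.  Roughly, from memory of
set_pcie_port_type(), so treat this as a sketch rather than a quote:

  u32 linkcap;

  /* Data Link Layer Link Active Reporting Capable? */
  pcie_capability_read_dword(pdev, PCI_EXP_LNKCAP, &linkcap);
  if (linkcap & PCI_EXP_LNKCAP_DLLLARC)
          pdev->link_active_reporting = 1;

Without that bit we'd take the blind "1000 ms + delay" sleep instead.)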

After a reset, we have to delay before doing config reads.  The wait
time requirements are in PCIe r6.0, sec 6.6.1:

  • ... For cases where system software cannot determine that DRS is
    supported by the attached device, or by the Downstream Port above
    the attached device [as in this case]:

    ◦ ...

    ◦ With a Downstream Port that supports Link speeds greater than
      5.0 GT/s, software must wait a minimum of 100 ms after Link
      training completes before sending a Configuration Request to the
      device immediately below that Port. Software can determine when
      Link training completes by polling the Data Link Layer Link
      Active bit or by setting up an associated interrupt (see §
      Section 6.7.3.3). It is strongly recommended for software to
      use this mechanism whenever the Downstream Port supports it.

Since the Port supports 8GT/s, we should wait 100ms after Link training
completes (PCI_EXP_LNKSTA_DLLLA becomes set), and based on the code
path below, I think we *do*:

  pci_bridge_secondary_bus_reset
    pcibios_reset_secondary_bus
      pci_reset_secondary_bus
        # assert PCI_BRIDGE_CTL_BUS_RESET
    pci_bridge_wait_for_secondary_bus(bridge, "bus reset")
      pci_dbg("waiting %d ms for downstream link, after activation\n") # 10.86
      delay = pci_bus_max_d3cold_delay()    # default 100ms
      pcie_wait_for_link_delay(bridge, active=true, delay=100)
        pcie_wait_for_link_status(use_lt=false, active=true)
          end_jiffies = jiffies + PCIE_LINK_RETRAIN_TIMEOUT_MS
          do {
            pcie_capability_read_word(PCI_EXP_LNKSTA, &lnksta)
            if ((lnksta & PCI_EXP_LNKSTA_DLLLA) == PCI_EXP_LNKSTA_DLLLA)
              return 0
            msleep(1)                       # likely wait several ms for link active
          } while (time_before(jiffies, end_jiffies))
        if (active)                         # (true)
          msleep(delay)                     # wait 100ms here
      pci_dev_wait(child, "bus reset", PCIE_RESET_READY_POLL_MS - delay)
        delay = 1
        for (;;) {
          pci_read_config_dword(PCI_VENDOR_ID, &id)
          if (!pci_bus_rrs_vendor_id(id))   # got 0xffff, assume valid
            break;
        }
        pci_dbg("ready 0ms after bus reset", delay - 1)  # 11.11

And from the dmesg log:

  [   10.862226] pci 10000:00:02.0: waiting 100 ms for downstream link, after activation
  [   11.109581] pci 10000:01:00.0: ready 0ms after bus reset

it looks like we waited about .25s (250ms) after the link came up
before trying to read the Vendor ID, which should be plenty.

I guess maybe this device requires more than 250ms after reset before
it can respond?  That still seems surprising to me; I *assume* Intel
would have tested this for spec conformance.  But maybe there's some
erratum.  I added a couple Intel folks who are mentioned in the nvme
history in case they have pointers.

But it *is* also true that pci_dev_wait() doesn't know how to handle
PCIe errors at all, and maybe there's a way to make it smarter.
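
For context, the check that ends the pci_dev_wait() loop is, as I read
it, just a comparison of the low 16 bits of the read value against the
reserved PCI-SIG Vendor ID (PCI_VENDOR_ID_PCI_SIG, 0x0001 if I remember
right):

  static bool pci_bus_rrs_vendor_id(u32 l)
  {
          return (l & 0xffff) == PCI_VENDOR_ID_PCI_SIG;
  }

So a completion synthesized as all ones shows up as Vendor ID 0xffff,
which is "not RRS", and the loop exits as if the device were ready even
though the read actually failed.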

Can you try adding the patch below and collect another dmesg log (with
dyndbg="file drivers/pci/* +p" as before)?  This isn't a fix, but
might give a little more insight into what's happening.

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index e9448d55113b..42a36ff5c6cd 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1268,6 +1268,7 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
 	bool retrain = false;
 	struct pci_dev *root, *bridge;
 
+	pci_dbg(dev, "%s: %s timeout %d\n", __func__, reset_type, timeout);
 	root = pcie_find_root_port(dev);
 
 	if (pci_is_pcie(dev)) {
@@ -1305,14 +1306,32 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
 
 		if (root && root->config_rrs_sv) {
 			pci_read_config_dword(dev, PCI_VENDOR_ID, &id);
-			if (!pci_bus_rrs_vendor_id(id))
-				break;
+			pci_dbg(dev, "%s: vf %d read %#06x\n", __func__,
+				dev->is_virtfn, id);
+			if (pci_bus_rrs_vendor_id(id))
+				goto retry;
+
+			/*
+			 * We might read 0xffff if the device is a VF and
+			 * the read was successful (the VF Vendor ID is
+			 * 0xffff per spec).
+			 *
+			 * If the device is not a VF, 0xffff likely means
+			 * there was an error on PCIe.  E.g., maybe the
+			 * device couldn't even respond with RRS status,
+			 * and the RC timed out and synthesized ~0 data.
+			 */
+			if (PCI_POSSIBLE_ERROR(id) && !dev->is_virtfn)
+				goto retry;
+
+			break;
 		} else {
 			pci_read_config_dword(dev, PCI_COMMAND, &id);
 			if (!PCI_POSSIBLE_ERROR(id))
 				break;
 		}
 
+retry:
 		if (delay > timeout) {
 			pci_warn(dev, "not ready %dms after %s; giving up\n",
 				 delay - 1, reset_type);
@@ -4760,6 +4779,8 @@ static bool pcie_wait_for_link_delay(struct pci_dev *pdev, bool active,
 	 * Some controllers might not implement link active reporting. In this
 	 * case, we wait for 1000 ms + any delay requested by the caller.
 	 */
+	pci_dbg(pdev, "%s: active %d delay %d link_active_reporting %d\n",
+		__func__, active, delay, pdev->link_active_reporting);
 	if (!pdev->link_active_reporting) {
 		msleep(PCIE_LINK_RETRAIN_TIMEOUT_MS + delay);
 		return true;



