Ping, any chance you could try the patch below to collect a little
more data? The proposed quirk covers up a deeper problem, and I want
to fix that problem instead of covering it up.

On Tue, Jun 17, 2025 at 03:12:55PM -0500, Bjorn Helgaas wrote:
> [+cc Chaitanya, Ville for possible Intel NVMe contacts]
>
> On Mon, Jun 16, 2025 at 09:38:25PM +0800, Hui Wang wrote:
> > On 6/16/25 19:55, Hui Wang wrote:
> > > On 6/13/25 00:48, Bjorn Helgaas wrote:
> > > > [+cc VMD folks]
> > > >
> > > > On Wed, Jun 11, 2025 at 06:14:42PM +0800, Hui Wang wrote:
> > > > > Prior to commit d591f6804e7e ("PCI: Wait for device readiness with
> > > > > Configuration RRS"), this Intel nvme [8086:0a54] worked well. Since
> > > > > that patch was merged into the kernel, this nvme has stopped
> > > > > working.
> > > > >
> > > > > Through debugging, we found that commit introduced RRS polling in
> > > > > pci_dev_wait(). For this nvme, polling PCI_VENDOR_ID returns ~0 if
> > > > > the config access is not ready yet, but the polling expects a
> > > > > return value of 0x0001 or a valid vendor ID, so the RRS polling
> > > > > doesn't work for this nvme.
> > > >
> > > > Sorry for breaking this, and thanks for all your work in debugging
> > > > this! Issues like this are really hard to track down.
> > > >
> > > > I would think we would have heard about this earlier if the NVMe
> > > > device were broken on all systems. Maybe there's some connection
> > > > with VMD? From the non-working dmesg log in your bug report
> > > > (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/+attachment/5879970/+files/dmesg-60.txt):
> > > >
> > > >   DMI: ASUSTeK COMPUTER INC. ESC8000 G4/Z11PG-D24 Series, BIOS 5501 04/17/2019
> > > >   vmd 0000:d7:05.5: PCI host bridge to bus 10000:00
> > > >   pci 10000:00:02.0: [8086:2032] type 01 class 0x060400 PCIe Root Port
> > > >   pci 10000:00:02.0: PCI bridge to [bus 01]
> > > >   pci 10000:00:02.0: bridge window [mem 0xf8000000-0xf81fffff]: assigned
> > > >   pci 10000:01:00.0: [8086:0a54] type 00 class 0x010802 PCIe Endpoint
> > > >   pci 10000:01:00.0: BAR 0 [mem 0x00000000-0x00003fff 64bit]
> > > >
> > > > <I think vmd_enable_domain() calls pci_reset_bus() here>
> > >
> > > Yes, and pci_dev_wait() is called here. With the RRS polling, we get
> > > ~0 from PCI_VENDOR_ID, and then 0xffffffff when configuring BAR0
> > > afterwards. With the original polling method, there is enough delay
> > > in pci_dev_wait(), so the nvme works normally.
> > >
> > > The line "[ 10.193589] hhhhhhhhhhhhhhhhhhhhhhhhhhhh dev->device = 0a54
> > > id = ffffffff" is output from pci_dev_wait(), please refer to
> > > https://launchpadlibrarian.net/798708446/LP2111521-dmesg-test9.txt
> > >
> > > >   pci 10000:01:00.0: BAR 0 [mem 0xf8010000-0xf8013fff 64bit]: assigned
> > > >   pci 10000:01:00.0: BAR 0: error updating (high 0x00000000 != 0xffffffff)
> > > >   pci 10000:01:00.0: BAR 0 [mem 0xf8010000-0xf8013fff 64bit]: assigned
> > > >   pci 10000:01:00.0: BAR 0: error updating (0xf8010004 != 0xffffffff)
> > > >   nvme nvme0: pci function 10000:01:00.0
> > > >   nvme 10000:01:00.0: enabling device (0000 -> 0002)
> > > >
> > > > Things I notice:
> > > >
> > > > - The 10000:01:00.0 NVMe device is behind a VMD bridge
> > > >
> > > > - We successfully read the Vendor & Device IDs (8086:0a54)
> > > >
> > > > - The NVMe device is uninitialized. We successfully sized the BAR,
> > > >   which included successful config reads and writes. The BAR wasn't
> > > >   assigned by BIOS, which is normal since it's behind VMD.
> > > >
> > > > - We allocated space for BAR 0 but the config writes to program the
> > > >   BAR failed. The read back from the BAR was 0xffffffff; probably a
> > > >   PCIe error, e.g., the NVMe device didn't respond.
> > > >
> > > > - The device *did* respond when nvme_probe() enabled it: the
> > > >   "enabling device (0000 -> 0002)" means pci_enable_resources() read
> > > >   PCI_COMMAND and got 0x0000.
> > > >
> > > > - The dmesg from the working config doesn't include the "enabling
> > > >   device" line, which suggests that pci_enable_resources() saw
> > > >   PCI_COMMAND_MEMORY (0x0002) already set and didn't bother setting
> > > >   it again. I don't know why it would already be set.
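As an aside that matters below: the RRS poll treats a Vendor ID of
0x0001 as "device not ready yet, keep waiting" and anything else as
"device ready". From memory, the check in drivers/pci/probe.c is
roughly this (simplified sketch, not the verbatim source):

  /*
   * Sketch of the RRS check: a completion with Request Retry Status
   * is exposed to software as the reserved PCI-SIG Vendor ID 0x0001.
   */
  bool pci_bus_rrs_vendor_id(u32 l)
  {
          return (l & 0xffff) == PCI_VENDOR_ID_PCI_SIG;   /* 0x0001 */
  }

Note that a synthesized error response of 0xffffffff fails this check,
so pci_dev_wait() treats it as a valid Vendor ID and stops waiting
immediately instead of retrying.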
> > > > d591f6804e7e really only changes pci_dev_wait(), which is used after
> > > > device resets. I think vmd_enable_domain() resets the VMD Root Ports
> > > > after pci_scan_child_bus(), and maybe we're not waiting long enough
> > > > afterwards.
> > > >
> > > > My guess is that we got the ~0 because we did a config read too soon
> > > > after reset and the device didn't respond. The Root Port would time
> > > > out, log an error, and synthesize ~0 data to complete the CPU read
> > > > (see PCIe r6.0, sec 2.3.2 implementation note).
> > > >
> > > > It's *possible* that we waited long enough but the NVMe device is
> > > > broken and didn't respond when it should have, but my money is on a
> > > > software defect.
> > > >
> > > > There are a few pci_dbg() calls about these delays; can you set
> > > > CONFIG_DYNAMIC_DEBUG=y and boot with dyndbg="file drivers/pci/* +p"
> > > > to collect that output? Please also collect the "sudo lspci -vv"
> > > > output from a working system.
> > >
> > > I've already passed the testing request to the bug reporters; waiting
> > > for their feedback.
> > >
> > > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/comments/55
> >
> > This is the dmesg with dyndbg="file drivers/pci/* +p":
> >
> > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/comments/56
>
> Thank you very much! I'm stumped.
>
> Both Ports here are capable of 8GT/s, and the Root Port is hot-plug
> capable and supports Data Link Layer Link Active Reporting but not DRS:
>
>   10000:00:02.0 Intel Sky Lake-E PCI Express Root Port C
>     LnkCap: Port #11, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <16us
>             ClockPM- Surprise+ LLActRep+ BwNot+ ASPMOptComp+
>     LnkSta: Speed 8GT/s, Width x4
>             TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
>     SltCap: AttnBtn- PwrCtrl- MRL- AttnInd+ PwrInd+ HotPlug+ Surprise+
>     LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
>
>   10000:01:00.0 Intel NVMe Datacenter SSD
>     LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L0s, Exit Latency L0s <64ns
>
> After a reset, we have to delay before doing config reads. The wait
> time requirements are in PCIe r6.0, sec 6.6.1:
>
>   • ... For cases where system software cannot determine that DRS is
>     supported by the attached device, or by the Downstream Port above
>     the attached device [as in this case]:
>
>     ◦ ...
>
>     ◦ With a Downstream Port that supports Link speeds greater than
>       5.0 GT/s, software must wait a minimum of 100 ms after Link
>       training completes before sending a Configuration Request to
>       the device immediately below that Port. Software can determine
>       when Link training completes by polling the Data Link Layer
>       Link Active bit or by setting up an associated interrupt (see
>       § Section 6.7.3.3). It is strongly recommended for software to
>       use this mechanism whenever the Downstream Port supports it.
>
> Since the Port supports 8GT/s, we should wait 100ms after Link training
> completes (PCI_EXP_LNKSTA_DLLLA becomes set), and based on the code
> path below, I think we *do*:
>
>   pci_bridge_secondary_bus_reset
>     pcibios_reset_secondary_bus
>       pci_reset_secondary_bus
>         # assert PCI_BRIDGE_CTL_BUS_RESET
>     pci_bridge_wait_for_secondary_bus(bridge, "bus reset")
>       pci_dbg("waiting %d ms for downstream link, after activation\n") # 10.86
>       delay = pci_bus_max_d3cold_delay()          # default 100ms
>       pcie_wait_for_link_delay(bridge, active=true, delay=100)
>         pcie_wait_for_link_status(use_lt=false, active=true)
>           end_jiffies = jiffies + PCIE_LINK_RETRAIN_TIMEOUT_MS
>           do {
>             pcie_capability_read_word(PCI_EXP_LNKSTA, &lnksta)
>             if ((lnksta & PCI_EXP_LNKSTA_DLLLA) == PCI_EXP_LNKSTA_DLLLA)
>               return 0
>             msleep(1)       # likely wait several ms for link active
>           } while (time_before(jiffies, end_jiffies))
>         if (active)         # (true)
>           msleep(delay)     # wait 100ms here
>       pci_dev_wait(child, "bus reset", PCIE_RESET_READY_POLL_MS - delay)
>         delay = 1
>         for (;;) {
>           pci_read_config_dword(PCI_VENDOR_ID, &id)
>           if (!pci_bus_rrs_vendor_id(id))   # got 0xffff, assume valid
>             break;
>         }
>         pci_dbg("ready 0ms after bus reset", delay - 1)   # 11.11
>
> And from the dmesg log:
>
>   [   10.862226] pci 10000:00:02.0: waiting 100 ms for downstream link, after activation
>   [   11.109581] pci 10000:01:00.0: ready 0ms after bus reset
>
> it looks like we waited about .25s (250ms) after the link came up
> before trying to read the Vendor ID, which should be plenty.
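(To make the arithmetic explicit, from the two timestamps above:

    11.109581 s - 10.862226 s = 0.247355 s, i.e., ~247 ms

elapsed between starting the post-link-up wait and the "ready" message,
which is consistent with the DLLLA polling plus the mandated 100 ms
delay actually having happened.)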
> I guess maybe this device requires more than 250ms after reset before
> it can respond? That still seems surprising to me; I *assume* Intel
> would have tested this for spec conformance. But maybe there's some
> erratum. I added a couple Intel folks who are mentioned in the nvme
> history in case they have pointers.
>
> But it *is* also true that pci_dev_wait() doesn't know how to handle
> PCIe errors at all, and maybe there's a way to make it smarter.
>
> Can you try adding the patch below and collect another dmesg log (with
> dyndbg="file drivers/pci/* +p" as before)? This isn't a fix, but might
> give a little more insight into what's happening.
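For reference while reading the patch: PCI_POSSIBLE_ERROR() tests for
the all-ones value a host bridge typically synthesizes to complete a
failed read. From memory, the definitions in include/linux/pci.h are
roughly (sketch, not verbatim):

  /* ~0 is the data typically synthesized for a failed config access */
  #define PCI_ERROR_RESPONSE      (~0ULL)
  #define PCI_POSSIBLE_ERROR(val) ((val) == ((typeof(val)) PCI_ERROR_RESPONSE))

For the u32 Vendor ID read below, PCI_POSSIBLE_ERROR(id) means
id == 0xffffffff, which a physical function should never return from a
successful read.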
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index e9448d55113b..42a36ff5c6cd 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1268,6 +1268,7 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
>          bool retrain = false;
>          struct pci_dev *root, *bridge;
>  
> +        pci_dbg(dev, "%s: %s timeout %d\n", __func__, reset_type, timeout);
>          root = pcie_find_root_port(dev);
>  
>          if (pci_is_pcie(dev)) {
> @@ -1305,14 +1306,32 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
>  
>                  if (root && root->config_rrs_sv) {
>                          pci_read_config_dword(dev, PCI_VENDOR_ID, &id);
> -                        if (!pci_bus_rrs_vendor_id(id))
> -                                break;
> +                        pci_dbg(dev, "%s: vf %d read %#06x\n", __func__,
> +                                dev->is_virtfn, id);
> +                        if (pci_bus_rrs_vendor_id(id))
> +                                goto retry;
> +
> +                        /*
> +                         * We might read 0xffff if the device is a VF and
> +                         * the read was successful (the VF Vendor ID is
> +                         * 0xffff per spec).
> +                         *
> +                         * If the device is not a VF, 0xffff likely means
> +                         * there was an error on PCIe. E.g., maybe the
> +                         * device couldn't even respond with RRS status,
> +                         * and the RC timed out and synthesized ~0 data.
> +                         */
> +                        if (PCI_POSSIBLE_ERROR(id) && !dev->is_virtfn)
> +                                goto retry;
> +
> +                        break;
>                  } else {
>                          pci_read_config_dword(dev, PCI_COMMAND, &id);
>                          if (!PCI_POSSIBLE_ERROR(id))
>                                  break;
>                  }
>  
> +retry:
>                  if (delay > timeout) {
>                          pci_warn(dev, "not ready %dms after %s; giving up\n",
>                                   delay - 1, reset_type);
> @@ -4760,6 +4779,8 @@ static bool pcie_wait_for_link_delay(struct pci_dev *pdev, bool active,
>           * Some controllers might not implement link active reporting. In this
>           * case, we wait for 1000 ms + any delay requested by the caller.
>           */
> +        pci_dbg(pdev, "%s: active %d delay %d link_active_reporting %d\n",
> +                __func__, active, delay, pdev->link_active_reporting);
>          if (!pdev->link_active_reporting) {
>                  msleep(PCIE_LINK_RETRAIN_TIMEOUT_MS + delay);
>                  return true;