On Tue, Apr 15, 2025 at 11:24:39AM -0700, Brian Norris wrote:
> Hi,
> 
> On Tue, Apr 15, 2025 at 11:02:14AM +0530, Manivannan Sadhasivam wrote:
> > On Mon, Apr 14, 2025 at 04:24:38PM -0700, Brian Norris wrote:
> > > On Mon, Apr 14, 2025 at 04:27:35PM +0530, Manivannan Sadhasivam wrote:
> > > > On Tue, Apr 08, 2025 at 12:59:36PM -0700, Brian Norris wrote:
> > > > > For link startup, pcie-designware-host.c currently
> > > > > (a) starts the link via platform-specific means (dw_pcie::ops::start_link()) and
> > > > > (b) waits for the link training to complete.
> > > > > 
> > > > > However, (b) will fail if the other end of the link is not powered up --
> > > > > e.g., if the appropriate pwrctrl driver has not yet loaded, or its
> > > > > device hasn't finished probing. Today, this can mean the designware
> > > > > driver will either fail to probe,
> > > > 
> > > > This is not correct.
> > > 
> > > That depends on the implementation of start_link(). But I suppose the
> > > intention is that start_link() only "starts" and doesn't care where
> > > things go from there. (IOW, my local start_link() implementation is
> > > probably wrong at the moment, as it performs some poll/retry steps too.)
> > 
> > The callback is supposed to just start the link and not wait for
> > anything else.
> 
> Ack, thanks. I've learned something.
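
(Right. For reference, a bare-bones start_link() looks roughly like the
sketch below. This is only an illustration: the register and field names
(MY_PCIE_LTSSM_CTRL, MY_PCIE_LTSSM_EN, to_my_pcie()) are made up, since
the LTSSM-enable knob lives in vendor-specific glue registers. The point
is that the callback kicks off link training and returns immediately;
any waiting is left to the DWC core.)

	static int my_pcie_start_link(struct dw_pcie *pci)
	{
		struct my_pcie *priv = to_my_pcie(pci);
		u32 val;

		/* Enable LTSSM. Do not poll for link up here. */
		val = readl(priv->base + MY_PCIE_LTSSM_CTRL);
		val |= MY_PCIE_LTSSM_EN;
		writel(val, priv->base + MY_PCIE_LTSSM_CTRL);

		return 0;
	}
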
> 
> > > > > or at least waste time for a condition
> > > > > that we can't achieve (link up), depending on the HW/driver
> > > > > implementation.
> > > > 
> > > > Unfortunately we cannot avoid this waiting time, as we don't know
> > > > whether a device is attached to the bus or not. The 1s wait time
> > > > predates my involvement with DWC drivers.
> > > 
> > > I don't really love that answer. It means that any DWC-based platform
> > > that needs pwrctrl and doesn't set use_link_irq==true will waste 1
> > > second per PCIe controller. While it's hard to make guarantees about old
> > > and/or unloved drivers, I'd like to think I can do better on new ones.
> > 
> > I'd like to avoid the 1s delay too. But the problem is: how would you
> > know whether a device is attached to the bus or not? The delay is there
> > to account for the fact that the link may take up to 1s to come up after
> > starting LTSSM. So if we do not wait for that period, there is a chance
> > that we would report a false-negative status and the enumeration would
> > also fail.
> 
> I understand there are cases we won't know, if we don't have
> hotplug/presence-detect wiring. But for cases we know, I think it's a
> cop-out to say "we can't handle it." See below.
> 
> > > Consider the lesser statement that was paired along with it: always
> > > wasting 1 second per controller polling for something that will never
> > > happen. It feels backwards and wasteful.
> > 
> > Again, I do get your point. But tell me: how can a controller reliably
> > detect that there is a device attached to the bus? Only on your Android
> > setup do you know for sure that the device won't be there during probe.
> > So you consider the 1s wait a waste of time, and that is fair. But what
> > if the same controller is used in another platform which is not Android,
> > or the endpoint device is powered on during probe itself, without
> > relying on userspace?
> 
> This has nothing to do with Android.
> 
> IMO, we have 3 main categories of setups that we should primarily care
> about:
> 
> (1) hotplug is supported, and PRSNT1#/PRSNT2# are wired
> (2) hotplug is not supported, but a device is present and is already
> powered.
> (3) hotplug is not supported, but a device is present. The device
> requires external power (i.e., pwrctrl / "subdevice regulators" /
> etc. should be involved).
> 
> AFAICT, we don't have much of (1). But if we did, we should also be able
> to avoid initial delays, as we can reliably detect presence, and only
> wait for training when we know it should succeed. (Or even better,
> handle it asynchronously via an interrupt.)
> 
> For (2), we're also OK. The initial polling delay is likely to be much
> less than 1 second.
> 
> For (3) ... all non-pwrctrl drivers (pcie-brcmstb.c, pcie-tegra.c,
> pcie-tegra194.c, pcie-rockchip-host.c, ...) power things up before they
> configure ports, start LTSSM, and have any expectation of detecting a
> link. If a device is there, they again should commonly find it in much
> less than 1 second.
> 
> However, when *using* pwrctrl, we have the ordering all wrong (IMO), and
> so we eat needless delay. We *will not* successfully bring the link up,
> and we *won't* find the device. This smells like a design problem, where
> we have failed to plumb the information we already have available.

I don't disagree :)

> I think you're too worried about a case (4): that hotplug is not
> supported, and a device is not present.
> 
> IMO, (4) should mostly be handled by simply disabling the unused
> controller in the device tree, or living with the timeouts. If a
> platform doesn't support hotplug, then you can't expect optimal
> behavior for unplugged devices.
> 
> I'm not complaining about (4). I'm complaining about (3) with pwrctrl.

Ok!

> > > One of my key questions: if I don't have a link-up IRQ, how can I avoid
> > > this waste? pcie-brcmstb avoids that waste today (for the common case
> > > where there is, in fact, a device connected), and it would be a
> > > regression for it to start using pwrctrl tomorrow.
> > 
> > Why are you tying pwrctrl to this designware driver behavior? The two
> > are unrelated. Even if you don't use pwrctrl and use the controller
> > driver to bring up the device, the 1s delay would still apply (if there
> > is no device).
> 
> We might be talking past each other. Per above, I think we can do better
> with (1)-(3). But you're bringing up (4). Problem (3) exists for all
> drivers, although it's more acute with DWC, and I happen to be using it.
> I also think it's indicative of larger design and ordering problems in
> pwrctrl.

Now I get what you are saying.
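
(As an aside: mechanically, the 1s comes from the retry loop in
dw_pcie_wait_for_link(). Roughly the following -- this is paraphrased,
so check pcie-designware.c for the exact constants and sleep primitive,
but it is on the order of 10 polls spaced ~100ms apart:

	int dw_pcie_wait_for_link(struct dw_pcie *pci)
	{
		int retries;

		/* ~10 retries of ~100ms each is where the ~1s comes from */
		for (retries = 0; retries < LINK_WAIT_MAX_RETRIES; retries++) {
			if (dw_pcie_link_up(pci))
				return 0;
			msleep(100);
		}

		dev_info(pci->dev, "Phy link never came up\n");
		return -ETIMEDOUT;
	}

Why those particular values were chosen is the history question I get
at below.)
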
> > The pcie-brcmstb driver indeed wastes time. It is not 1s but just 100ms.
> > But that driver is for only one vendor. In the case of DWC, the driver
> > has to work with multiple vendors. But again, I do not know how this 1s
> > delay came up. Maybe we could try to reduce it to 500ms or so, but for
> > that I need confirmation from someone like Lorenzo who knows the history.
> > 
> > > (Side note: I also just noticed pcie-tegra194.c does the same.)
> > > 
> > > > > My guess is that (1) is the case, and specifically that the relevant folks are
> > > > > using the pcie-qcom.c, with its "global" IRQ used for link-up events.
> > > > 
> > > > We only recently added support for the 'Link Up' event through
> > > > 'global_irq' in the controller driver. And this was done to avoid
> > > > waiting for link up during probe.
> > > 
> > > You're kind of reinforcing my question: you don't like the waste, so
> > > you're adding link-up IRQ support -- is that really the only way?
> > 
> > I don't know. But so far I haven't seen any other sensible way which is
> > generic.
> > 
> > > (My initial thought: no, it's not. We know when pwrctrl has done its
> > > thing -- why should we bother polling for link state before that? But
> > > that's easier said than done, when pwrctrl is optional and highly
> > > abstracted away from the DWC driver...)
> > 
> > Oh well... this is where you got it wrong. pwrctrl drivers are only
> > probed right before enumeration because of the design (which is way
> > after starting the link). As of v6.15-rc1, before we try to enumerate
> > any device, we check whether any device defined in DT requires a power
> > supply. If so, we create a platform device (or pwrctrl device) and let
> > the pwrctrl driver bind to it and power up the device. In that case, we
> > also do not proceed to scan the bus further and skip the hierarchy,
> > because the pwrctrl driver will rescan the bus once it has finished
> > powering up the device.
> 
> It sounds like you're saying "it's the way that it is, because of the
> way that it is." I understand how it is currently structured, but I'm
> saying that I think pwrctrl is placed in the wrong spot. It looks cute
> and clean, but it has the ordering wrong.
> 
> IMO, we should allow pwrctrl to power things up earlier, so that
> controller drivers have a better chance of hitting the optimal cases
> (case (3) above) properly. (That's also how every pre-pwrctrl driver
> does things, and I think it's for good reason.)
> 
> That would also resolve my PERST# and other timing questions, because
> the controller driver would better know when pwrctrl is finished, and
> so could better handle PERST# and any necessary delays.
> 
> I agree this might be more difficult to do in a "generic" way (per your
> language above), depending on your definition of generic. But IMO, it's
> important to prioritize doing things correctly, even if it's slightly
> less cute.
> 
> As an example less-cute way of doing pwrctrl: expose a wrapped version
> of pci_pwrctrl_create_device() such that drivers can call it earlier. If
> there is a pwrctrl device created, that means a driver should not yet
> wait for link-up -- it should defer that until the relevant pwrctrl is
> marked "ready". (There are likely other problems to solve in here too,
> but this is just an initial sketch. And to be clear, I suspect this
> doesn't fit your notion of "generic", because it requires each driver to
> adapt to it.)

This is what I initially had in mind, but then I opted for a solution
which allowed the pwrctrl devices to be created in the PCI core itself,
without any modifications in the controller drivers. But I totally agree
with you that we now don't have any control over PERST#, and that should
be fixed.

> > > Regarding the controller design: frankly, I don't think my controller
> > > does anything all that revolutionary in this space [0]. All of my
> > > questions today can be asked (from what I can tell) of existing upstream
> > > controller drivers. I'm mostly trying to understand the expected driver
> > > design here, and that includes teasing apart what is "stuff done in
> > > 'old' drivers, but isn't recommended", what is "currently unimplemented
> > > in new stuff" (like pwrctrl [1]), and where my expectations fit in
> > > between.
> > > 
> > > For instance, poking around a bit, I come up with this question: when
> > > using pci/pwrctrl, how does one ensure timing requirements around,
> > > e.g., power stability vs. PERST# deassertion are met? When looking at
> > > a pwrctrl driver like drivers/pci/pwrctrl/slot.c, the process looks
> > > too simple:
> > > 
> > > (0) host bridge probably already started its LTSSM, deasserted PERST#
> > > (1) slot.c powers the slot
> > > (2) pci_pwrctrl_device_set_ready() -> rescan_work_func() rescans the bus
> > > 
> > > Notably, there's no enforced delay between (1) and (2).
> > > 
> > > Reading the PCIe CEM, it seems we're violating some specification bits,
> > > like:
> > > 
> > >   2.2. PERST# Signal
> > >   [...] On power-up, the de-assertion of PERST# is delayed 100 ms
> > >   (TPVPERL) from the power rails achieving specified operating limits.
> > >   [...]
> > > 
> > > There are references to this in various implementations (e.g.,
> > > tegra_pcie_enable_slot_regulators() and brcm_pcie_start_link() --
> > > although I suspect the latter is applying the wrong ordering).
> > > 
> > > Additionally, the CEM also seems to suggest we have the PERST# ordering
> > > wrong. It should also come between (1) and (2), not at (0).
> > 
> > You are absolutely right! Currently, we follow the timing requirement
> > while deasserting PERST# in the controller drivers. But once we power on
> > the slot, we do not touch PERST#, and it just happens to work.
> > 
> > We may need to introduce another callback that toggles PERST#, so that
> > we can use it while powering up the device.
> > 
> > > And finally (for now), I don't understand how we have any guarantee that
> > > step (2) is useful. Even if we've already started the LTSSM in (0), we
> > > have no idea if the link is actually Active by the time we hit (2), and
> > > so rescanning may not actually discover the device. And if that scan
> > > fails ... then when do we trigger another pci_rescan_bus()? Only if the
> > > implementation has a "link-up" IRQ?
> > 
> > As I said above, we do not enumerate a device if it has a devicetree
> > node with supplies. That's why we need (2). Otherwise, the device won't
> > be enumerated at all, unless userspace does the rescan (which defeats
> > the purpose of pwrctrl).
> 
> But you haven't addressed one of the concerns in my paragraph: how do we
> know the link is up by the time we hit the pwrctrl-initiated
> pci_rescan_bus()? We haven't gone back to ask the host bridge if it's up
> yet. We're just hoping we get lucky.
> 
> IOW, the pwrctrl sequence should be something like:
> 
> (1) power up the slot
> (1)(a) delay, per spec
> (1)(b) deassert PERST#
> (1)(c) wait for link up
> (2) rescan bus
> 
> Currently, we skip all of (1)(a)-(c). We're probably lucky that (1)(b)'s
> ordering doesn't matter all the time, as long as we did it earlier. And
> we're lucky that there are natural delays in software such that the lack
> of (1)(a) and (1)(c) isn't significant.
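
Agreed on the sequence. To make it concrete, the pwrctrl "power up and
set ready" path would need to grow into something like the sketch below.
This only illustrates the ordering: step (1) is roughly what slot.c
already does, but pwrctrl_slot_power_up(), host_bridge_deassert_perst()
and host_bridge_wait_for_link() are hypothetical names -- the latter two
would presumably become host bridge callbacks:

	static int pwrctrl_slot_power_up(struct pci_host_bridge *bridge,
					 struct regulator_bulk_data *supplies,
					 int num_supplies)
	{
		int ret;

		/* (1) power up the slot, as slot.c already does */
		ret = regulator_bulk_enable(num_supplies, supplies);
		if (ret)
			return ret;

		/* (1)(a) PCIe CEM TPVPERL: 100ms from power stable to PERST# */
		msleep(100);

		/* (1)(b) deassert PERST# -- hypothetical host bridge hook */
		host_bridge_deassert_perst(bridge);

		/* (1)(c) wait for link up -- hypothetical host bridge hook */
		ret = host_bridge_wait_for_link(bridge);
		if (ret)
			return ret;

		/* (2) rescan the bus, as rescan_work_func() does today */
		pci_rescan_bus(bridge->bus);

		return 0;
	}
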
Let me go back to the drawing board and come up with a proposal. There
are at least a couple of ways to fix this issue, and I need to pick the
least intrusive one.

Thanks for reporting it, appreciated!

- Mani

-- 
மணிவண்ணன் சதாசிவம்