Re: [PATCH 2/2] PCI: brcmstb: Add panic/die handler to driver

Jim Quinlan <james.quinlan@xxxxxxxxxxxx> · Wed, 6 Aug 2025 15:16:07 -0400

On Wed, Aug 6, 2025 at 2:50 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>
> On Wed, Aug 06, 2025 at 02:38:12PM -0400, Jim Quinlan wrote:
> > On Wed, Aug 6, 2025 at 2:15 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > >
> > > On Fri, Jun 13, 2025 at 06:08:43PM -0400, Jim Quinlan wrote:
> > > > Whereas most PCIe HW returns 0xffffffff on illegal accesses and the like,
> > > > by default Broadcom's STB PCIe controller effects an abort.  Some SoCs --
> > > > 7216 and its descendants -- have new HW that identifies error details.
> > >
> > > What's the long term plan for this?  This abort is a huge problem that
> > > we're seeing across arm64 platforms.  Forcing a panic and reboot for
> > > every uncorrectable error is pretty hard to deal with.
> >
> > Are you referring to STB/CM systems, Rpi, or something else altogether?
>
> Just in general.  I saw this recently with a Nuvoton NPCM8xx PCIe
> controller.  I'm not an arm64 guy, but I've been told that these
> aborts are basically unrecoverable from a kernel perspective.  For
> some reason several PCIe controllers intended for arm64 seem to raise
> aborts on PCIe errors.  At the moment, that means we can't recover
> from errors like surprise unplugs and other things that *should* be
> recoverable (perhaps at the cost of resetting or disabling a PCIe
> device).
FWIW, our original RC controller was paired with MIPs, so it could be
that a number of non-x86 camps just went with the panic-y behavior.

I believe that the PCIe spec allows this rude behavior, or doesn't
specifically disallow it.  I also remember that there is an ARM
standard initiative for ARM-based systems that requires the PCIe
error-gets-0xffffffff behavior.  We obviously don't conform.   At any
rate, I will send an email now to the HW folks I know to remind them
that we need this behavior, at least as a configurable option.

Regards,
Jim Quinlan
Broadcom STB/CM
>
> > > Is there a plan to someday recover from these aborts?  Or change the
> > > hardware so it can at least be configured to return ~0 data after
> > > logging the error in the hardware registers?
> >
> > Some of our upcoming chips will have the ability to do nothing on
> > errant PCIe writes and return 0xffffffff on errant PCIe reads.   But
> > none of our STB/CM chips do this currently.   I've been asking for
> > this behavior for years but I have limited influence on what happens
> > in HW.
>
> Fingers crossed for either that or some other way to make these things
> recoverable.
>
> > > > This simple handler determines if the PCIe controller was the
> > > > cause of the abort and if so, prints out diagnostic info.
> > > > Unfortunately, an abort still occurs.
> > > >
> > > > Care is taken to read the error registers only when the PCIe
> > > > bridge is active and the PCIe registers are acceptable.
> > > > Otherwise, a "die" event caused by something other than the PCIe
> > > > could cause an abort if the PCIe "die" handler tried to access
> > > > registers when the bridge is off.
> > >
> > > Checking whether the bridge is active is a "mostly-works"
> > > situation since it's always racy.
> >
> > I'm not sure I understand the "racy" comment.  If the PCIe bridge is
> > off, we do not read the PCIe error registers.  In this case, PCIe is
> > probably not the cause of the panic.   In the rare case the PCIe
> > bridge is off  and it was the PCIe that caused the panic, nothing
> > gets reported, and this is where we are without this commit.
> > Perhaps this is what you mean by "mostly-works".  But this is the
> > best that can be done with SW given our HW.
>
> Right, my fault.  The error report registers don't look like standard
> PCIe things, so I suppose they are on the host side, not the PCIe
> side, so they're probably guaranteed to be accessible and non-racy
> unless the bridge is in reset.
>
> Bjorn
Attachment:
smime.p7s

Description: S/MIME Cryptographic Signature