On Wed, Aug 6, 2025 at 2:50 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>
> On Wed, Aug 06, 2025 at 02:38:12PM -0400, Jim Quinlan wrote:
> > On Wed, Aug 6, 2025 at 2:15 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > >
> > > On Fri, Jun 13, 2025 at 06:08:43PM -0400, Jim Quinlan wrote:
> > > > Whereas most PCIe HW returns 0xffffffff on illegal accesses and the like,
> > > > by default Broadcom's STB PCIe controller effects an abort.  Some SoCs --
> > > > 7216 and its descendants -- have new HW that identifies error details.
> > >
> > > What's the long term plan for this?  This abort is a huge problem that
> > > we're seeing across arm64 platforms.  Forcing a panic and reboot for
> > > every uncorrectable error is pretty hard to deal with.
> >
> > Are you referring to STB/CM systems, Rpi, or something else altogether?
>
> Just in general.  I saw this recently with a Nuvoton NPCM8xx PCIe
> controller.  I'm not an arm64 guy, but I've been told that these
> aborts are basically unrecoverable from a kernel perspective.  For
> some reason several PCIe controllers intended for arm64 seem to raise
> aborts on PCIe errors.  At the moment, that means we can't recover
> from errors like surprise unplugs and other things that *should* be
> recoverable (perhaps at the cost of resetting or disabling a PCIe
> device).

FWIW, our original RC controller was paired with MIPs, so it could be
that a number of non-x86 camps just went with the panic-y behavior.  I
believe that the PCIe spec allows this rude behavior, or doesn't
specifically disallow it.

I also remember that there is an ARM standard initiative for ARM-based
systems that requires the PCIe error-gets-0xffffffff behavior.  We
obviously don't conform.

At any rate, I will send an email now to the HW folks I know to remind
them that we need this behavior, at least as a configurable option.

Regards,
Jim Quinlan
Broadcom STB/CM

>
> > > Is there a plan to someday recover from these aborts?  Or change the
> > > hardware so it can at least be configured to return ~0 data after
> > > logging the error in the hardware registers?
> >
> > Some of our upcoming chips will have the ability to do nothing on
> > errant PCIe writes and return 0xffffffff on errant PCIe reads.  But
> > none of our STB/CM chips do this currently.  I've been asking for
> > this behavior for years but I have limited influence on what happens
> > in HW.
>
> Fingers crossed for either that or some other way to make these things
> recoverable.
>
> > > > This simple handler determines if the PCIe controller was the
> > > > cause of the abort and if so, prints out diagnostic info.
> > > > Unfortunately, an abort still occurs.
> > > >
> > > > Care is taken to read the error registers only when the PCIe
> > > > bridge is active and the PCIe registers are acceptable.
> > > > Otherwise, a "die" event caused by something other than the PCIe
> > > > could cause an abort if the PCIe "die" handler tried to access
> > > > registers when the bridge is off.
> > >
> > > Checking whether the bridge is active is a "mostly-works"
> > > situation since it's always racy.
> >
> > I'm not sure I understand the "racy" comment.  If the PCIe bridge is
> > off, we do not read the PCIe error registers.  In this case, PCIe is
> > probably not the cause of the panic.  In the rare case the PCIe
> > bridge is off and it was the PCIe that caused the panic, nothing
> > gets reported, and this is where we are without this commit.
> > Perhaps this is what you mean by "mostly-works".  But this is the
> > best that can be done with SW given our HW.
>
> Right, my fault.  The error report registers don't look like standard
> PCIe things, so I suppose they are on the host side, not the PCIe
> side, so they're probably guaranteed to be accessible and non-racy
> unless the bridge is in reset.
>
> Bjorn
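
Editorial sketch, not code from the patch under discussion: the guard
described in the quoted commit message (read the error registers only
when the bridge is out of reset, and report only if the controller
actually captured an error) can be modelled in plain userspace C as
below.  Every name here (struct fake_pcie, FAKE_ERR_VALID,
fake_pcie_report_abort) is hypothetical, and the "registers" are
ordinary struct fields standing in for the host-side error registers.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical "error captured" bit; not a real register layout. */
#define FAKE_ERR_VALID 0x1u

/* Hypothetical model of the controller state a die handler would see. */
struct fake_pcie {
	bool bridge_in_reset;	/* true while the bridge is held in reset */
	uint32_t err_status;	/* stand-in for an error status register */
	uint32_t err_addr;	/* stand-in for the captured faulting address */
};

/*
 * Model of the decision made at "die" time.  Returns true and fills buf
 * with a diagnostic line only when PCIe looks like the cause of the
 * abort.  Returns false, leaving buf empty, when the bridge is in reset
 * (registers must not be touched) or when no error was captured.
 */
bool fake_pcie_report_abort(const struct fake_pcie *pcie,
			    char *buf, size_t len)
{
	assert(pcie != NULL && buf != NULL && len > 0);
	buf[0] = '\0';

	/* Guard first: in real HW, a read here could itself abort. */
	if (pcie->bridge_in_reset)
		return false;

	/* Some other subsystem caused the die event. */
	if (!(pcie->err_status & FAKE_ERR_VALID))
		return false;

	snprintf(buf, len, "PCIe abort: status=0x%08" PRIx32
		 " addr=0x%08" PRIx32, pcie->err_status, pcie->err_addr);
	return true;
}
```

In this model, a bridge in reset always yields "nothing reported", which
corresponds to the case Jim describes as being no worse than the
behaviour without the commit.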