Re: PCI: hotplug_event: PCIe PLDA Device BAR Reset

On Mon, Feb 24, 2025 at 11:03 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>
> On Mon, Feb 24, 2025 at 05:45:35PM +0530, Naveen Kumar P wrote:
> > On Wed, Feb 19, 2025 at 10:36 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > > On Wed, Feb 19, 2025 at 05:52:47PM +0530, Naveen Kumar P wrote:
> > > > Hi all,
> > > >
> > > > I am writing to seek assistance with an issue we are experiencing with
> > > > a PCIe device (PLDA Device 5555) connected through PCI Express Root
> > > > Port 1 to the host bridge.
> > > >
> > > > We have observed that after booting the system, the Base Address
> > > > Register (BAR0) memory of this device gets reset to 0x0 after
> > > > approximately one hour or more (the timing is inconsistent). This was
> > > > verified using the lspci output and the setpci -s 01:00.0
> > > > BASE_ADDRESS_0 command.
> > > >
> > > > To diagnose the issue, we checked the dmesg log, but it did not
> > > > provide any relevant information. I then enabled dynamic debugging for
> > > > the PCI subsystem (drivers/pci/*) and noticed the following messages
> > > > related to ACPI hotplug in the dmesg log:
> > > >
> > > > [    0.465144] pci 0000:01:00.0: reg 0x10: [mem 0xb0400000-0xb07fffff]
> > > > ...
> > > > [ 6710.000355] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
> > > > [ 7916.250868] perf: interrupt took too long (4072 > 3601), lowering kernel.perf_event_max_sample_rate to 49000
> > > > [ 7984.719647] perf: interrupt took too long (5378 > 5090), lowering kernel.perf_event_max_sample_rate to 37000
> > > > [11051.409115] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
> > > > [11755.388727] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
> > > > [12223.885715] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
> > > > [14303.465636] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
> > > >
> > > > After these messages appear, reading the device BAR memory results in
> > > > 0x0 instead of the expected value.
> > > >
> > > > I would like to understand the following:
> > > >
> > > > 1. What could be causing these hotplug_event debug messages?
> > >
> > > This is an ACPI Notify event.  Basically the platform is telling us to
> > > re-enumerate the hierarchy below RP01 because a device might have been
> > > added or removed.
> >
> > Thank you for your response regarding the PCI BAR reset issue we are
> > experiencing with the PLDA Device 5555. I have a few follow-up
> > questions and additional information to share.
> >
> > 1. Clarification on "Platform":
> >
> > Does the term "platform" refer to the BIOS/ACPI subsystem in this context?
>
> Yes, "platform" refers to the BIOS/ACPI subsystem.
>
> > Can the platform signal to re-enumerate the hierarchy below RP01
> > without an actual device being removed or added? In our case, the PCI
> > PLDA device is neither physically removed nor connected to the bus on
> > the fly.
>
> Yes, I think a Bus Check notification is just a request for the OS to
> re-enumerate starting at the point in the device tree where it is
> notified.  It's possible that no add or remove has occurred.  ACPI
> r6.5, sec 5.6.6, includes the example of hardware that can't detect
> device changes during a system sleep state, so it issues a Bus Check
> on wake.

I booted with the pcie_aspm=off kernel parameter, so PCIe Active State
Power Management (ASPM) is disabled. Should I remove this parameter to
see whether it affects the Bus Check notifications and the BAR0 reset
issue?

>
> > 2. System Configuration:
> >
> > We are currently using an x86_64 system with Ubuntu 20.04.6 LTS
> > (kernel version: 5.4.0-148-generic).
> > I have enabled dynamic debug logs for all files in the PCI and ACPI
> > subsystems and rebooted the system with the following parameters:
> > $ cat /proc/cmdline
> > BOOT_IMAGE=/vmlinuz-5.4.0-148-generic root=/dev/mapper/vg00-rootvol ro
> > quiet libata.force=noncq pci=nomsi pcie_aspm=off pcie_ports=on
> > "dyndbg=file drivers/pci/* +p; file drivers/acpi/* +p"
> >
> >
> > 3. Observations:
> >
> > After rebooting with more debug logs, I noticed the issue after about
> > 1 day, 11 hours, and 48 minutes.
> > A snippet of the dmesg log is shown below (the complete dmesg log is
> > attached to this email):
> >
> > [128845.248503] ACPI: GPE event 0x01
> > [128845.356866] ACPI: \_SB_.PCI0.RP01: ACPI_NOTIFY_BUS_CHECK event
> > [128845.357343] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
>
> If you could add more debug in hotplug_event() and the things it
> calls, we might get more clues about what's happening.
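
As a user-space complement to kernel-side debug, a small poller could
timestamp the moment BAR0 changes so that it can be lined up against the
Bus Check messages. This is only a sketch under assumptions: the sysfs
path assumes the device sits at 0000:01:00.0 (the 0000 domain is a
guess), and read_bar0(), watch(), and their parameters are hypothetical
names. time.monotonic() is usually close to the dmesg timestamps but is
not guaranteed to match them.

```python
import struct
import time

# Assumed path: derived from the 01:00.0 address in this thread, domain 0000 guessed.
CONFIG_PATH = "/sys/bus/pci/devices/0000:01:00.0/config"


def read_bar0(path=CONFIG_PATH):
    """Read the 32-bit little-endian dword at config offset 0x10 (BASE_ADDRESS_0)."""
    with open(path, "rb") as f:
        f.seek(0x10)
        return struct.unpack("<I", f.read(4))[0]


def watch(read=read_bar0, interval=1.0, clock=time.monotonic,
          sleep=time.sleep, max_polls=None):
    """Poll BAR0 and record (timestamp, value) each time the value changes.

    The first poll is always recorded. With max_polls=None the loop runs
    until interrupted.
    """
    last = None
    changes = []
    polls = 0
    while max_polls is None or polls < max_polls:
        value = read()
        if value != last:
            stamp = clock()
            print(f"[{stamp:12.6f}] BAR0 = {value:#010x}")
            changes.append((stamp, value))
            last = value
        polls += 1
        sleep(interval)
    return changes
```

Calling watch() with no arguments polls once per second until
interrupted. The read, clock, and sleep parameters are injectable only so
that the loop can be exercised without the hardware.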
>
> > 4. BAR Reset Issue:
> >
> > I filtered the lspci output to show the configuration space row
> > starting at offset 0x10, which contains BASE_ADDRESS_0, by running
> > sudo lspci -xxx -s 01:00.0 | grep "10:".
> > Prior to the BAR reset issue, the lspci output was:
> > $ sudo lspci -xxx -s 01:00.0 | grep "10:"
> > 10: 00 00 40 b0 00 00 00 00 00 00 00 00 00 00 00 00
> >
> > During the ACPI_NOTIFY_BUS_CHECK event, the lspci output initially
> > showed all FF's, and then the next run of the same command showed
> > BASE_ADDRESS_0 reset to zero:
> > $ sudo lspci -xxx -s 01:00.0 | grep "10:"
> > 10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>
> Looks like the device isn't responding at all here.  Could happen if
> the device is reset or powered down.
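
The two rows quoted above can be decoded mechanically. The sketch below
parses one row of lspci hex output and classifies the first dword (BAR0)
as programmed, zero, or all ones. The function names and the labels are
hypothetical and chosen only for illustration.

```python
def parse_lspci_row(line):
    """Parse one hex-dump row such as '10: 00 00 40 b0 ...' into (offset, bytes)."""
    offset_text, _, hex_text = line.partition(":")
    return int(offset_text, 16), bytes.fromhex(hex_text.replace(" ", ""))


def classify_bar0(row_bytes):
    """Classify the little-endian dword at the start of the 0x10 row."""
    raw = int.from_bytes(row_bytes[:4], "little")
    if raw == 0xFFFFFFFF:
        # All ones is what a config read returns when no device responds.
        return raw, "no response"
    if raw == 0:
        return raw, "zero"
    return raw, "programmed"


before = "10: 00 00 40 b0 00 00 00 00 00 00 00 00 00 00 00 00"
during = "10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff"

for row in (before, during):
    offset, data = parse_lspci_row(row)
    value, state = classify_bar0(data)
    print(f"offset {offset:#04x}: BAR0 = {value:#010x} ({state})")
```

The first row decodes to 0xb0400000, which matches the start of the
"reg 0x10" range in the boot log.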
