On Mon, Feb 24, 2025 at 11:03 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>
> On Mon, Feb 24, 2025 at 05:45:35PM +0530, Naveen Kumar P wrote:
> > On Wed, Feb 19, 2025 at 10:36 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > > On Wed, Feb 19, 2025 at 05:52:47PM +0530, Naveen Kumar P wrote:
> > > > Hi all,
> > > >
> > > > I am writing to seek assistance with an issue we are experiencing with
> > > > a PCIe device (PLDA Device 5555) connected through PCI Express Root
> > > > Port 1 to the host bridge.
> > > >
> > > > We have observed that after booting the system, the Base Address
> > > > Register (BAR0) memory of this device gets reset to 0x0 after
> > > > approximately one hour or more (the timing is inconsistent). This was
> > > > verified using the lspci output and the setpci -s 01:00.0
> > > > BASE_ADDRESS_0 command.
> > > >
> > > > To diagnose the issue, we checked the dmesg log, but it did not
> > > > provide any relevant information. I then enabled dynamic debugging for
> > > > the PCI subsystem (drivers/pci/*) and noticed the following messages
> > > > related ACPI hotplug in the dmesg log:
> > > >
> > > > [ 0.465144] pci 0000:01:00.0: reg 0x10: [mem 0xb0400000-0xb07fffff]
> > > > ...
> > > > [ 6710.000355] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
> > > > [ 7916.250868] perf: interrupt took too long (4072 > 3601), lowering
> > > > kernel.perf_event_max_sample_rate to 49000
> > > > [ 7984.719647] perf: interrupt took too long (5378 > 5090), lowering
> > > > kernel.perf_event_max_sample_rate to 37000
> > > > [11051.409115] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
> > > > [11755.388727] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
> > > > [12223.885715] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
> > > > [14303.465636] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
> > > > After these messages appear, reading the device BAR memory results in
> > > > 0x0 instead of the expected value.
> > > >
> > > > I would like to understand the following:
> > > >
> > > > 1. What could be causing these hotplug_event debug messages?
> > >
> > > This is an ACPI Notify event. Basically the platform is telling us to
> > > re-enumerate the hierarchy below RP01 because a device might have been
> > > added or removed.
> >
> > Thank you for your response regarding the PCI BAR reset issue we are
> > experiencing with the PLDA Device 5555. I have a few follow-up
> > questions and additional information to share.
> >
> > 1. Clarification on "Platform":
> >
> > Does the term "platform" refer to the BIOS/ACPI subsystem in this context?
>
> Yes, "platform" refers to the BIOS/ACPI subsystem.
>
> > Can the platform signal to re-enumerate the hierarchy below RP01
> > without an actual device being removed or added? In our case, the PCI
> > PLDA device is neither physically removed nor connected to the bus on
> > the fly.
>
> Yes, I think a Bus Check notification is just a request for the OS to
> re-enumerate starting at the point in the device tree where it is
> notified. It's possible that no add or remove has occurred.
> ACPI r6.5, sec 5.6.6, includes the example of hardware that can't detect
> device changes during a system sleep state, so it issues a Bus Check
> on wake.

I booted with the pcie_aspm=off kernel parameter, which means that PCIe
Active State Power Management (ASPM) is disabled. Given this context,
should I consider removing this setting to see if it affects the
occurrence of the Bus Check notifications and the BAR0 reset issue?

>
> > 2. System Configuration:
> >
> > We are currently using an x86_64 system with Ubuntu 20.04.6 LTS
> > (kernel version: 5.4.0-148-generic).
> > I have enabled dynamic debug logs for all files in the PCI and ACPI
> > subsystems and rebooted the system with the following parameters:
> > $ cat /proc/cmdline
> > BOOT_IMAGE=/vmlinuz-5.4.0-148-generic root=/dev/mapper/vg00-rootvol ro
> > quiet libata.force=noncq pci=nomsi pcie_aspm=off pcie_ports=on
> > "dyndbg=file drivers/pci/* +p; file drivers/acpi/* +p"
> >
> >
> > 3. Observations:
> >
> > After rebooting with more debug logs, I noticed the issue after 1 day,
> > 11:48 hours.
> > A snippet of the dmesg log is mentioned below (complete dmesg log is
> > attached to this email):
> >
> > [128845.248503] ACPI: GPE event 0x01
> > [128845.356866] ACPI: \_SB_.PCI0.RP01: ACPI_NOTIFY_BUS_CHECK event
> > [128845.357343] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in
> > hotplug_event()
>
> If you could add more debug in hotplug_event() and the things it
> calls, we might get more clues about what's happening.
>
> > 4. BAR Reset Issue:
> >
> > I filtered the lspci output to show the contents of the configuration
> > space starting at offset 0x10 for getting BASE_ADDRESS_0 by running
> > sudo lspci -xxx -s 01:00.0 | grep "10:".
> > Prior to the BAR reset issue, the lspci output was:
> > $ sudo lspci -xxx -s 01:00.0 | grep "10:"
> > 10: 00 00 40 b0 00 00 00 00 00 00 00 00 00 00 00 00
> >
> > During the ACPI_NOTIFY_BUS_CHECK event, the lspci output initially
> > showed all FF's, and then the next run of the same command showed
> > BASE_ADDRESS_0 reset to zero:
> > $ sudo lspci -xxx -s 01:00.0 | grep "10:"
> > 10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>
> Looks like the device isn't responding at all here. Could happen if
> the device is reset or powered down.
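
The BAR0 readings described above (a valid address, all FF's, and zero)
can be told apart mechanically when polling the device from a script.
Below is a minimal Python sketch, not something taken from lspci or the
kernel. The names parse_config_row and classify_bar0 are hypothetical
helpers introduced only for illustration, and the code assumes a 32-bit
BAR and a single "10: ..." row as printed by lspci -xxx.

```python
def parse_config_row(line):
    """Split one `lspci -xxx` row such as '10: 00 00 40 b0 ...'
    into (offset, raw bytes). Hypothetical helper for illustration."""
    offset_text, _, hex_text = line.strip().partition(":")
    offset = int(offset_text, 16)
    data = bytes(int(byte, 16) for byte in hex_text.split())
    return offset, data


def classify_bar0(line):
    """Return (raw 32-bit BAR0 value, short description).
    Assumes the row starts at config-space offset 0x10."""
    offset, data = parse_config_row(line)
    if offset != 0x10 or len(data) < 4:
        raise ValueError("expected the config-space row at offset 0x10")

    # Config space is little-endian: '00 00 40 b0' is 0xb0400000.
    raw = int.from_bytes(data[:4], "little")

    if raw == 0xFFFFFFFF:
        # Reads return all ones when the device does not respond.
        return raw, "no response (all ones)"
    if raw == 0:
        return raw, "BAR0 cleared"
    if raw & 0x1:
        # Bit 0 set marks an I/O BAR; the low 2 bits are flags.
        return raw, "I/O BAR at 0x%x" % (raw & ~0x3)
    # Memory BAR: the low 4 bits are type/prefetch flags.
    return raw, "memory BAR at 0x%08x" % (raw & ~0xF)


if __name__ == "__main__":
    before = "10: 00 00 40 b0 00 00 00 00 00 00 00 00 00 00 00 00"
    during = "10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff"
    for row in (before, during):
        print(classify_bar0(row)[1])
```

The first row decodes to 0xb0400000, which matches the
"reg 0x10: [mem 0xb0400000-0xb07fffff]" line from the boot log. Logging
the classification with a timestamp next to the dmesg output might help
correlate the all-FF reads with the Bus Check events.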