Ilpo Järvinen <ilpo.jarvinen@xxxxxxxxxxxxxxx> writes: > Resetting resource is problematic as it prevent attempting to allocate > the resource later, unless something in between restores the resource. > Similarly, if fail_head does not contain all resources that were reset, > those resource cannot be restored later. > > The entire reset/restore cycle adds complexity and leaving resources > into reseted state causes issues to other code such as for checks done > in pci_enable_resources(). Take a small step towards not resetting > resources by delaying reset until the end of resource assignment and > build failure list (fail_head) in sync with the reset to avoid leaving > behind resources that cannot be restored (for the case where the caller > provides fail_head in the first place to allow restore somewhere in the > callchain, as is not all callers pass non-NULL fail_head). > > The Expansion ROM check is temporarily left in place while building the > failure list until the upcoming change which reworks optional resource > handling. > > Ideally, whole resource reset could be removed but doing that in a big > step would make the impact non-tractable due to complexity of all > related code. > > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@xxxxxxxxxxxxxxx> Hi Ilpo, I'm seeing a crash on arm64 at boot that I bisected to this change. I don't think it's the same as the other issues reported. I've confirmed the crash is still there after your follow up patches. The crash itself is below[1]. It looks like the problem begins when: amdgpu_device_resize_fb_bar() pci_resize_resource() pci_reassign_bridge_resources() __pci_bus_size_bridges() adds pci_hotplug_io_size to `realloc_head`. The io resource allocation has failed earlier because the root port doesn't have an io window[2]. Then with this patch, pci_reassign_bridge_resources()'s call to __pci_bridge_assign_resources() now returns the io added space for hotplug in the `failed` list where the old code dropped it and did not. That sends pci_reassign_bridge_resources() into the `cleanup:` path, where I think the cleanup code doesn't properly release the resources that were assigned by __pci_bridge_assign_resources() and there's a conflict reported in pci_claim_resource() where a restored resource is found as conflicting with itself: > pcieport 000d:00:01.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: can't claim; address conflict with PCI Bus 000d:01 [mem 0x340000000000-0x340017ffffff 64bit pref] Setting `pci=hpiosize=0` avoids this crash, as does this change: diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c index 16d5d390599a..59ece11702da 100644 --- a/drivers/pci/setup-bus.c +++ b/drivers/pci/setup-bus.c @@ -2442,7 +2442,7 @@ int pci_reassign_bridge_resources(struct pci_dev *bridge, unsigned long type) LIST_HEAD(saved); LIST_HEAD(added); LIST_HEAD(failed); - unsigned int i; + unsigned int i, relevant_fails; int ret; down_read(&pci_bus_sem); @@ -2490,7 +2490,16 @@ int pci_reassign_bridge_resources(struct pci_dev *bridge, unsigned long type) __pci_bridge_assign_resources(bridge, &added, &failed); BUG_ON(!list_empty(&added)); - if (!list_empty(&failed)) { + relevant_fails = 0; + list_for_each_entry(dev_res, &failed, list) { + restore_dev_resource(dev_res); + if (((dev_res->res->flags ^ type) & PCI_RES_TYPE_MASK) == 0) + relevant_fails++; + } + free_list(&failed); + + /* Cleanup if we had failures in resources of interest */ + if (relevant_fails != 0) { ret = -ENOSPC; goto cleanup; } @@ -2509,11 +2518,6 @@ int pci_reassign_bridge_resources(struct pci_dev *bridge, unsigned long type) return 0; cleanup: - /* Restore size and flags */ - list_for_each_entry(dev_res, &failed, list) - restore_dev_resource(dev_res); - free_list(&failed); - /* Revert to the old configuration */ list_for_each_entry(dev_res, &saved, list) { struct resource *res = dev_res->res; I don't know this code well enough to know if that changes is completely bonkers or what. Maybe a change that gets zero as pci_hotplug_io_size for root ports that don't have an io window would be better? Any other ideas, or other information about the crash that I could provide? Thanks, Scott [1]: Crash: > SError Interrupt on CPU0, code 0x00000000be000411 -- SError > CPU: 0 UID: 0 PID: 9 Comm: kworker/0:0 Not tainted 6.15.2+ #16 PREEMPT(undef) > Hardware name: Adlink Ampere Altra Developer Platform/Ampere Altra Developer Platform, BIOS TianoCore 2.09.100.00 (SYS: 2.10.20221028) 04/30/2 > Workqueue: events work_for_cpu_fn > pstate: 204000c9 (nzCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--) > pc : _raw_spin_lock_irqsave+0x34/0xb0 > lr : __wake_up+0x30/0x80 > sp : ffff800080003e20 > x29: ffff800080003e20 x28: ffff07ff808b0000 x27: 0000000000050260 > x26: 0000000000000001 x25: ffffcc5decad6bd8 x24: ffffcc5decc8b5b8 > x23: ffff080138a10000 x22: ffff080138a00000 x21: 0000000000000003 > x20: ffff080138a14dc8 x19: 0000000000000000 x18: 006808018e775f03 > x17: ffff3bc1330d2000 x16: ffffcc5e3a520ed8 x15: 8daad055c6e77021 > x14: f231631cf9328575 x13: 0e51168d06a6cf91 x12: f5db8c23b764520c > x11: 0000000000000040 x10: 0000000000000000 x9 : ffffcc5e3a520f08 > x8 : 0000000000002113 x7 : 0000000000000000 x6 : 0000000000000000 > x5 : 0000000000000000 x4 : ffff800080003e08 x3 : 00000000000000c0 > x2 : 0000000000000001 x1 : 0000000000000000 x0 : ffff080138a14dc8 > Kernel panic - not syncing: Asynchronous SError Interrupt > CPU: 0 UID: 0 PID: 9 Comm: kworker/0:0 Not tainted 6.15.2+ #16 PREEMPT(undef) > Hardware name: Adlink Ampere Altra Developer Platform/Ampere Altra Developer Platform, BIOS TianoCore 2.09.100.00 (SYS: 2.10.20221028) 04/30/2 > Workqueue: events work_for_cpu_fn > Call trace: > show_stack+0x30/0x98 (C) > dump_stack_lvl+0x7c/0xa0 > dump_stack+0x18/0x2c > panic+0x184/0x3c0 > nmi_panic+0x90/0xa0 > arm64_serror_panic+0x6c/0x90 > do_serror+0x58/0x60 > el1h_64_error_handler+0x38/0x60 > el1h_64_error+0x84/0x88 > _raw_spin_lock_irqsave+0x34/0xb0 (P) > amdgpu_ih_process+0xf0/0x150 [amdgpu] > amdgpu_irq_handler+0x34/0xa0 [amdgpu] > __handle_irq_event_percpu+0x60/0x248 > handle_irq_event+0x4c/0xc0 > handle_fasteoi_irq+0xa0/0x1c8 > handle_irq_desc+0x3c/0x68 > generic_handle_domain_irq+0x24/0x40 > __gic_handle_irq_from_irqson.isra.0+0x15c/0x260 > gic_handle_irq+0x28/0x80 > call_on_irq_stack+0x24/0x30 > do_interrupt_handler+0x88/0xa0 > el1_interrupt+0x44/0xd0 > el1h_64_irq_handler+0x18/0x28 > el1h_64_irq+0x84/0x88 > amdgpu_device_rreg.part.0+0x4c/0x190 [amdgpu] (P) > amdgpu_device_rreg+0x24/0x40 [amdgpu] > psp_wait_for+0x88/0xd8 [amdgpu] > psp_v11_0_bootloader_load_component+0x164/0x1b0 [amdgpu] > psp_v11_0_bootloader_load_kdb+0x20/0x40 [amdgpu] > psp_hw_start+0x5c/0x580 [amdgpu] > psp_load_fw+0x9c/0x280 [amdgpu] > psp_hw_init+0x44/0xa0 [amdgpu] > amdgpu_device_fw_loading+0xf8/0x358 [amdgpu] > amdgpu_device_ip_init+0x684/0x990 [amdgpu] > amdgpu_device_init+0xba4/0x1038 [amdgpu] > amdgpu_driver_load_kms+0x20/0xb8 [amdgpu] > amdgpu_pci_probe+0x1f8/0x7f8 [amdgpu] > local_pci_probe+0x44/0xb0 > work_for_cpu_fn+0x24/0x40 > process_one_work+0x17c/0x410 > worker_thread+0x254/0x388 > kthread+0x10c/0x128 > ret_from_fork+0x10/0x20 > Kernel Offset: 0x4c5dba380000 from 0xffff800080000000 > PHYS_OFFSET: 0x80000000 > CPU features: 0x0800,000042e0,01202650,8241720b > Memory Limit: none > ---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]--- [2]: boot snippet > ACPI: PCI Root Bridge [PCI1] (domain 000d [bus 00-ff]) > acpi PNP0A08:01: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI EDR HPX-Type3] > acpi PNP0A08:01: _OSC: platform does not support [PCIeHotplug PME LTR DPC] > acpi PNP0A08:01: _OSC: OS now controls [PCIeCapability] > acpi PNP0A08:01: FADT indicates ASPM is unsupported, using BIOS configuration > acpi PNP0A08:01: MCFG quirk: ECAM at [mem 0x37fff0000000-0x37ffffffffff] for [bus 00-ff] with pci_32b_read_ops > acpi PNP0A08:01: ECAM area [mem 0x37fff0000000-0x37ffffffffff] reserved by PNP0C02:00 > acpi PNP0A08:01: ECAM at [mem 0x37fff0000000-0x37ffffffffff] for [bus 00-ff] > PCI host bridge to bus 000d:00 > pci_bus 000d:00: root bus resource [mem 0x50000000-0x5fffffff window] > pci_bus 000d:00: root bus resource [mem 0x340000000000-0x37ffdfffffff window] > pci_bus 000d:00: root bus resource [bus 00-ff] > pci 000d:00:00.0: [1def:e100] type 00 class 0x060000 conventional PCI endpoint > pci 000d:00:01.0: [1def:e101] type 01 class 0x060400 PCIe Root Port > pci 000d:00:01.0: PCI bridge to [bus 01-03] > pci 000d:00:01.0: bridge window [io 0xe000-0xefff] > pci 000d:00:01.0: bridge window [mem 0x50000000-0x502fffff] > pci 000d:00:01.0: bridge window [mem 0x340000000000-0x3400101fffff 64bit pref] > pci 000d:00:01.0: supports D1 D2 > pci 000d:00:01.0: PME# supported from D0 D1 D3hot > ... > pci 000d:00:01.0: bridge window [mem 0x340000000000-0x340017ffffff 64bit pref]: assigned > pci 000d:00:01.0: bridge window [mem 0x50000000-0x502fffff]: assigned > pci 000d:00:01.0: bridge window [io size 0x1000]: can't assign; no space > pci 000d:00:01.0: bridge window [io size 0x1000]: failed to assign > pci 000d:00:01.0: bridge window [io size 0x1000]: can't assign; no space > pci 000d:00:01.0: bridge window [io size 0x1000]: failed to assign