On 10/04/2025 4:36 pm, Robin Murphy wrote:
On 09/04/2025 4:56 pm, Naresh Kamboju wrote:
On Wed, 2 Apr 2025 at 21:04, Robin Murphy <robin.murphy@xxxxxxx> wrote:
On 31/03/2025 5:03 am, Naresh Kamboju wrote:
Regression on arm64 Juno-r2 devices: the SSD detection tests fail on
Linux next and Linux mainline.
First seen on v6.14-7245-g5c2a430e8599
Good: v6.14
Bad: v6.14-7422-gacb4f33713b9
Sorry, I can't seem to reproduce this on my end, both today's mainline
and acb4f33713b9 with my config, and even acb4f33713b9 with the linked
LKFT config, all work OK on my Juno r2 (using a SATA SSD and PCIe
networking). The only thing which stands out in your log is that PCI
seems to give up probing and assigning resources beyond the switch
downstream ports (so SATA and ethernet are never discovered), whereas on
mine it does[2]. However that all happens before the first IOMMU
instance probes (which conveniently is the PCIe one), so it's hard to
imagine how that could have an effect anyway...
The only obvious difference is that I'm using EDK2 rather than U-Boot,
so that's done all the PCIe configuration once already, but it doesn't
seem like that's significant - looking back at a random older log[1],
the on-board endpoints were still being picked up right after
reconfiguring the switch, well before the IOMMU comes into the picture.
Since it is still an issue on mainline and next, I bisected it and
reverted the patch below. The revert causes kernel warnings at boot
time, but the SSD drive is found:
[bcb81ac6ae3c2ef95b44e7b54c3c9522364a245c]
iommu: Get DT/ACPI parsing into the proper probe path
pcieport 0000:00:00.0: late IOMMU probe at driver bind, something
fishy here!
WARNING: at drivers/iommu/iommu.c:559 __iommu_probe_device
I see boot warnings [1]
I am happy to test debug patches if you have any.
Seeing the warning after reverting the commit which introduced the
warning mostly just means the conflict resolution in the revert wasn't
right (there were some subsequent fixups...)
Anyway, I have now managed to get my Juno booting with the same antique
version of U-Boot and finally reproduce the issue. It seems to be
somehow connected to bus->dma_configure() being called in the
device_add() notifier (even though the rest of the IOMMU setup doesn't
run at that point since the driver hasn't registered yet), but how and
why that prevents the buses behind the switch downstream ports being
probed, and why *that* only happens when the switch isn't already
configured, remains a mystery so far. I'm still digging...
OK, I found it, but I'm still not sure what exactly to make of it - it's
the pci_request_acs() in of_iommu_configure(), now being called early
enough to actually have an effect. Booting with EDK2 already using PCI
prior to Linux, here's what I get for `sudo lspci -vv | grep ACSCtl`
with 6.15-rc1:
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+
EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+
EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+
EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+
EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+
EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+
EgressCtrl- DirectTrans-
whereas with the 6.14 behaviour they are all '-'. I don't have a working
root filesystem with the U-Boot setup, but if I boot it with
"pci=config_acs=000000@pci:0:0" then the kernel does assign the bridge
windows and discover the ethernet/SATA endpoints again. I can spend some
time getting NFS working next week, but if you're able to get lspci
output off a machine in the "broken" state easily that would be handy to
compare.
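
For comparing a capture from a "broken" boot against a working one, something like the following might save some squinting. It is only a rough sketch: it assumes the `ACSCtl:` flag format shown above (a name followed by '+' or '-'), tolerates entries wrapped over two lines as in this mail, and the function names are made up for illustration.

```python
import re

# One ACSCtl entry: the tag followed by whitespace-separated "Name+" / "Name-"
# tokens. \s+ also matches a newline, so mail-wrapped entries are handled.
ENTRY = re.compile(r"ACSCtl:((?:\s+\w+[+-])+)")
FLAG = re.compile(r"(\w+)([+-])")


def parse_acsctl(lspci_text):
    """Return one {flag_name: enabled} dict per ACSCtl entry found."""
    entries = []
    for match in ENTRY.finditer(lspci_text):
        flags = {name: sign == "+" for name, sign in FLAG.findall(match.group(1))}
        entries.append(flags)
    return entries


def enabled_flags(entry):
    """Sorted names of the flags that are set ('+') in one entry."""
    return sorted(name for name, on in entry.items() if on)


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python3 acsctl.py < lspci-vv.txt
    for index, entry in enumerate(parse_acsctl(sys.stdin.read())):
        print(index, " ".join(enabled_flags(entry)) or "(none enabled)")
```

Running it over the output of `sudo lspci -vv` from each boot and diffing the two results should show which ports differ, assuming lspci is usable on the broken setup at all.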
So at this point it would seem to be something about how Linux
configures ACS when doing it from scratch. What I don't really know is
where to go from there. I do know Juno's possibly a bit odd in that the
switch supports ACS, but both the root port and endpoints either side of
it don't. Could this be tickling some subtle bug in the PCI layer, and
what is EDK2 doing that makes it not happen?
Thanks,
Robin.