Re: [PATCH 00/11] Fix incorrect iommu_groups with PCIe switches

On Tue, Jul 01, 2025 at 03:48:26PM -0600, Alex Williamson wrote:
> Testing on some systems here...
> 
> I have an AMD system:

I have to admit I don't really like lspci -t, mostly because the man
page doesn't describe what the notation '-[0000:00]- +-01.1-[01-03]' means
(I guess it is the subordinate bus range), and it drops any labeling
of the interior switch devices.
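For reading these trees, here is a minimal Python sketch that pulls the bridge hops out of one line of lspci -t output. It assumes my guess above is right, i.e. that "DD.F-[SS-EE]" means bridge DD.F forwards buses SS through EE (secondary to subordinate). The helper name bridge_bus_ranges() is made up for illustration and is not part of pciutils.

```python
import re

# Assumption: "DD.F-[SS-EE]" is a bridge at device.function DD.F whose
# secondary..subordinate bus range is SS..EE; "[SS]" is a single bus.
BRIDGE_RE = re.compile(
    r"([0-9a-f]{2}\.[0-7])-\[([0-9a-f]{2})(?:-([0-9a-f]{2}))?\]"
)

def bridge_bus_ranges(line):
    """Return (devfn, first_bus, last_bus) for each bridge hop on a line."""
    hops = []
    for devfn, first, last in BRIDGE_RE.findall(line):
        first_bus = int(first, 16)
        # A missing upper bound means the bridge forwards exactly one bus.
        last_bus = int(last, 16) if last else first_bus
        hops.append((devfn, first_bus, last_bus))
    return hops

line = "+-01.1-[01-03]----00.0-[02-03]----00.0-[03]--+-00.0  Navi 10"
for hop in bridge_bus_ranges(line):
    print(hop)
```

This only recovers the bus numbering; it still can't say what the interior switch ports are, which is the part lspci -t drops.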

I've found lspci -PP to be a lot easier to follow for this work:

 00:06.0 PCI bridge: Intel Corporation 12th Gen Core Processor PCI Express x4 Controller #0 (rev 02)
 00:06.0/02:00.0 Non-Volatile memory controller: SK hynix Platinum P41/PC801 NVMe Solid State Drive

However, it is curious that the path doesn't include

 00:00.0 Host bridge: Intel Corporation 12th Gen Core Processor Host Bridge/DRAM Registers (rev 02)

but lspci -t does..

> # lspci -tv
> -[0000:00]-+-00.0  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Root Complex
>            +-00.2  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge IOMMU
>            +-01.0  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge
>            +-01.1-[01-03]----00.0-[02-03]----00.0-[03]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon Pro W5700]
>            |                                            +-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio
>            |                                            +-00.2  Advanced Micro Devices, Inc. [AMD/ATI] Device 7316
>            |                                            \-00.3  Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 USB
>            +-01.2-[04]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
>            +-02.0  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge
>            +-02.1-[05]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller PM9C1a (DRAM-less)
>            +-02.2-[06-0b]----00.0-[07-0b]--+-01.0-[08]--+-00.0  MosChip Semiconductor Technology Ltd. MCS9922 PCIe Multi-I/O Controller
>            |                               |            \-00.1  MosChip Semiconductor Technology Ltd. MCS9922 PCIe Multi-I/O Controller
>            |                               +-02.0-[09-0a]--+-00.0  Intel Corporation 82576 Gigabit Network Connection
>            |                               |               \-00.1  Intel Corporation 82576 Gigabit Network Connection
>            |                               \-03.0-[0b]----00.0  Fresco Logic FL1100 USB 3.0 Host Controller
>            +-03.0  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge
>            +-03.1-[0c]----00.0  JMicron Technology Corp. JMB58x AHCI SATA controller
>            +-03.2-[0d]----00.0  Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller
>            +-04.0  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge
>            +-08.0  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge
>            +-08.1-[0e]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Raphael
>            |            +-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Radeon High Definition Audio Controller [Rembrandt/Strix]
>            |            +-00.2  Advanced Micro Devices, Inc. [AMD] Family 19h PSP/CCP
>            |            +-00.3  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge USB 3.1 xHCI
>            |            +-00.4  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge USB 3.1 xHCI
>            |            \-00.6  Advanced Micro Devices, Inc. [AMD] Family 17h/19h/1ah HD Audio Controller
>            +-08.3-[0f]----00.0  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge USB 2.0 xHCI
>            +-14.0  Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller
>            +-14.3  Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge
>            +-18.0  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 0
>            +-18.1  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 1
>            +-18.2  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 2
>            +-18.3  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 3
>            +-18.4  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 4
>            +-18.5  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 5
>            +-18.6  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 6
>            \-18.7  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 7
> 
> Notably, each case where there's a dummy host bridge followed by some
> number of additional functions (ie. 01.0, 02.0, 03.0, 08.0), that dummy
> host bridge is tainting the function isolation and merging the group.
> For instance each of these were previously a separate group and are now
> combined into one group.

Okay.. So what is this topology trying to represent and what should we
be doing in Linux here for groups?

I note that the spec left ACS flags for root ports as implementation
specific.. So I have no idea what this actually is trying to tell the
OS :\

> # lspci -vvvs 00:01. [manually edited]
> 00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge
> 
> 00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge GPP Bridge (prog-if 00 [Normal decode])
> 	Capabilities: [58] Express (v2) Root Port (Slot+), IntMsgNum 0
> 	Capabilities: [2a0 v1] Access Control Services
> 		ACSCap:	SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans+
> 		ACSCtl:	SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
> 
> 00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge GPP Bridge (prog-if 00 [Normal decode])
> 	Capabilities: [58] Express (v2) Root Port (Slot+), IntMsgNum 0
> 	Capabilities: [2a0 v1] Access Control Services
> 		ACSCap:	SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans+
> 		ACSCtl:	SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
> 
> The endpoints result in equivalent grouping, but this is a case where I
> don't understand how we have non-isolated functions yet isolated
> subordinate buses.

Sorry, I'm not sure I followed exactly, let me repeat what I think:

The new code is putting 00:01.0, 00:01.1, 00:01.2 in a group because
it is a MFD and not all functions in the MFD have ACS? This does
sound correct? I would have expected the original code to do this
also? Why does it avoid it?

But then you mean 04:00.0 "Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983"
gets its own group?

I think this happens because the MFD code in pci_get_alias_group()
joins all functions together but does not set BUS_DATA_PCI_UNISOLATED
within the group. So the downstreams of the bridge remain isolated,
and the bus 00 was never NON_ISOLATED.

This does not seem right. Probably pci_get_alias_group() should be
setting BUS_DATA_PCI_UNISOLATED if the ACS is not isolated in the
function. I did not even slightly think about how a bridge USP on a
MFD would even work :\
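To make the suspected miss concrete, here is a toy Python model, not the kernel code. It assumes my reading above is right: the MFD functions get merged into one group, but nothing marks the buses below the bridge functions as unisolated. The Function class and the propagate switch are invented for this sketch; propagate=False stands for what I think happens today and propagate=True for the proposed change.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Function:
    name: str
    acs_isolated: bool                   # ACS isolation flags enabled?
    secondary_bus: Optional[int] = None  # set only for bridge functions

def mfd_is_isolated(functions):
    """An MFD only counts as isolated if every function has ACS isolation."""
    return all(f.acs_isolated for f in functions)

def unisolated_buses(functions, propagate):
    """Secondary buses that would be marked unisolated for this MFD."""
    if mfd_is_isolated(functions) or not propagate:
        return set()
    return {f.secondary_bus for f in functions if f.secondary_bus is not None}

# Values taken from the AMD listing: 00:01.0 has no ACS capability,
# 00:01.1 and 00:01.2 are root ports leading to buses 01 and 04.
slot_01 = [
    Function("00:01.0", acs_isolated=False),
    Function("00:01.1", acs_isolated=True, secondary_bus=0x01),
    Function("00:01.2", acs_isolated=True, secondary_bus=0x04),
]
print(unisolated_buses(slot_01, propagate=False))
print(unisolated_buses(slot_01, propagate=True))
```

With propagation on, bus 04 stops being isolated, so 04:00.0 would no longer get its own group. That is exactly the behavior change that needs a decision.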

> An Alder Lake system shows something similar:
> 
> # lspci -tv
> -[0000:00]-+-00.0  Intel Corporation 12th Gen Core Processor Host Bridge
>            +-01.0-[01-02]----00.0-[02]--
>            +-02.0  Intel Corporation Alder Lake-S GT1 [UHD Graphics 770]
>            +-04.0  Intel Corporation Alder Lake Innovation Platform Framework Processor Participant
>            +-06.0-[03]----00.0  Sandisk Corp SanDisk Ultra 3D / WD PC SN530, IX SN530, Blue SN550 NVMe SSD (DRAM-less)
>            +-08.0  Intel Corporation 12th Gen Core Processor Gaussian & Neural Accelerator
>            +-14.0  Intel Corporation Raptor Lake USB 3.2 Gen 2x2 (20 Gb/s) XHCI Host Controller
>            +-14.2  Intel Corporation Raptor Lake-S PCH Shared SRAM
>            +-15.0  Intel Corporation Raptor Lake Serial IO I2C Host Controller #0
>            +-15.1  Intel Corporation Raptor Lake Serial IO I2C Host Controller #1
>            +-15.2  Intel Corporation Raptor Lake Serial IO I2C Host Controller #2
>            +-15.3  Intel Corporation Device 7a4f
>            +-16.0  Intel Corporation Raptor Lake CSME HECI #1
>            +-17.0  Intel Corporation Raptor Lake SATA AHCI Controller
>            +-19.0  Intel Corporation Device 7a7c
>            +-19.1  Intel Corporation Device 7a7d
>            +-1a.0-[04]----00.0  Sandisk Corp SanDisk Ultra 3D / WD PC SN530, IX SN530, Blue SN550 NVMe SSD (DRAM-less)
>            +-1c.0-[05]--
>            +-1c.1-[06]----00.0  Fresco Logic FL1100 USB 3.0 Host Controller
>            +-1c.2-[07]----00.0  Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller
>            +-1c.3-[08-0c]----00.0-[09-0c]--+-01.0-[0a]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller
>            |                               +-02.0-[0b]--
>            |                               \-03.0-[0c]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller
>            +-1f.0  Intel Corporation Device 7a06
>            +-1f.3  Intel Corporation Raptor Lake High Definition Audio Controller
>            +-1f.4  Intel Corporation Raptor Lake-S PCH SMBus Controller
>            \-1f.5  Intel Corporation Raptor Lake SPI (flash) Controller
> 
> 00:1c. are all grouped together.  Here 1c.0 does not report ACS, but
> the other root ports do:
> 
> # lspci -vvvs 1c. | grep -e ^0 -e "Access Control Services"
> 00:1c.0 PCI bridge: Intel Corporation Raptor Lake PCI Express Root Port #1 (rev 11) (prog-if 00 [Normal decode])

So this is a PCI bridge, not a host bridge like AMD..
What are the PCI types for this? Is it a root port?

> 00:1c.1 PCI bridge: Intel Corporation Device 7a39 (rev 11) (prog-if 00 [Normal decode])
> 	Capabilities: [220 v1] Access Control Services
> 00:1c.2 PCI bridge: Intel Corporation Raptor Point-S PCH - PCI Express Root Port 3 (rev 11) (prog-if 00 [Normal decode])
> 	Capabilities: [220 v1] Access Control Services
> 00:1c.3 PCI bridge: Intel Corporation Raptor Lake PCI Express Root Port #4 (rev 11) (prog-if 00 [Normal decode])
> 	Capabilities: [220 v1] Access Control Services

Same question, are these all root ports?

My desktop has:

00:06.0 PCI bridge: Intel Corporation 12th Gen Core Processor PCI Express x4 Controller #0 (rev 02) (prog-if 00 [Normal decode])
        Capabilities: [40] Express (v2) Root Port (Slot+), MSI 00

So maybe yes?

> So again the group is tainted by a device that cannot generate DMA, the
> endpoint grouping remains equivalent, but isolated buses downstream of
> this non-isolated group doesn't seem to make sense.
> 
> I'll try to generate further interesting configs.  Thanks,

Thanks a lot, this is very different from my ARM systems here.

I really am at a bit of a loss what Linux should do here.. Your point
about a "device that cannot generate DMA" makes a lot of sense. I've
thought the same way about the DSPs too.

Another thought - should we draw a line across the root ports and
assume that if a TLP reaches a root port/bridge that it goes to the
IOMMU? Essentially we don't let the BUS_DATA_PCI_UNISOLATED of the
bus->self propagate if bus->self is a root port?

This would allow fixing the bridge MFD miss and still generate the
groupings you see here in a deliberate way.
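A rough Python sketch of the root port cut-off idea, as a model only. The function propagate_unisolated() and its arguments are hypothetical. It deliberately ignores the ACS state of the switch ports below the root port and only shows where the inherited flag would be dropped.

```python
def propagate_unisolated(path, upstream_unisolated):
    """Walk bridges from bus 00 downward.

    path: list of (bridge_name, is_root_port) tuples, topmost bridge first.
    upstream_unisolated: whether the bus above the first bridge is unisolated.
    Returns (bridge_name, flag) pairs, where flag is what each bridge's
    secondary bus would inherit from above.
    """
    result = []
    flag = upstream_unisolated
    for name, is_root_port in path:
        if is_root_port:
            # Assumption under discussion: a TLP that reaches a root port
            # goes to the IOMMU, so upstream non-isolation stops here.
            flag = False
        result.append((name, flag))
    return result

# Bus 00 tainted by an MFD without full ACS, then root port 00:01.1 and
# the two switch ports below it from the AMD listing.
path = [("00:01.1", True), ("01:00.0", False), ("02:00.0", False)]
print(propagate_unisolated(path, upstream_unisolated=True))
```

With the cut-off the endpoints under 00:01.1 keep their isolated buses even though 00:01.x sits in a merged group, which matches the groupings reported above.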

Alternatively, should we try to identify these "no DMA" devices and
then improve the various ACS calculations? A device with no MMIO, no
IO, and no downstream can reasonably be considered to have no
DMA. Will that describe the two cases you saw that "spoiled" the group?
We can detect that and enhance the ACS function to report that they
have ACS RR/etc enabled.

I was thinking about this already in terms of the DSPs not really
needing to be grouped with their downstreams if they don't have MMIO
and can't initiate DMA.
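The "no DMA" test could look something like this Python sketch. The PciFunction class, its fields and looks_dma_incapable() are invented for illustration. The BAR counts in the example are assumed values, since the lspci output in this thread does not show the BARs of the dummy host bridge.

```python
from dataclasses import dataclass

@dataclass
class PciFunction:
    name: str
    mmio_bars: int        # number of memory BARs the function implements
    io_bars: int          # number of I/O BARs the function implements
    has_downstream: bool  # True for bridges with a secondary bus

def looks_dma_incapable(fn):
    """Heuristic from the text: no MMIO, no IO and nothing downstream."""
    return fn.mmio_bars == 0 and fn.io_bars == 0 and not fn.has_downstream

# Assumed values, to be checked against real hardware.
dummy_bridge = PciFunction("00:01.0", mmio_bars=0, io_bars=0, has_downstream=False)
root_port = PciFunction("00:01.1", mmio_bars=0, io_bars=0, has_downstream=True)

print(looks_dma_incapable(dummy_bridge))
print(looks_dma_incapable(root_port))
```

A function that passes this check would be treated as if it had ACS RR/etc enabled, so it would no longer spoil the group of its sibling functions.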

To summarize:
 1) We are getting acceptable groupings for the downstream devices. So
    a lot is working well
 2) The root complex integrated devices, upstream of root ports, are
    not working the same way
 3) There is a miss of MFD ACS propagation on bridge/switch USPs

Also, I updated the github with an extra patch that has the debugging
I've been using. It may be helpful. It shows bus by bus what the
isolation is and various other decision points.

Thanks,
Jason



