Re: [PATCH 00/11] Fix incorrect iommu_groups with PCIe switches

On Tue, 8 Jul 2025 17:47:15 -0300
Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:

> On Tue, Jul 01, 2025 at 03:48:26PM -0600, Alex Williamson wrote:
> 
> > Notably, each case where there's a dummy host bridge followed by some
> > number of additional functions (ie. 01.0, 02.0, 03.0, 08.0), that dummy
> > host bridge is tainting the function isolation and merging the group.
> > For instance each of these were previously a separate group and are now
> > combined into one group.  
> 
> I was able to run some testing on a Milan system that seems similar.
> 
> It has the weird "Dummy Host Bridge" MFD. I fixed it with this:
> 
> /*
>  * For some reason AMD likes to put "dummy functions" in their PCI hierarchy as
>  * part of a multi function device. These are notable because they can't do
>  * anything. No BARs and no downstream bus. Since they cannot accept P2P or
>  * initiate any MMIO we consider them to be isolated from the rest of MFD. Since
>  * they often accompany a real PCI bridge with downstream devices it is
>  * important that the MFD be isolated. Annoyingly there is no ACS capability
>  * reported, so we have to special case it.
>  */
> static bool pci_dummy_function(struct pci_dev *pdev)
> {
> 	if (pdev->class >> 8 == PCI_CLASS_BRIDGE_HOST && !pci_has_mmio(pdev))
> 		return true;
> 	return false;
> }

Yeah, that might work since it does report itself as a host bridge.
Probably noteworthy that you'd end up catching the Intel host bridge
with this too.
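
To make the matching concrete, here is a rough userspace sketch of that
check. struct model_dev, model_has_mmio() and model_dummy_function() are
hypothetical stand-ins, not kernel code, and I'm assuming pci_has_mmio()
means "has at least one memory BAR", which the quoted patch doesn't show.
Only the PCI_CLASS_BRIDGE_HOST value is taken from pci_ids.h.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Base class 0x06 (bridge), subclass 0x00 (host), as in pci_ids.h. */
#define PCI_CLASS_BRIDGE_HOST 0x0600

/* Hypothetical userspace stand-in for the fields of struct pci_dev used here. */
struct model_dev {
	uint32_t class;       /* 24-bit class code: base class, subclass, prog-if */
	size_t   nr_mem_bars; /* number of populated memory BARs */
};

/* Assumption: "has MMIO" means at least one memory BAR is present. */
static bool model_has_mmio(const struct model_dev *dev)
{
	return dev->nr_mem_bars != 0;
}

/* Mirrors the shape of the quoted pci_dummy_function(). */
static bool model_dummy_function(const struct model_dev *dev)
{
	/* Shift off the prog-if byte so only base class and subclass are compared. */
	return (dev->class >> 8) == PCI_CLASS_BRIDGE_HOST &&
	       !model_has_mmio(dev);
}
```

Nothing in the test looks at the vendor ID, so any function with class
code 0x0600xx and no MMIO matches, whoever made it.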
 
> This AMD system has second weirdness:
> 
> 40:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge (prog-if 00 [Normal decode])
>         Capabilities: [2a0 v1] Access Control Services
>                 ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans+
>                 ACSCtl: SrcValid- TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
> 40:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge (prog-if 00 [Normal decode])
>         Capabilities: [2a0 v1] Access Control Services
>                 ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans+
>                 ACSCtl: SrcValid- TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
> 40:01.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge (prog-if 00 [Normal decode])
>         Capabilities: [2a0 v1] Access Control Services
>                 ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans+
>                 ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
> 
> Notice the SrcValid- 
> 
> The kernel definitely set SrcValid+, the device stored it, and it
> never set SrcValid-, yet somehow it got changed:
> 
> [    0.483828] pci 0000:40:01.1: pci_enable_acs:1089
> [    0.483828] pci 0000:40:01.1: pci_write_config_word:604 9 678 = 1d
> [    0.483831] pci 0000:40:01.1: ACS Set to 1d, readback=1d
> [..]
> [    0.826514] pci 0000:40:01.1: __pci_device_group:1635 Starting
> [    0.826517] pci 0000:40:01.1: pci_acs_flags_enabled:3668   ctrl=1c acs_flags=1d cap=5f
> 
> I instrumented pci_write_config_word() and it isn't being called a
> second time. I didn't try to narrow this down, too weird. Guessing
> ACPI or FW?
> 
> So the new logic puts all of the above and everything downstream into
> one group due to insufficient isolation. That is the only degradation
> on this system: the LOM ethernet gets grouped together with the above
> MFD.
> 
> Given in this case we explicitly have ACS flags we consider
> non-isolated I'm not sure there is anything to be done about it.
> 
> Which raises a question if SrcValid should be part of grouping or not,
> it is more of a security enhancement, it doesn't permit/deny P2P
> between devices?

Strange issue.  If a device can spoof a RID then it can theoretically
inject a DMA payload as if it were another device.  That seems like
basic security, not just an enhancement.
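
For reference, the ctrl=1c vs acs_flags=1d mismatch in the quoted log
comes down to that one SrcValid bit. Below is a simplified userspace
sketch of the comparison, loosely modeled on what pci_acs_flags_enabled()
does; the real function differs in details (it reads config space and
treats egress control specially). The bit values are the PCI_ACS_*
definitions from pci_regs.h; model_acs_enabled() and model_acs_missing()
are names made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* ACS capability/control bits, matching the PCI_ACS_* values in pci_regs.h. */
#define PCI_ACS_SV 0x0001 /* Source Validation */
#define PCI_ACS_TB 0x0002 /* Translation Blocking */
#define PCI_ACS_RR 0x0004 /* P2P Request Redirect */
#define PCI_ACS_CR 0x0008 /* P2P Completion Redirect */
#define PCI_ACS_UF 0x0010 /* Upstream Forwarding */
#define PCI_ACS_EC 0x0020 /* P2P Egress Control */
#define PCI_ACS_DT 0x0040 /* Direct Translated P2P */

/* Requested flags that the capability implements but the control register lacks. */
static uint16_t model_acs_missing(uint16_t cap, uint16_t ctrl, uint16_t acs_flags)
{
	return (uint16_t)((acs_flags & cap) & ~ctrl);
}

/* True when every requested, implemented flag is set in the control register. */
static bool model_acs_enabled(uint16_t cap, uint16_t ctrl, uint16_t acs_flags)
{
	return model_acs_missing(cap, ctrl, acs_flags) == 0;
}
```

With the values from the log (cap=5f, ctrl=1c, acs_flags=1d) the only
missing bit is PCI_ACS_SV, so the check fails; with the original
readback of 1d it would have passed.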
 
> > The endpoints result in equivalent grouping, but this is a case where I
> > don't understand how we have non-isolated functions yet isolated
> > subordinate buses.  
> 
> And I fixed this too, as above is showing, by marking the group of the
> MFD as non-isolated, thus forcing it to propagate downstream.
> 
> > An Alder Lake system shows something similar:  
> 
> I also tested a bunch of Intel client systems. Some with an ACS quirk
> and one with the VMD/non transparent bridge setup. Those had no
> grouping changes, but there was no Raptor Lake in this group.
> 
> > # lspci -vvvs 1c. | grep -e ^0 -e "Access Control Services"
> > 00:1c.0 PCI bridge: Intel Corporation Raptor Lake PCI Express Root Port #1 (rev 11) (prog-if 00 [Normal decode])
> > 00:1c.1 PCI bridge: Intel Corporation Device 7a39 (rev 11) (prog-if 00 [Normal decode])
> > 	Capabilities: [220 v1] Access Control Services
> > 00:1c.2 PCI bridge: Intel Corporation Raptor Point-S PCH - PCI Express Root Port 3 (rev 11) (prog-if 00 [Normal decode])
> > 	Capabilities: [220 v1] Access Control Services
> > 00:1c.3 PCI bridge: Intel Corporation Raptor Lake PCI Express Root Port #4 (rev 11) (prog-if 00 [Normal decode])
> > 	Capabilities: [220 v1] Access Control Services
> > 
> > So again the group is tainted by a device that cannot generate DMA,   
> 
> It looks like 00:1c.0 is advertised as a root port, so it can generate
> DMA as part of its root port function bridging to something outside
> the root complex.
> 
> This system doesn't seem to have anything downstream of that root port
> (currently plugged in?), but IMHO that port should have ACS. By spec I
> think it is correct to assume that without ACS traffic from downstream
> of the root port would be able to follow the internal loopback of the
> MFD.
> 
> This will probably need a quirk, and it is different from the AMD case
> which used a host bridge.
> 
> Any other idea?

This root port at 1c.0 does look like it could have a subordinate
device, but there is no unpopulated slot/socket on this motherboard.
Possibly another motherboard SKU could use these links for a wifi card.
Unlike the other root ports, it lacks routing of its interrupt line.  It
does have a secondary bus and apertures assigned, a PCIe capability that
claims it has a slot (LnkCap 8GT/s x1), an MSI capability and a NULL
capability, but no extended config space.  I'm not sure if this vendor
(Gigabyte) is unique in incompletely stubbing out this link or if this
is par for the course.

Again, it should be perfectly safe to assign things downstream of the
ACS isolated root ports in the MFD to userspace drivers, their egress
DMA is isolated.  It would only be ingress from an endpoint that seems
like it cannot exist that would be troublesome.  I don't have a good
solution.  Thanks,

Alex




