[PATCH v3 06/11] iommu: Compute iommu_groups properly for PCIe MFDs

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Like with switches the current MFD algorithm does not consider asymmetric
ACS within a MFD. If any MFD function has ACS that permits P2P the spec
says it can reach through the MFD internal loopback any other function in
the device.

For discussion let's consider a simple MFD topology like the below:

                      -- MFD 00:1f.0 ACS != REQ_ACS_FLAGS
      Root 00:00.00 --|- MFD 00:1f.2 ACS != REQ_ACS_FLAGS
                      |- MFD 00:1f.6 ACS = REQ_ACS_FLAGS

This asymmetric ACS could be created using the config_acs kernel command
line parameter, from quirks, or from a poorly thought out device that has
ACS flags only on some functions.

Since ACS is an egress property the asymmetric flags allow for 00:1f.0 to
do memory acesses into 00:1f.6's BARs, but 00:1f.6 cannot reach any other
function. Thus we expect an iommu_group to contain all three
devices. Instead the current algorithm gives a group of [1f.0, 1f.2] and a
single device group of 1f.6.

The current algorithm sees the good ACS flags on 00:1f.6 and does not
consider ACS on any other MFD functions.

For path properties the ACS flags say that 00:1f.6 is safe to use with
PASID and supports SVA as it will not have any portions of its address
space routed away from the IOMMU, this part of the ACS system is working
correctly.

Further, if one of the MFD functions is a bridge, eg like 1f.2:

                      -- MFD 00:1f.0
      Root 00:00.00 --|- MFD 00:1f.2 Root Port --- 01:01.0
                      |- MFD 00:1f.6

Then the correct grouping will include 01:01.0, 00:1f.0/2/6 together in a
group if there is any internal loopback within the MFD 00:1f. The current
algorithm does not understand this and gives 01:01.0 it's own group even
if it thinks there is an internal loopback in the MFD.

Unfortunately this detail makes it hard to fix. Currently the code assumes
that any MFD without an ACS cap has an internal loopback which will cause
a large number of modern real systems to group in a pessimistic way.

However, the PCI spec does not really support this:

   PCI Section 6.12.1.2 ACS Functions in SR-IOV, SIOV, and Multi-Function
   Devices

    ACS P2P Request Redirect: must be implemented by Functions that
    support peer-to-peer traffic with other Functions.

Meaning from a spec perspective the absence of ACS indicates the absence
of internal loopback. Granted I think we are aware of older real devices
that ignore this, but it seems to be the only way forward.

So, rely on 6.12.1.2 and assume functions without ACS do not have internal
loopback. This resolves the common issue with modern systems and MFD root
ports, but it makes the ACS quirks system less used. Instead we'd want
quirks that say self-loopback is actually present, not like today's quirks
that say it is absent. This is surely negative for older hardware, but
positive for new HW that complies with the spec.

Use pci_reachable_set() in pci_device_group() to make the resulting
algorithm faster and easier to understand.

Add pci_mfds_are_same_group() which specifically looks pair-wise at all
functions in the MFDs. Any function with ACS capabilities and non-isolated
aCS flags will become reachable to all other functions.

pci_reachable_set() does the calculations for figuring out the set of
devices under the pci_bus_sem, which is better than repeatedly searching
across all PCI devices.

Once the set of devices is determined and the set has more than one device
use pci_get_slot() to search for any existing groups in the reachable set.

Fixes: 104a1c13ac66 ("iommu/core: Create central IOMMU group lookup/creation interface")
Signed-off-by: Jason Gunthorpe <jgg@xxxxxxxxxx>
---
 drivers/iommu/iommu.c | 189 +++++++++++++++++++-----------------------
 1 file changed, 87 insertions(+), 102 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 543d6347c0e5e3..fc3c71b243a850 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1413,85 +1413,6 @@ int iommu_group_id(struct iommu_group *group)
 }
 EXPORT_SYMBOL_GPL(iommu_group_id);
 
-static struct iommu_group *get_pci_alias_group(struct pci_dev *pdev,
-					       unsigned long *devfns);
-
-/*
- * For multifunction devices which are not isolated from each other, find
- * all the other non-isolated functions and look for existing groups.  For
- * each function, we also need to look for aliases to or from other devices
- * that may already have a group.
- */
-static struct iommu_group *get_pci_function_alias_group(struct pci_dev *pdev,
-							unsigned long *devfns)
-{
-	struct pci_dev *tmp = NULL;
-	struct iommu_group *group;
-
-	if (!pdev->multifunction || pci_acs_enabled(pdev, PCI_ACS_ISOLATED))
-		return NULL;
-
-	for_each_pci_dev(tmp) {
-		if (tmp == pdev || tmp->bus != pdev->bus ||
-		    PCI_SLOT(tmp->devfn) != PCI_SLOT(pdev->devfn) ||
-		    pci_acs_enabled(tmp, PCI_ACS_ISOLATED))
-			continue;
-
-		group = get_pci_alias_group(tmp, devfns);
-		if (group) {
-			pci_dev_put(tmp);
-			return group;
-		}
-	}
-
-	return NULL;
-}
-
-/*
- * Look for aliases to or from the given device for existing groups. DMA
- * aliases are only supported on the same bus, therefore the search
- * space is quite small (especially since we're really only looking at pcie
- * device, and therefore only expect multiple slots on the root complex or
- * downstream switch ports).  It's conceivable though that a pair of
- * multifunction devices could have aliases between them that would cause a
- * loop.  To prevent this, we use a bitmap to track where we've been.
- */
-static struct iommu_group *get_pci_alias_group(struct pci_dev *pdev,
-					       unsigned long *devfns)
-{
-	struct pci_dev *tmp = NULL;
-	struct iommu_group *group;
-
-	if (test_and_set_bit(pdev->devfn & 0xff, devfns))
-		return NULL;
-
-	group = iommu_group_get(&pdev->dev);
-	if (group)
-		return group;
-
-	for_each_pci_dev(tmp) {
-		if (tmp == pdev || tmp->bus != pdev->bus)
-			continue;
-
-		/* We alias them or they alias us */
-		if (pci_devs_are_dma_aliases(pdev, tmp)) {
-			group = get_pci_alias_group(tmp, devfns);
-			if (group) {
-				pci_dev_put(tmp);
-				return group;
-			}
-
-			group = get_pci_function_alias_group(tmp, devfns);
-			if (group) {
-				pci_dev_put(tmp);
-				return group;
-			}
-		}
-	}
-
-	return NULL;
-}
-
 /*
  * Generic device_group call-back function. It just allocates one
  * iommu-group per device.
@@ -1534,44 +1455,108 @@ static struct iommu_group *pci_group_alloc_non_isolated(void)
 	return group;
 }
 
+/*
+ * All functions in the MFD need to be isolated from each other and get their
+ * own groups, otherwise the whole MFD will share a group.
+ */
+static bool pci_mfds_are_same_group(struct pci_dev *deva, struct pci_dev *devb)
+{
+	/*
+	 * SRIOV VFs will use the group of the PF if it has
+	 * BUS_DATA_PCI_NON_ISOLATED. We don't support VFs that also have ACS
+	 * that are set to non-isolating.
+	 */
+	if (deva->is_virtfn || devb->is_virtfn)
+		return false;
+
+	/* Are deva/devb functions in the same MFD? */
+	if (PCI_SLOT(deva->devfn) != PCI_SLOT(devb->devfn))
+		return false;
+	/* Don't understand what is happening, be conservative */
+	if (deva->multifunction != devb->multifunction)
+		return true;
+	if (!deva->multifunction)
+		return false;
+
+	/*
+	 * PCI Section 6.12.1.2 ACS Functions in SR-IOV, SIOV, and
+	 * Multi-Function Devices
+	 *   ...
+	 *   ACS P2P Request Redirect: must be implemented by Functions that
+	 *   support peer-to-peer traffic with other Functions.
+	 *
+	 * Therefore assume if a MFD has no ACS capability then it does not
+	 * support a loopback. This is the reverse of what Linux <= v6.16
+	 * assumed - that any MFD was capable of P2P and used quirks identify
+	 * devices that complied with the above.
+	 */
+	if (deva->acs_cap && !pci_acs_enabled(deva, PCI_ACS_ISOLATED))
+		return true;
+	if (devb->acs_cap && !pci_acs_enabled(devb, PCI_ACS_ISOLATED))
+		return true;
+	return false;
+}
+
+static bool pci_devs_are_same_group(struct pci_dev *deva, struct pci_dev *devb)
+{
+	/*
+	 * This is allowed to return cycles: a,b -> b,c -> c,a can be aliases.
+	 */
+	if (pci_devs_are_dma_aliases(deva, devb))
+		return true;
+
+	return pci_mfds_are_same_group(deva, devb);
+}
+
 /*
  * Return a group if the function has isolation restrictions related to
  * aliases or MFD ACS.
  */
 static struct iommu_group *pci_get_function_group(struct pci_dev *pdev)
 {
-	struct iommu_group *group;
-	DECLARE_BITMAP(devfns, 256) = {};
+	struct pci_reachable_set devfns;
+	const unsigned int NR_DEVFNS = sizeof(devfns.devfns) * BITS_PER_BYTE;
+	unsigned int devfn;
 
 	/*
-	 * Look for existing groups on device aliases.  If we alias another
-	 * device or another device aliases us, use the same group.
+	 * Look for existing groups on device aliases and multi-function ACS. If
+	 * we alias another device or another device aliases us, use the same
+	 * group.
+	 *
+	 * pci_reachable_set() should return the same bitmap if called for any
+	 * device in the set and we want all devices in the set to have the same
+	 * group.
 	 */
-	group = get_pci_alias_group(pdev, devfns);
-	if (group)
-		return group;
+	pci_reachable_set(pdev, &devfns, pci_devs_are_same_group);
+	/* start is known to have iommu_group_get() == NULL */
+	__clear_bit(pdev->devfn, devfns.devfns);
 
 	/*
-	 * Look for existing groups on non-isolated functions on the same
-	 * slot and aliases of those funcions, if any.  No need to clear
-	 * the search bitmap, the tested devfns are still valid.
-	 */
-	group = get_pci_function_alias_group(pdev, devfns);
-	if (group)
-		return group;
-
-	/*
-	 * When MFD's are included in the set due to ACS we assume that if ACS
-	 * permits an internal loopback between functions it also permits the
-	 * loopback to go downstream if a function is a bridge.
+	 * When MFD functions are included in the set due to ACS we assume that
+	 * if ACS permits an internal loopback between functions it also permits
+	 * the loopback to go downstream if any function is a bridge.
 	 *
 	 * It is less clear what aliases mean when applied to a bridge. For now
 	 * be conservative and also propagate the group downstream.
 	 */
-	__clear_bit(pdev->devfn & 0xFF, devfns);
-	if (!bitmap_empty(devfns, sizeof(devfns) * BITS_PER_BYTE))
-		return pci_group_alloc_non_isolated();
-	return NULL;
+	if (bitmap_empty(devfns.devfns, NR_DEVFNS))
+		return NULL;
+
+	for_each_set_bit(devfn, devfns.devfns, NR_DEVFNS) {
+		struct iommu_group *group;
+		struct pci_dev *pdev_slot;
+
+		pdev_slot = pci_get_slot(pdev->bus, devfn);
+		group = iommu_group_get(&pdev_slot->dev);
+		pci_dev_put(pdev_slot);
+		if (group) {
+			if (WARN_ON(!(group->bus_data &
+				      BUS_DATA_PCI_NON_ISOLATED)))
+				group->bus_data |= BUS_DATA_PCI_NON_ISOLATED;
+			return group;
+		}
+	}
+	return pci_group_alloc_non_isolated();
 }
 
 /* Return a group if the upstream hierarchy has isolation restrictions. */
-- 
2.43.0





[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux