Re: [PATCH v2 00/16] Fix incorrect iommu_groups with PCIe ACS

On 8/6/2025 10:41 AM, Baolu Lu wrote:
On 8/6/25 10:22, Ethan Zhao wrote:
On 8/5/2025 10:43 PM, Jason Gunthorpe wrote:
On Tue, Aug 05, 2025 at 10:41:03PM +0800, Ethan Zhao wrote:

My understanding is that the iommu subsystem has no logic yet to handle
the egress control vector configuration case,

We don't support it at all. If some FW leaves it configured then it
will work at the PCI level, but Linux has no awareness of what it is
doing.

Arguably Linux should disable it on boot, but we don't..
Linux tools like setpci can access raw PCIe configuration data, and so
they can touch the ACS control bits. That is worrying.

Any change to ACS after boot is "not supported" - iommu groups are
created one time only, using the boot configuration. If someone wants
to customize ACS they need to use the new config_acs kernel parameter.
That would leave ACS as boot-time configuration only. Linux never
limits tools from accessing (writing) hardware directly, even though it
could do that. Would it be better to have an interception/configurable
policy in the kernel for such hardware access behavior, like what a
hypervisor does for MSRs etc.?

A root user could even clear the BME or MSE bits of a device's PCIe
configuration space, even if the device is already bound to a driver and
operating normally. I don't think there's a mechanism to prevent that
from happening, besides permission enforcement. I believe that the same
applies to the ACS control.
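
For illustration, a minimal userspace sketch of that point (untested;
the device address 0000:03:00.0 is only a placeholder, and it assumes a
little-endian host). Root can do a raw read-modify-write of the Command
register through the sysfs config file and clear Bus Master Enable out
from under a bound driver; the ACS Control register can be hit the same
way once its extended capability offset is located:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* Placeholder address; substitute any function from lspci. */
	const char *path = "/sys/bus/pci/devices/0000:03:00.0/config";
	uint16_t cmd;
	int fd = open(path, O_RDWR);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	pread(fd, &cmd, sizeof(cmd), 0x04);	/* PCI Command register */
	cmd &= ~(1 << 2);			/* clear Bus Master Enable */
	pwrite(fd, &cmd, sizeof(cmd), 0x04);	/* device DMA stops at once */
	close(fd);
	return 0;
}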


The static groups were created according to
FW DRDB tables,

?? iommu_groups have nothing to do with FW tables.
Sorry, typo - I meant the ACPI DRHD table.

Same answer, AFAIK FW tables have no effect on iommu_groups
My understanding is that FW tables are part of the description of the
device topology and the iommu-device relationship. Did I really
misunderstand something?

The ACPI/DMAR table describes the platform's IOMMU topology, not the
device topology, which is described by the PCI bus. So, the firmware
table doesn't impact the iommu_group.
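
To make that concrete, here is a simplified sketch (not the actual
upstream code; group_covering_subtree() is a hypothetical helper) of
the way pci_device_group() in drivers/iommu/iommu.c derives groups from
the live PCI topology and ACS state, with no firmware-table input at
all:

#include <linux/iommu.h>
#include <linux/pci.h>

static struct iommu_group *sketch_pci_device_group(struct pci_dev *pdev)
{
	u16 isolation = PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF;
	struct pci_dev *p;

	/*
	 * Walk toward the root port. The first hop without isolating
	 * ACS can route peer-to-peer traffic, so everything below it
	 * must share one iommu_group.
	 */
	for (p = pdev; p; p = pci_upstream_bridge(p)) {
		if (!pci_acs_enabled(p, isolation))
			return group_covering_subtree(p); /* hypothetical */
	}

	/* The whole upstream path is isolated: allocate a fresh group. */
	return iommu_group_alloc();
}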
There is the kernel lockdown LSM, which works via the kernel parameter
lockdown={integrity | confidentiality}:

"If set to integrity, kernel features that allow userland to modify the running kernel are disabled. If set to confidentiality, kernel features that allow userland to extract confidential information from the kernel
are also disabled. "

It also covers direct userland access to PCIe configuration space, but
the granularity is coarse: the kernel cannot be configured to prevent
only userland access to PCI device configuration space.

The design and implementation are structured and the granularity is
fine-grained, so it would be easy to extend it with a new kernel
parameter that locks down only PCI device configuration space access.

[LOCKDOWN_PCI_ACCESS] = "direct PCI access"
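
That string is the entry in the kernel's lockdown_reasons table, and
the PCI side already consults the hook; roughly (simplified from
pci_write_config() in drivers/pci/pci-sysfs.c, with the actual write
elided), a finer-grained knob could gate the same path under a new
reason of its own:

static ssize_t pci_write_config(struct file *filp, struct kobject *kobj,
				struct bin_attribute *bin_attr, char *buf,
				loff_t off, size_t count)
{
	int ret;

	/* Existing check: refuse raw config writes under lockdown. */
	ret = security_locked_down(LOCKDOWN_PCI_ACCESS);
	if (ret)
		return ret;

	/* ... byte-wise write into the device's config space ... */
	return count;
}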

Thanks,
Ethan




Thanks,
baolu




