On Tue, Jun 10, 2025 at 04:37:58PM +0100, Robin Murphy wrote: > On 2025-06-09 7:45 pm, Nicolin Chen wrote: > > Hi all, > > > > Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software should disable ATS > > before initiating a Function Level Reset, and then ensure no invalidation > > requests being issued to a device when its ATS capability is disabled. > > Not really - what it says is that software should not expect to receive > invalidate completions from a function which is in the process of being > reset or powered off, and if software doesn't want to be confused by that > then it should take care to wait for completion or timeout of all > outstanding requests, and avoid issuing new requests, before initiating such > a reset or power transition. The commit message can be more precise, but I agree with the conclusion that the right direction for Linux is to disable and block ATS, instead of trying to ignore completion time out events, or trying to block page table mutations. Ie do what the implementation note says.. Maybe: PCIe permits a device to ignore ATS invalidation TLPs while it is processing FLR. This creates a problem visible to the OS where ATS invalidation commands will time out. For instance a SVA domain will have no coordination with a FLR event and can racily issue ATC invalidations into a resetting device. The OS should do something to mitigate this as we do not want production systems to be reporting critical ATS failures, especially in a hypervisor environment. Broadly the OS could arrange to ignore the timeout, block page table mutations to prevent invalidations, or disable and block ATS. The PCIe spec in sec 10.3.1 IMPLEMENTATION NOTE recommends to disable and block ATS, and we already have iommu driver support to implement something like this. Implement this approach in the iommu core. Provide a callback from the PCI subsystem that will enclose the FLR and have the iommu core temporarily change all the domain attachments into BLOCKED. When attaching a BLOCKED domain IOMMU drivers should fence any incoming ATS queries, synchronously stop issuing new ATS invalidations, and synchronously wait for all ATS invalidations to complete. This will avoid any ATS invaliation time outs. IOMMU drivers may also disable ATS in PCI config space, but it is not required to solve the completion timeout problem. The PCI FLR logic will put all the iommu owned config space bits back before completing. During this period holding the group mutex will not allow new domains to be attached to prevent any new ATS invalidations. > > Therefore, there needs to be a sync between IOMMU and PCI subsystems, to > > ensure that ATS will be disabled and never gets re-enabled until the FLR > > finishes. Add a pair of new IOMMU helpers for PCI reset functions to call > > before and after the reset routines. These two helpers will temporally > > attach the device's RID/PASID to IOMMU_DOMAIN_BLOCKED, which should allow > > its IOMMU driver to pause any DMA traffic and disable ATS feature until > > the FLR is done. > > FLR must inherently stop the function from issuing any kind of requests > anyway (see 6.6.2), so "pausing" traffic at the IOMMU end seems like a > non-issue. Yeah.. I haven't liked this blocking domain approach too much, but I was convinced.. It has the strong advantage of being simple to implement and not requiring any special cases or code in the driver. We don't actually need the driver to disable ATS in PCIe config space, it only needs to stop sending invalidations and fail all new ATS requests. All existing drivers inherently do this already for blocked domains, so this should solve the problem for everyone. > I guess I can see how messing with the domain attachment > underneath the rest of the group manages to prevent new invalidate requests > from group->domain being issued to the given function, but it's pretty > horrid - leaving the mutex blocked might be just about tolerable for an FLR > that's supposed to take no longer than 100ms, but what if we do want to do > this for suspend/resume as well? I don't view this a problem for FLR, we can hold a mutex for a long time. It principally delays domain changes which are kind of nonsense to be doing concurrently with FLR in the first place.. However, for suspend, we probably want to leave a marker in the group that the group is force-blocked and all domain attach/detach logic will only update the group tracking structures and not call into the iommu driver. When the resume happens the core will set the current group domain list to the iommu driver. No need for a long lived lock this way. Jason