On Wed, Jun 11, 2025 at 05:43:16AM +0200, Christoph Hellwig wrote:
> On Tue, Jun 10, 2025 at 09:37:30AM -0600, Keith Busch wrote:
> > I may be out of the loop here. Is this an optimization to make something
> > easier for the DMA layer?
>
> Yes.  P2P that is based on a bus address (i.e. using a switch) uses
> a completely different way to DMA MAP than the normal IOMMU or
> direct mapping.  So the optimization of collapsing all host physical
> addresses into an iova can't work once it is present.
>
> > I don't think there's any fundamental reason
> > why devices like nvme couldn't handle a command that uses memory mixed
> > among multiple devices and/or host memory, at least.
>
> Sure, devices don't even see if an IOVA is P2P or not, this is all
> host side.

Sorry for my ignorant questions here, but I'm not sure how this setup
(P2P transactions through switches with the IOMMU enabled) actually
works, and I'd like to understand it better.

If I recall correctly, the PCIe ACS features by default redirect
everything up to the root complex when the IOMMU is on. A device can set
the Address Type field in its memory request TLPs so that the switch
routes the transaction directly to a peer device instead, but how does
the nvme device know how to set its memory requests' AT field? Nothing
in a command says whether an address is an untranslated IOVA or a
translated peer address, right?

Lacking some mechanism to tell the nvme controller what kind of address
it is dealing with, wouldn't you be forced to map peer addresses with
the IOMMU, so that P2P transactions make a round trip through it using
only mapped IOVAs?
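
To check my understanding of the host-side distinction you describe,
here is a toy model of how I think the mapping decision goes (this is
not real kernel code, all names below are invented for illustration;
please correct me if the model is wrong):

/*
 * Toy model of the host-side choice: for each peer BAR page the DMA
 * layer hands the device either the peer's PCI bus address (the switch
 * routes the TLP directly, below the root complex) or an IOMMU-mapped
 * IOVA (traffic makes a round trip through the root complex / IOMMU).
 */
#include <stdbool.h>
#include <stdint.h>

struct p2p_page {
	uint64_t host_phys;	/* host physical address of the BAR page */
	uint64_t bus_addr;	/* peer's address as seen on the PCI bus */
	bool     direct_switch_path;	/* initiator and target sit under
					 * the same switch and ACS does not
					 * force upstream redirection */
};

/* stand-in for the real IOMMU mapping path; here it just reuses the
 * host physical address instead of allocating an IOVA */
static uint64_t iommu_map_iova(uint64_t host_phys)
{
	return host_phys;
}

static uint64_t p2p_dma_addr(const struct p2p_page *pg)
{
	if (pg->direct_switch_path) {
		/* bus-address P2P: the device is given the peer's bus
		 * address and the switch forwards the request directly */
		return pg->bus_addr;
	}
	/* otherwise the peer page is mapped like ordinary memory: the
	 * device only ever sees an IOVA, which the IOMMU translates and
	 * the root complex routes back down to the peer */
	return iommu_map_iova(pg->host_phys);
}

If that picture is roughly right, my question above is about the first
branch: how the device's requests end up routed directly by the switch
when the IOMMU otherwise expects untranslated IOVAs from it.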