On Wed, Jun 11, 2025 at 05:43:16AM +0200, Christoph Hellwig wrote:
> On Tue, Jun 10, 2025 at 09:37:30AM -0600, Keith Busch wrote:
> > I may be out of the loop here. Is this an optimization to make something
> > easier for the DMA layer?
>
> Yes.  P2P that is based on a bus address (i.e. using a switch) uses
> a completely different way to DMA MAP than the normal IOMMU or
> direct mapping.  So the optimization of collapsing all host physical
> addresses into an iova can't work once it is present.
>
> > I don't think there's any fundamental reason
> > why devices like nvme couldn't handle a command that uses memory mixed
> > among multiple devices and/or host memory, at least.
>
> Sure, devices don't even see if an IOVA is P2P or not, this is all
> host side.

Sorry for my ignorant questions here, but I'm not sure how this setup
(P2P transactions through switches with the IOMMU enabled) actually
works, and I'd like to understand it better.

If I recall correctly, the PCIe ACS features by default redirect
everything up to the root complex when the IOMMU is on. A device can set
the Address Type field in its memory request TLPs so that the switch
routes the transaction directly to a peer device instead, but how does
the nvme device know how to set its memory requests' AT field? Nothing
in a command says whether an address is an untranslated IOVA or a
translated peer address, right?

Lacking some mechanism to tell the nvme controller what kind of address
it is dealing with, wouldn't you be forced to map peer addresses with
the IOMMU, so that P2P transactions make a round trip through it using
only mapped IOVAs?
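
To check my understanding of the host-side distinction you describe,
here is a toy model of how I think the mapping decision goes (this is
not real kernel code, all names below are invented for illustration;
please correct me if the model is wrong):

/*
 * Toy model of the host-side choice: for each peer BAR page the DMA
 * layer hands the device either the peer's PCI bus address (the switch
 * routes the TLP directly, below the root complex) or an IOMMU-mapped
 * IOVA (traffic makes a round trip through the root complex / IOMMU).
 */
#include <stdbool.h>
#include <stdint.h>

struct p2p_page {
	uint64_t host_phys;	/* host physical address of the BAR page */
	uint64_t bus_addr;	/* peer's address as seen on the PCI bus */
	bool     direct_switch_path;	/* initiator and target sit under
					 * the same switch and ACS does not
					 * force upstream redirection */
};

/* stand-in for the real IOMMU mapping path; here it just reuses the
 * host physical address instead of allocating an IOVA */
static uint64_t iommu_map_iova(uint64_t host_phys)
{
	return host_phys;
}

static uint64_t p2p_dma_addr(const struct p2p_page *pg)
{
	if (pg->direct_switch_path) {
		/* bus-address P2P: the device is given the peer's bus
		 * address and the switch forwards the request directly */
		return pg->bus_addr;
	}
	/* otherwise the peer page is mapped like ordinary memory: the
	 * device only ever sees an IOVA, which the IOMMU translates and
	 * the root complex routes back down to the peer */
	return iommu_map_iova(pg->host_phys);
}

If that picture is roughly right, my question above is about the first
branch: how the device's requests end up routed directly by the switch
when the IOMMU otherwise expects untranslated IOVAs from it.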