On 10/06/2025 07.06, Christoph Hellwig wrote:
> Use the blk_rq_dma_map API to DMA map requests instead of scatterlists.
> This removes the need to allocate a scatterlist covering every segment,
> and thus the overall transfer length limit based on the scatterlist
> allocation.
>
> Instead the DMA mapping is done by iterating the bio_vec chain in the
> request directly.  The unmap is handled differently depending on how
> we mapped:
>
>  - when using an IOMMU only a single IOVA is used, and it is stored in
>    iova_state
>  - for direct mappings that don't use swiotlb and are cache coherent no
>    unmap is needed at all
>  - for direct mappings that are not cache coherent or use swiotlb, the
>    physical addresses are rebuilt from the PRPs or SGL segments
>
> The latter unfortunately adds a fair amount of code to the driver, but
> it is code not used in the fast path.
>
> The conversion only covers the data mapping path, and still uses a
> scatterlist for the multi-segment metadata case.  I plan to convert that
> as soon as we have good test coverage for the multi-segment metadata
> path.
>
> Thanks to Chaitanya Kulkarni for an initial attempt at a new DMA API
> conversion for nvme-pci, Kanchan Joshi for bringing back the single
> segment optimization, Leon Romanovsky for shepherding this through a
> gazillion rebases and Nitesh Shetty for various improvements.
>
> Signed-off-by: Christoph Hellwig <hch@xxxxxx>
> ---
>  drivers/nvme/host/pci.c | 388 +++++++++++++++++++++++++---------------
>  1 file changed, 242 insertions(+), 146 deletions(-)
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 04461efb6d27..2d3573293d0c 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -7,7 +7,7 @@
>  #include <linux/acpi.h>
>  #include <linux/async.h>
>  #include <linux/blkdev.h>
> -#include <linux/blk-mq.h>
> +#include <linux/blk-mq-dma.h>
>  #include <linux/blk-integrity.h>
>  #include <linux/dmi.h>
>  #include <linux/init.h>
> @@ -27,7 +27,6 @@
>  #include <linux/io-64-nonatomic-lo-hi.h>
>  #include <linux/io-64-nonatomic-hi-lo.h>
>  #include <linux/sed-opal.h>
> -#include <linux/pci-p2pdma.h>
>
>  #include "trace.h"
>  #include "nvme.h"
> @@ -46,13 +45,11 @@
>  #define NVME_MAX_NR_DESCRIPTORS	5
>
>  /*
> - * For data SGLs we support a single descriptors worth of SGL entries, but for
> - * now we also limit it to avoid an allocation larger than PAGE_SIZE for the
> - * scatterlist.
> + * For data SGLs we support a single descriptors worth of SGL entries.
> + * For PRPs, segments don't matter at all.
>   */
>  #define NVME_MAX_SEGS \
> -	min(NVME_CTRL_PAGE_SIZE / sizeof(struct nvme_sgl_desc), \
> -	    (PAGE_SIZE / sizeof(struct scatterlist)))
> +	(NVME_CTRL_PAGE_SIZE / sizeof(struct nvme_sgl_desc))

The 8 MiB max transfer size is only reachable if host segments are at
least 32k. But I think this limitation is only on the SGL side, right?
Adding support for multiple SGL segments should allow us to increase
this limit from 256 to 2048 segments. Is this correct?