On 2025-05-20 11:39 pm, Luis Chamberlain wrote:
Add support to use the IOVA DMA API, and allow comparing and contrasting
it with the streaming DMA API. Since the IOVA API is intended as an
enhancement when using an IOMMU that supports DMA through it, only allow
the IOVA path to be used proactively for devices which have this
support, that is, when use_dma_iommu() is true. We don't try a fallback,
as the goal is clear: only use the IOVA API when intended.
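The per-operation timings reported below correspond to the four stages of
the IOVA API. A rough sketch of one timed iteration on the IOVA path
(kernel-only fragment; the exact arguments are illustrative and error
handling is elided):

```c
/* Sketch of one timed IOVA iteration; the argument values are
 * illustrative and error handling is elided. */
struct dma_iova_state state = {};
phys_addr_t phys = page_to_phys(page);
size_t size = npages << PAGE_SHIFT;

dma_iova_try_alloc(dev, &state, phys, size);       /* "Alloc" stage  */
dma_iova_link(dev, &state, phys, 0, size, dir, 0); /* "Link" stage   */
dma_iova_sync(dev, &state, 0, size);               /* "Sync" stage   */
dma_iova_destroy(dev, &state, size, dir, 0);       /* "Destroy" stage */
```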
Example output, using intel-iommu on qemu against a random number
generator device. This output is completely artificial, as it's a VM
using more threads than the guest even has cores; the goal was just to
visualize some numerical output on both paths:
./tools/testing/selftests/dma/dma_map_benchmark -t 24 -i 2
=== DMA Mapping Benchmark Results ===
Configuration: threads:24 seconds:20 node:-1 dir:BIDIRECTIONAL granule:1 iova:2
Buffer size: 1 pages (4 KB)
STREAMING DMA RESULTS:
Map latency: 12.3 μs (σ=257.9 μs)
Unmap latency: 3.7 μs (σ=142.5 μs)
Total latency: 16.0 μs
IOVA DMA RESULTS:
Alloc latency: 0.1 μs (σ= 31.1 μs)
Link latency: 2.5 μs (σ=116.9 μs)
Sync latency: 9.6 μs (σ=227.8 μs)
Destroy latency: 3.6 μs (σ=141.2 μs)
Total latency: 15.8 μs
PERFORMANCE COMPARISON:
Streaming DMA total: 16.0 μs
IOVA DMA total: 15.8 μs
Performance ratio: 0.99x (IOVA is 1.3% faster)
Streaming throughput: 62500 ops/sec
IOVA throughput: 63291 ops/sec
Streaming bandwidth: 244.1 MB/s
IOVA bandwidth: 247.2 MB/s
IOVA OPERATION BREAKDOWN:
Alloc: 0.6% ( 0.1 μs)
Link: 15.8% ( 2.5 μs)
Sync: 60.8% ( 9.6 μs)
Destroy: 22.8% ( 3.6 μs)
RECOMMENDATIONS:
~ IOVA and Streaming APIs show similar performance
=== End of Benchmark ===
Signed-off-by: Luis Chamberlain <mcgrof@xxxxxxxxxx>
---
include/linux/map_benchmark.h | 11 +
kernel/dma/Kconfig | 4 +-
kernel/dma/map_benchmark.c | 417 +++++++++++++++++-
.../testing/selftests/dma/dma_map_benchmark.c | 145 +++++-
4 files changed, 562 insertions(+), 15 deletions(-)
diff --git a/include/linux/map_benchmark.h b/include/linux/map_benchmark.h
index 62674c83bde4..da7c9e3ddf21 100644
--- a/include/linux/map_benchmark.h
+++ b/include/linux/map_benchmark.h
@@ -7,6 +7,7 @@
#define _KERNEL_DMA_BENCHMARK_H
#define DMA_MAP_BENCHMARK _IOWR('d', 1, struct map_benchmark)
+#define DMA_MAP_BENCHMARK_IOVA _IOWR('d', 2, struct map_benchmark)
#define DMA_MAP_MAX_THREADS 1024
#define DMA_MAP_MAX_SECONDS 300
#define DMA_MAP_MAX_TRANS_DELAY (10 * NSEC_PER_MSEC)
@@ -27,5 +28,15 @@ struct map_benchmark {
__u32 dma_dir; /* DMA data direction */
__u32 dma_trans_ns; /* time for DMA transmission in ns */
__u32 granule; /* how many PAGE_SIZE will do map/unmap once a time */
+ __u32 has_iommu_dma;
Why would userspace care about this? Either they asked for a streaming
benchmark and it's irrelevant, or they asked for an IOVA benchmark,
which either succeeded, or failed and this is ignored anyway.
+ __u64 avg_iova_alloc_100ns;
+ __u64 avg_iova_link_100ns;
+ __u64 avg_iova_sync_100ns;
+ __u64 avg_iova_destroy_100ns;
+ __u64 iova_alloc_stddev;
+ __u64 iova_link_stddev;
+ __u64 iova_sync_stddev;
+ __u64 iova_destroy_stddev;
+ __u32 use_iova; /* 0=regular, 1=IOVA, 2=both */
Conversely, why should the kernel have to care about this? If userspace
wants both benchmarks, they can just run both benchmarks, with whatever
number of threads for each they fancy. No need to have all that
complexity kernel-side. If there's a valid desire for running multiple
different benchmarks *simultaneously* then we should support that in
general (I can imagine it being potentially interesting to thrash the
IOVA allocator with several different sizes at once, for example.)
That way, I'd also be inclined to give the new ioctl its own separate
structure for IOVA results, and avoid impacting the existing ABI.
};
#endif /* _KERNEL_DMA_BENCHMARK_H */
diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
index 31cfdb6b4bc3..e2d5784f46eb 100644
--- a/kernel/dma/Kconfig
+++ b/kernel/dma/Kconfig
@@ -261,10 +261,10 @@ config DMA_API_DEBUG
If unsure, say N.
config DMA_MAP_BENCHMARK
- bool "Enable benchmarking of streaming DMA mapping"
+ bool "Enable benchmarking of streaming and IOVA DMA mapping"
depends on DEBUG_FS
help
Provides /sys/kernel/debug/dma_map_benchmark that helps with testing
- performance of dma_(un)map_page.
+ performance of the streaming DMA dma_(un)map_page and IOVA API.
See tools/testing/selftests/dma/dma_map_benchmark.c
diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
index b54345a757cb..3ae34433420b 100644
--- a/kernel/dma/map_benchmark.c
+++ b/kernel/dma/map_benchmark.c
@@ -18,6 +18,7 @@
#include <linux/platform_device.h>
#include <linux/slab.h>
#include <linux/timekeeping.h>
+#include <linux/iommu-dma.h>
Nit: these are currently in nice alphabetical order.
struct map_benchmark_data {
struct map_benchmark bparam;
[...]
@@ -112,7 +231,250 @@ static int map_benchmark_thread(void *data)
return ret;
}
-static int do_map_benchmark(struct map_benchmark_data *map)
+static int do_iova_benchmark(struct map_benchmark_data *map)
+{
+ struct task_struct **tsk;
+ int threads = map->bparam.threads;
+ int node = map->bparam.node;
+ u64 iova_loops;
+ int ret = 0;
+ int i;
+
+ tsk = kmalloc_array(threads, sizeof(*tsk), GFP_KERNEL);
+ if (!tsk)
+ return -ENOMEM;
+
+ get_device(map->dev);
+
+ /* Create IOVA threads only */
+ for (i = 0; i < threads; i++) {
+ tsk[i] = kthread_create_on_node(benchmark_thread_iova, map,
+ node, "dma-iova-benchmark/%d", i);
+ if (IS_ERR(tsk[i])) {
+ pr_err("create dma_iova thread failed\n");
+ ret = PTR_ERR(tsk[i]);
+ while (--i >= 0)
+ kthread_stop(tsk[i]);
+ goto out;
+ }
+
+ if (node != NUMA_NO_NODE)
+ kthread_bind_mask(tsk[i], cpumask_of_node(node));
+ }
Duplicating all the thread-wrangling code seems needlessly horrible -
surely it's easy enough to factor out the stats initialisation and final
calculation, along with the thread function itself. Perhaps as callbacks
in the map_benchmark_data?
Similarly, each "thread function" itself only actually needs to
consist of the respective "while (!kthread_should_stop())" loop - the
rest of map_benchmark_thread() could still be used as a common harness
to avoid duplicating the buffer management code as well.
Thanks,
Robin.