On 2025-05-20 11:39 pm, Luis Chamberlain wrote:
Add support to use the IOVA DMA API, and allow comparing and contrasting
it with the streaming DMA API. Since the IOVA API is intended as an
enhancement when using an IOMMU that supports DMA through it, only allow
the IOVA path to be used proactively for devices which have this
support, that is, when use_dma_iommu() is true. We don't try a fallback,
as the goal is clear: only use the IOVA API when intended.
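The per-operation timings reported below correspond to the four stages of
the IOVA API. A rough sketch of one timed iteration on the IOVA path
(kernel-only fragment; the exact arguments are illustrative and error
handling is elided):

```c
/* Sketch of one timed IOVA iteration; the argument values are
 * illustrative and error handling is elided. */
struct dma_iova_state state = {};
phys_addr_t phys = page_to_phys(page);
size_t size = npages << PAGE_SHIFT;

dma_iova_try_alloc(dev, &state, phys, size);       /* "Alloc" stage  */
dma_iova_link(dev, &state, phys, 0, size, dir, 0); /* "Link" stage   */
dma_iova_sync(dev, &state, 0, size);               /* "Sync" stage   */
dma_iova_destroy(dev, &state, size, dir, 0);       /* "Destroy" stage */
```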
Example output, using intel-iommu on qemu against a random number
generator device. This output is completely artificial, as it's a VM
using more threads than the guest even has cores; the goal was just to
visualize some numerical output on both paths:
./tools/testing/selftests/dma/dma_map_benchmark -t 24 -i 2
=== DMA Mapping Benchmark Results ===
Configuration: threads:24 seconds:20 node:-1 dir:BIDIRECTIONAL granule:1 iova:2
Buffer size: 1 pages (4 KB)
STREAMING DMA RESULTS:
Map latency: 12.3 μs (σ=257.9 μs)
Unmap latency: 3.7 μs (σ=142.5 μs)
Total latency: 16.0 μs
IOVA DMA RESULTS:
Alloc latency: 0.1 μs (σ= 31.1 μs)
Link latency: 2.5 μs (σ=116.9 μs)
Sync latency: 9.6 μs (σ=227.8 μs)
Destroy latency: 3.6 μs (σ=141.2 μs)
Total latency: 15.8 μs
PERFORMANCE COMPARISON:
Streaming DMA total: 16.0 μs
IOVA DMA total: 15.8 μs
Performance ratio: 0.99x (IOVA is 1.3% faster)
Streaming throughput: 62500 ops/sec
IOVA throughput: 63291 ops/sec
Streaming bandwidth: 244.1 MB/s
IOVA bandwidth: 247.2 MB/s
IOVA OPERATION BREAKDOWN:
Alloc: 0.6% ( 0.1 μs)
Link: 15.8% ( 2.5 μs)
Sync: 60.8% ( 9.6 μs)
Destroy: 22.8% ( 3.6 μs)
RECOMMENDATIONS:
~ IOVA and Streaming APIs show similar performance
=== End of Benchmark ===
Signed-off-by: Luis Chamberlain <mcgrof@xxxxxxxxxx>
---
include/linux/map_benchmark.h | 11 +
kernel/dma/Kconfig | 4 +-
kernel/dma/map_benchmark.c | 417 +++++++++++++++++-
.../testing/selftests/dma/dma_map_benchmark.c | 145 +++++-
4 files changed, 562 insertions(+), 15 deletions(-)
diff --git a/include/linux/map_benchmark.h b/include/linux/map_benchmark.h
index 62674c83bde4..da7c9e3ddf21 100644
--- a/include/linux/map_benchmark.h
+++ b/include/linux/map_benchmark.h
@@ -7,6 +7,7 @@
#define _KERNEL_DMA_BENCHMARK_H
#define DMA_MAP_BENCHMARK _IOWR('d', 1, struct map_benchmark)
+#define DMA_MAP_BENCHMARK_IOVA _IOWR('d', 2, struct map_benchmark)
#define DMA_MAP_MAX_THREADS 1024
#define DMA_MAP_MAX_SECONDS 300
#define DMA_MAP_MAX_TRANS_DELAY (10 * NSEC_PER_MSEC)
@@ -27,5 +28,15 @@ struct map_benchmark {
__u32 dma_dir; /* DMA data direction */
__u32 dma_trans_ns; /* time for DMA transmission in ns */
__u32 granule; /* how many PAGE_SIZE will do map/unmap once a time */
+ __u32 has_iommu_dma;
Why would userspace care about this? Either they asked for a streaming
benchmark and it's irrelevant, or they asked for an IOVA benchmark,
which either succeeded, or failed and this is ignored anyway.
+ __u64 avg_iova_alloc_100ns;
+ __u64 avg_iova_link_100ns;
+ __u64 avg_iova_sync_100ns;
+ __u64 avg_iova_destroy_100ns;
+ __u64 iova_alloc_stddev;
+ __u64 iova_link_stddev;
+ __u64 iova_sync_stddev;
+ __u64 iova_destroy_stddev;
+ __u32 use_iova; /* 0=regular, 1=IOVA, 2=both */
Conversely, why should the kernel have to care about this? If userspace
wants both benchmarks, they can just run both benchmarks, with whatever
number of threads for each they fancy. No need to have all that
complexity kernel-side. If there's a valid desire for running multiple
different benchmarks *simultaneously* then we should support that in
general (I can imagine it being potentially interesting to thrash the
IOVA allocator with several different sizes at once, for example.)
That way, I'd also be inclined to give the new ioctl its own separate
structure for IOVA results, and avoid impacting the existing ABI.
};
#endif /* _KERNEL_DMA_BENCHMARK_H */
diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
index 31cfdb6b4bc3..e2d5784f46eb 100644
--- a/kernel/dma/Kconfig
+++ b/kernel/dma/Kconfig
@@ -261,10 +261,10 @@ config DMA_API_DEBUG
If unsure, say N.
config DMA_MAP_BENCHMARK
- bool "Enable benchmarking of streaming DMA mapping"
+ bool "Enable benchmarking of streaming and IOVA DMA mapping"
depends on DEBUG_FS
help
Provides /sys/kernel/debug/dma_map_benchmark that helps with testing
- performance of dma_(un)map_page.
+ performance of the streaming DMA dma_(un)map_page and IOVA API.
See tools/testing/selftests/dma/dma_map_benchmark.c
diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
index b54345a757cb..3ae34433420b 100644
--- a/kernel/dma/map_benchmark.c
+++ b/kernel/dma/map_benchmark.c
@@ -18,6 +18,7 @@
#include <linux/platform_device.h>
#include <linux/slab.h>
#include <linux/timekeeping.h>
+#include <linux/iommu-dma.h>
Nit: these are currently in nice alphabetical order.
struct map_benchmark_data {
struct map_benchmark bparam;
[...]
@@ -112,7 +231,250 @@ static int map_benchmark_thread(void *data)
return ret;
}
-static int do_map_benchmark(struct map_benchmark_data *map)
+static int do_iova_benchmark(struct map_benchmark_data *map)
+{
+ struct task_struct **tsk;
+ int threads = map->bparam.threads;
+ int node = map->bparam.node;
+ u64 iova_loops;
+ int ret = 0;
+ int i;
+
+ tsk = kmalloc_array(threads, sizeof(*tsk), GFP_KERNEL);
+ if (!tsk)
+ return -ENOMEM;
+
+ get_device(map->dev);
+
+ /* Create IOVA threads only */
+ for (i = 0; i < threads; i++) {
+ tsk[i] = kthread_create_on_node(benchmark_thread_iova, map,
+ node, "dma-iova-benchmark/%d", i);
+ if (IS_ERR(tsk[i])) {
+ pr_err("create dma_iova thread failed\n");
+ ret = PTR_ERR(tsk[i]);
+ while (--i >= 0)
+ kthread_stop(tsk[i]);
+ goto out;
+ }
+
+ if (node != NUMA_NO_NODE)
+ kthread_bind_mask(tsk[i], cpumask_of_node(node));
+ }
Duplicating all the thread-wrangling code seems needlessly horrible -
surely it's easy enough to factor out the stats initialisation and final
calculation, along with the thread function itself. Perhaps as callbacks
in the map_benchmark_data?
Similarly, each "thread function" itself only actually needs to
consist of the respective "while (!kthread_should_stop())" loop - the
rest of map_benchmark_thread() could still be used as a common harness
to avoid duplicating the buffer management code as well.
Thanks,
Robin.