Hi Eric,

I recently ran the KUnit test module you wrote for poly1305 on QEMU RISC-V 64. The results show that the optimized implementation is significantly faster than the generic one: benchmark throughput improves by between roughly 1.3x and 1.9x across the measured lengths (for example, from 1216 MB/s to 2267 MB/s at len=16384). The test data are as follows:

--- base.log 2025-07-19 17:41:06.443392989 +0800
+++ optimized.log 2025-07-19 17:40:45.650048601 +0800
@@ -1,31 +1,31 @@
-[ 0.668631] # Subtest: poly1305
-[ 0.668774] # module: poly1305_kunit
-[ 0.668857] 1..12
-[ 0.670267] ok 1 test_hash_test_vectors
-[ 0.679479] ok 2 test_hash_all_lens_up_to_4096
-[ 0.696048] ok 3 test_hash_incremental_updates
-[ 0.697645] ok 4 test_hash_buffer_overruns
-[ 0.701060] ok 5 test_hash_overlaps
-[ 0.702858] ok 6 test_hash_alignment_consistency
-[ 0.703108] ok 7 test_hash_ctx_zeroization
-[ 0.846150] ok 8 test_hash_interrupt_context_1
-[ 1.235247] ok 9 test_hash_interrupt_context_2
-[ 1.250813] ok 10 test_poly1305_allones_keys_and_message
-[ 1.251138] ok 11 test_poly1305_reduction_edge_cases
-[ 1.287196] # benchmark_hash: len=1: 2 MB/s
-[ 1.305363] # benchmark_hash: len=16: 61 MB/s
-[ 1.321102] # benchmark_hash: len=64: 212 MB/s
-[ 1.340105] # benchmark_hash: len=127: 263 MB/s
-[ 1.353880] # benchmark_hash: len=128: 364 MB/s
-[ 1.370118] # benchmark_hash: len=200: 377 MB/s
-[ 1.381879] # benchmark_hash: len=256: 570 MB/s
-[ 1.394125] # benchmark_hash: len=511: 657 MB/s
-[ 1.404265] # benchmark_hash: len=512: 794 MB/s
-[ 1.413356] # benchmark_hash: len=1024: 985 MB/s
-[ 1.421925] # benchmark_hash: len=3173: 1131 MB/s
-[ 1.429956] # benchmark_hash: len=4096: 1218 MB/s
-[ 1.438184] # benchmark_hash: len=16384: 1216 MB/s
-[ 1.438462] ok 12 benchmark_hash
-[ 1.438686] # poly1305: pass:12 fail:0 skip:0 total:12
-[ 1.438763] # Totals: pass:12 fail:0 skip:0 total:12
-[ 1.438904] ok 1 poly1305
+[ 0.666280] # Subtest: poly1305
+[ 0.666413] # module: poly1305_kunit
+[ 0.666490] 1..12
+[ 0.667702] ok 1 test_hash_test_vectors
+[ 0.672896] ok 2 test_hash_all_lens_up_to_4096
+[ 0.686244] ok 3 test_hash_incremental_updates
+[ 0.687263] ok 4 test_hash_buffer_overruns
+[ 0.689957] ok 5 test_hash_overlaps
+[ 0.691393] ok 6 test_hash_alignment_consistency
+[ 0.691622] ok 7 test_hash_ctx_zeroization
+[ 0.769741] ok 8 test_hash_interrupt_context_1
+[ 0.930832] ok 9 test_hash_interrupt_context_2
+[ 0.940068] ok 10 test_poly1305_allones_keys_and_message
+[ 0.940478] ok 11 test_poly1305_reduction_edge_cases
+[ 0.964546] # benchmark_hash: len=1: 3 MB/s
+[ 0.978836] # benchmark_hash: len=16: 78 MB/s
+[ 0.990414] # benchmark_hash: len=64: 289 MB/s
+[ 1.003012] # benchmark_hash: len=127: 397 MB/s
+[ 1.012755] # benchmark_hash: len=128: 517 MB/s
+[ 1.022928] # benchmark_hash: len=200: 603 MB/s
+[ 1.030981] # benchmark_hash: len=256: 835 MB/s
+[ 1.038706] # benchmark_hash: len=511: 1046 MB/s
+[ 1.045233] # benchmark_hash: len=512: 1240 MB/s
+[ 1.050733] # benchmark_hash: len=1024: 1638 MB/s
+[ 1.055620] # benchmark_hash: len=3173: 1998 MB/s
+[ 1.060247] # benchmark_hash: len=4096: 2132 MB/s
+[ 1.064695] # benchmark_hash: len=16384: 2267 MB/s
+[ 1.065179] ok 12 benchmark_hash
+[ 1.065425] # poly1305: pass:12 fail:0 skip:0 total:12
+[ 1.065498] # Totals: pass:12 fail:0 skip:0 total:12
+[ 1.065612] ok 1 poly1305

Next, I plan to validate this performance gain on actual RISC-V hardware. I will also submit a v5 patch to the mailing list.

I look forward to your feedback and suggestions.

- Zhihang
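
P.S. In case it helps when comparing the two logs, the per-length speedup can be derived from the benchmark_hash lines with a few lines of Python. This is only a minimal sketch: the function names parse_throughput, speedups and report are hypothetical helpers, not part of the kernel tree or the KUnit tooling, and the regular expression assumes the "benchmark_hash: len=N: M MB/s" line format shown in the logs above.

```python
import re

# Assumed line format, taken from the logs above, e.g.
# "[ 1.438184] # benchmark_hash: len=16384: 1216 MB/s"
BENCH_RE = re.compile(r"benchmark_hash: len=(\d+): (\d+) MB/s")


def parse_throughput(log_text):
    """Return {message length: throughput in MB/s} for each benchmark line."""
    return {int(length): int(mbps) for length, mbps in BENCH_RE.findall(log_text)}


def speedups(base_text, optimized_text):
    """Return {message length: optimized/base ratio} for lengths in both logs."""
    base = parse_throughput(base_text)
    opt = parse_throughput(optimized_text)
    # Skip lengths missing from either log and avoid dividing by zero.
    return {n: opt[n] / base[n] for n in sorted(base) if n in opt and base[n] > 0}


def report(base_path, optimized_path):
    """Print one 'len=N: R.RRx' line per message length (hypothetical helper)."""
    with open(base_path) as f:
        base_text = f.read()
    with open(optimized_path) as f:
        optimized_text = f.read()
    for n, ratio in speedups(base_text, optimized_text).items():
        print(f"len={n}: {ratio:.2f}x")
```

Calling report("base.log", "optimized.log") with the two files named in the diff header would print one ratio per message length. Note that the MB/s figures in the logs are integers, so the ratio for very short messages (len=1) is coarse.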