Hi,
Next, I plan to validate this performance gain on actual RISC-V hardware.
I've rerun the benchmarks behind the cycles-per-byte results quoted in the poly1305-riscv.pl commentary section, and it appears that my U74 results were off. I must have made a wrong assumption about the clock frequency, or failed to notice that the [shared] system was busy. Either way, U74 delivers 1.8 cpb, both in the initial processor version and in the one with additional ISA capabilities such as Zbb, i.e. JH7100 and JH7110 respectively. For reference, cpb is calculated by dividing the clock frequency (in MHz) by the measured MBps rate.
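To make the cpb arithmetic concrete, here is a minimal sketch of the calculation. The function names and the input numbers are made up for illustration and are not measurements from any of the boards mentioned. It also assumes that MBps means 10^6 bytes per second.

```python
def cycles_per_byte(clock_hz: float, bytes_per_second: float) -> float:
    """Cycles spent per processed byte: clock rate divided by throughput."""
    if bytes_per_second <= 0:
        raise ValueError("throughput must be positive")
    return clock_hz / bytes_per_second


def cpb_from_mhz_mbps(clock_mhz: float, rate_mbps: float) -> float:
    """Same calculation with inputs in MHz and MBps.

    Assumption: 1 MBps = 10**6 bytes/s, so the 10**6 factors cancel.
    """
    return cycles_per_byte(clock_mhz * 1e6, rate_mbps * 1e6)


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to keep the arithmetic obvious.
    measured_mbps = 500.0
    assumed_mhz = 1000.0
    actual_mhz = 1200.0

    print(f"assumed clock: {cpb_from_mhz_mbps(assumed_mhz, measured_mbps):.2f} cpb")
    print(f"actual clock:  {cpb_from_mhz_mbps(actual_mhz, measured_mbps):.2f} cpb")
```

Since the result is directly proportional to the clock frequency, a wrong frequency assumption skews the cpb figure by the same ratio, and a busy system skews it by lowering the measured rate.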
I also have a vector implementation cooking. It's not ready to be released, because it doesn't yet scale with vlenb and works only on a 256-bit vector unit. It achieves 1.3 cpb on Spacemit X60, a 2.5x improvement over scalar code. To be clear, one can't expect the coefficient to be the same on other processors.
Cheers.