The file 'poly1305-riscv.pl' is taken verbatim from the upstream GitHub repository [0] at commit 5e3fba73576244708a752fa61a8e93e587f271bb. This patch was tested on a SpacemiT X60, where it shows a 2x to 2.5x improvement over the generic implementation.
Just in case: it should not come as a surprise that the improvement coefficient is higher than the one quoted in poly1305-riscv.pl. The baselines are simply different. The Linux baseline uses 9 widening multiplications, while the baseline used for the assembly module in question uses 4 widening multiplications plus 2 non-widening ones.
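For illustration, here is a minimal C sketch of a multiplication step with the 4 widening plus 2 non-widening shape. It is not code from poly1305-riscv.pl or from the Linux tree. It assumes the usual base 2^64 representation with a clamped key, the struct and function names are hypothetical, and it relies on the `unsigned __int128` extension of GCC and Clang. The carry handling is written for readability, not for constant-time guarantees.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;   /* GCC/Clang extension, assumed available */

/* Hypothetical layout: h = h[0] + h[1]*2^64 + h[2]*2^128, with h[2] small. */
struct poly1305_sketch {
    uint64_t h[3];   /* accumulator */
    uint64_t r[2];   /* clamped key half, r = r[0] + r[1]*2^64 */
};

/* h = h * r mod 2^130-5, left partially reduced (h may still exceed p). */
static void poly1305_mul_sketch(struct poly1305_sketch *st)
{
    uint64_t h0 = st->h[0], h1 = st->h[1], h2 = st->h[2];
    uint64_t r0 = st->r[0], r1 = st->r[1];
    uint64_t s1, c;
    u128 d0, d1;

    /* Clamping clears the two low bits of r1, so 5*(r1/4) is exact.
     * Since 2^130 = 5 mod p, r1*2^128 folds down to s1. */
    assert((r1 & 3) == 0);
    s1 = r1 + (r1 >> 2);

    /* Four widening (64x64->128) multiplications. */
    d0 = (u128)h0 * r0 + (u128)h1 * s1;
    d1 = (u128)h0 * r1 + (u128)h1 * r0;

    /* Two non-widening multiplications: h2 is only a few bits wide,
     * so both products fit in 64 bits for a clamped r. */
    d1 += h2 * s1;
    h2 = h2 * r0;

    /* Propagate carries: h2:h1:h0 = h2*2^128 + d1*2^64 + d0. */
    h0 = (uint64_t)d0;
    d1 += d0 >> 64;
    h1 = (uint64_t)d1;
    h2 += (uint64_t)(d1 >> 64);

    /* Fold bits above 2^130 back in, multiplied by 5. */
    c = (h2 >> 2) + (h2 & ~(uint64_t)3);   /* equals 5 * (h2 >> 2) */
    h2 &= 3;
    h0 += c;
    c = (h0 < c);                          /* carry out of h0 */
    h1 += c;
    h2 += (h1 < c);                        /* carry out of h1 */

    st->h[0] = h0;
    st->h[1] = h1;
    st->h[2] = h2;
}
```

A 3-limb representation with 44-bit limbs, by contrast, multiplies every limb of h by a limb of r or its pre-scaled counterpart, which gives 3 x 3 = 9 widening multiplications per block.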
A clarification to my previous message, where I mentioned that I had adjusted the benchmark results for U74: it turned out to be some weird power management effect that skewed the initial readings.
I also mentioned that a vector implementation was being developed. It's now committed. On a related note, the ChaCha20 vector implementation is also optimized, though it needs a little more work...
Cheers.