On Sun, 15 Jun 2025 at 11:47, Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
>
> So yes, QCE seems to have only one queue, and even that one queue is *much*
> slower than just using the CPU. It's even slower than the generic C code.

Honestly, I have *NEVER* seen an external crypto accelerator that is
worth using unless it's integrated with the target IO.

Now, it's not my area of expertise either, so there may well be some
random case that I haven't heard about, but the only sensible use-case
I'm aware of is when the network card does all the offloading and
handles the whole SSL thing (or IPsec or whatever, but if you care
about performance you'd be better off using wireguard and doing it all
on the CPU anyway).

And even then, people tend not to be happy with the results, because
the hardware is too inflexible or too rare.

(Replace "network card" with "disk controller" if that's your thing -
the basic idea is the same: it's worthwhile if it's done natively by
the IO target, not by some third-party accelerator - and while I'm
convinced encryption on the disk controller makes sense, I'm not sure
I'd actually *trust* it from a real cryptographic standpoint if you
really care about it, because some of those are most definitely black
boxes, with the trust model seemingly based on the "Trust me, Bro"
approach to security).

The other case is "the key is physically separate and isn't even under
kernel control at all", but then it's never about performance in the
first place (ie security keys etc).

Even if the hardware crypto engine is fast - and as you see, they
aren't - any possible performance win is absolutely killed by the lack
of caches and by the IO overhead.

This seems to be pretty much true of async SMP crypto on the CPU too.
You can get better benchmarks by offloading the crypto to other CPUs,
but I'm not convinced it's actually a good trade-off in reality. The
cost of scheduling and all the overhead of synchronization is very
real, and the benchmarks where it looks good tend to be of the "we do
nothing else, and we don't actually touch the data anyway, it's just
purely about pointless benchmarking" variety.

Just the set-up costs for doing things asynchronously can be higher
than the cost of just doing the operation itself.

              Linus
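
PS: to make that "set-up costs" point concrete, here is a rough sketch
(the wrapper functions and the "xts(aes)" choice are purely
illustrative, this isn't taken from any real driver) of what one
encryption looks like through the generic, async-capable skcipher
interface versus the plain synchronous library interface that
wireguard uses. Everything before crypto_wait_req() is per-request
bookkeeping that you pay whether or not the hardware behind it is any
good:

#include <crypto/skcipher.h>
#include <crypto/chacha20poly1305.h>
#include <linux/crypto.h>
#include <linux/err.h>
#include <linux/scatterlist.h>

/* Purely illustrative: one in-place encryption through the skcipher API. */
static int toy_skcipher_encrypt(u8 *buf, unsigned int len,
				const u8 *key, unsigned int keylen, u8 *iv)
{
	struct crypto_skcipher *tfm;
	struct skcipher_request *req;
	struct scatterlist sg;
	DECLARE_CRYPTO_WAIT(wait);
	int err;

	/* Allocate and key a transform (normally cached, but the
	 * per-request work below remains). */
	tfm = crypto_alloc_skcipher("xts(aes)", 0, 0);
	if (IS_ERR(tfm))
		return PTR_ERR(tfm);
	err = crypto_skcipher_setkey(tfm, key, keylen);
	if (err)
		goto out_free_tfm;

	/* Per-request bookkeeping: request object, scatterlist,
	 * completion callback. */
	req = skcipher_request_alloc(tfm, GFP_KERNEL);
	if (!req) {
		err = -ENOMEM;
		goto out_free_tfm;
	}
	sg_init_one(&sg, buf, len);
	skcipher_request_set_callback(req,
				      CRYPTO_TFM_REQ_MAY_BACKLOG |
				      CRYPTO_TFM_REQ_MAY_SLEEP,
				      crypto_req_done, &wait);
	skcipher_request_set_crypt(req, &sg, &sg, len, iv);

	/* Kick off the (possibly offloaded) request and sleep until the
	 * completion fires. For small buffers this round trip can cost
	 * more than the encryption itself. */
	err = crypto_wait_req(crypto_skcipher_encrypt(req), &wait);

	skcipher_request_free(req);
out_free_tfm:
	crypto_free_skcipher(tfm);
	return err;
}

/* By contrast, the synchronous library interface is a plain function
 * call on the local CPU, with the data hot in cache. (dst must have
 * room for the trailing authentication tag.) */
static void toy_lib_encrypt(u8 *dst, const u8 *src, size_t len, u64 nonce,
			    const u8 key[CHACHA20POLY1305_KEY_SIZE])
{
	chacha20poly1305_encrypt(dst, src, len, NULL, 0, nonce, key);
}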