On Thu, May 15, 2025 at 08:21:36PM +0100, David Laight wrote:
> On Sun, 11 May 2025 16:07:50 -0700
> Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
>
> > On Sun, May 11, 2025 at 11:45:14PM +0200, Ard Biesheuvel wrote:
> > > On Sun, 11 May 2025 at 23:22, Andrew Lunn <andrew@xxxxxxx> wrote:
> > > >
> > > > On Sun, May 11, 2025 at 10:29:29AM -0700, Eric Biggers wrote:
> > > > > On Sun, May 11, 2025 at 06:30:25PM +0200, Andrew Lunn wrote:
> > > > > > On Sat, May 10, 2025 at 05:41:00PM -0700, Eric Biggers wrote:
> > > > > > > Update networking code that computes the CRC32C of packets to just call
> > > > > > > crc32c() without unnecessary abstraction layers. The result is faster
> > > > > > > and simpler code.
> > > > > >
> > > > > > Hi Eric
> > > > > >
> > > > > > Do you have some benchmarks for these changes?
> > > > > >
> > > > > > Andrew
> > > > >
> > > > > Do you want benchmarks that show that removing the indirect calls makes things
> > > > > faster? I think that should be fairly self-evident by now after dealing with
> > > > > retpoline for years, but I can provide more details if you need them.
> > > >
> > > > I was think more like iperf before/after? Show the CPU load has gone
> > > > down without the bandwidth also going down.
> > > >
> > > > Eric Dumazet has a T-Shirt with a commit message on the back which
> > > > increased network performance by X%. At the moment, there is nothing
> > > > T-Shirt quotable here.
> > > >
> > >
> > > I think that removing layers of redundant code to ultimately call the
> > > same core CRC-32 implementation is a rather obvious win, especially
> > > when indirect calls are involved. The diffstat speaks for itself, so
> > > maybe you can print that on a T-shirt.
> >
> > Agreed with Ard. I did try doing some SCTP benchmarks with iperf3 earlier, but
> > they were very noisy and the CRC32C checksumming seemed to be lost in the noise.
> > There probably are some tricks to running reliable networking benchmarks; I'm
> > not a networking developer. Regardless, this series is a clear win for the
> > CRC32C code, both from a simplicity and performance perspective. It also fixes
> > the kconfig dependency issues. That should be good enough, IMO.
> >
> > In case it's helpful, here are some microbenchmarks of __skb_checksum (old) vs
> > skb_crc32c (new):
> >
> >     Linear sk_buffs
> >
> >         Length in bytes    __skb_checksum cycles    skb_crc32c cycles
> >         ===============    =====================    =================
> >                      64                       43                   18
> >                    1420                      204                  161
> >                   16384                     1735                 1642
> >
> >     Nonlinear sk_buffs (even split between head and one fragment)
> >
> >         Length in bytes    __skb_checksum cycles    skb_crc32c cycles
> >         ===============    =====================    =================
> >                      64                      579                   22
> >                    1420                     1506                  194
> >                   16384                     4365                 1682
> >
> > So 1420-byte linear buffers (roughly the most common case) is 21% faster,
>
> 1420 bytes is unlikely to be the most common case - at least for some users.
> SCTP is message oriented so the checksum is over a 'user message'.
> A non-uncommon use is carrying mobile network messages (eg SMS) over the IP
> network (instead of TDM links).
> In that case the maximum data chunk size (what is being checksummed) is limited
> to not much over 256 bytes - and a lot of data chunks will be smaller.
> The actual difficulty is getting multiple data chunks into a single ethernet
> packet without adding significant delays.
>
> But the changes definitely improve things.

Interesting. Of course, the data I gave shows that the proportional performance
increase is even greater on short packets than long ones. I'll include those
tables when I resend the patchset and add a row for 256 bytes too.

- Eric
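
For anyone who wants to check the proportions quoted above, here is a minimal Python sketch that recomputes them from the two tables. The cycle counts are copied from this thread; the names (LINEAR, NONLINEAR, percent_fewer_cycles, summarize) are made up for illustration and are not part of the patchset. It assumes "N% faster" means N% fewer cycles relative to __skb_checksum, which matches the 21% figure for 1420-byte linear buffers.

```python
# Cycle counts copied from the tables in this thread:
# (length in bytes, __skb_checksum cycles, skb_crc32c cycles).
LINEAR = [(64, 43, 18), (1420, 204, 161), (16384, 1735, 1642)]
NONLINEAR = [(64, 579, 22), (1420, 1506, 194), (16384, 4365, 1682)]


def percent_fewer_cycles(old, new):
    """Return the reduction in cycles as a percentage of the old count."""
    return 100.0 * (old - new) / old


def summarize(rows):
    """Map each buffer length to its percentage reduction, to one decimal."""
    return {length: round(percent_fewer_cycles(old, new), 1)
            for length, old, new in rows}


if __name__ == "__main__":
    for name, rows in (("linear", LINEAR), ("nonlinear", NONLINEAR)):
        for length, pct in summarize(rows).items():
            print(f"{name:9s} {length:6d} bytes: {pct:5.1f}% fewer cycles")
```

Under that assumption, the linear rows come out to about 58.1%, 21.1%, and 5.4% fewer cycles for 64, 1420, and 16384 bytes, and the nonlinear rows to about 96.2%, 87.1%, and 61.5%, so the gain does shrink as the length grows.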