Re: [PATCH net-next 00/10] net: faster and simpler CRC32C computation

Eric Biggers <ebiggers@xxxxxxxxxx> · Thu, 15 May 2025 12:50:51 -0700

On Thu, May 15, 2025 at 08:21:36PM +0100, David Laight wrote:
> On Sun, 11 May 2025 16:07:50 -0700
> Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
> 
> > On Sun, May 11, 2025 at 11:45:14PM +0200, Ard Biesheuvel wrote:
> > > On Sun, 11 May 2025 at 23:22, Andrew Lunn <andrew@xxxxxxx> wrote:  
> > > >
> > > > On Sun, May 11, 2025 at 10:29:29AM -0700, Eric Biggers wrote:  
> > > > > On Sun, May 11, 2025 at 06:30:25PM +0200, Andrew Lunn wrote:  
> > > > > > On Sat, May 10, 2025 at 05:41:00PM -0700, Eric Biggers wrote:  
> > > > > > > Update networking code that computes the CRC32C of packets to just call
> > > > > > > crc32c() without unnecessary abstraction layers.  The result is faster
> > > > > > > and simpler code.  
> > > > > >
> > > > > > Hi Eric
> > > > > >
> > > > > > Do you have some benchmarks for these changes?
> > > > > >
> > > > > >     Andrew  
> > > > >
> > > > > Do you want benchmarks that show that removing the indirect calls makes things
> > > > > faster?  I think that should be fairly self-evident by now after dealing with
> > > > > retpoline for years, but I can provide more details if you need them.  
> > > >
> > > > I was thinking more like iperf before/after? Show the CPU load has gone
> > > > down without the bandwidth also going down.
> > > >
> > > > Eric Dumazet has a T-Shirt with a commit message on the back which
> > > > increased network performance by X%. At the moment, there is nothing
> > > > T-Shirt quotable here.
> > > >  
> > > 
> > > I think that removing layers of redundant code to ultimately call the
> > > same core CRC-32 implementation is a rather obvious win, especially
> > > when indirect calls are involved. The diffstat speaks for itself, so
> > > maybe you can print that on a T-shirt.  
> > 
> > Agreed with Ard.  I did try doing some SCTP benchmarks with iperf3 earlier, but
> > they were very noisy and the CRC32C checksumming seemed to be lost in the noise.
> > There probably are some tricks to running reliable networking benchmarks; I'm
> > not a networking developer.  Regardless, this series is a clear win for the
> > CRC32C code, both from a simplicity and performance perspective.  It also fixes
> > the kconfig dependency issues.  That should be good enough, IMO.
> > 
> > In case it's helpful, here are some microbenchmarks of __skb_checksum (old) vs
> > skb_crc32c (new):
> > 
> >     Linear sk_buffs
> > 
> >         Length in bytes    __skb_checksum cycles    skb_crc32c cycles
> >         ===============    =====================    =================
> >                      64                       43                   18
> >                    1420                      204                  161
> >                   16384                     1735                 1642
> > 
> >     Nonlinear sk_buffs (even split between head and one fragment)
> > 
> >         Length in bytes    __skb_checksum cycles    skb_crc32c cycles
> >         ===============    =====================    =================
> >                      64                      579                   22
> >                    1420                     1506                  194
> >                   16384                     4365                 1682
> > 
> > So 1420-byte linear buffers (roughly the most common case) is 21% faster,
> 
> 1420 bytes is unlikely to be the most common case - at least for some users.
> SCTP is message oriented so the checksum is over a 'user message'.
> A not-uncommon use is carrying mobile network messages (e.g. SMS) over the IP
> network (instead of TDM links).
> In that case the maximum data chunk size (what is being checksummed) is limited
> to not much over 256 bytes - and a lot of data chunks will be smaller.
> The actual difficulty is getting multiple data chunks into a single ethernet
> packet without adding significant delays.
> 
> But the changes definitely improve things.

Interesting.  Of course, the data I gave shows that the proportional performance
increase is even greater on short packets than long ones.  I'll include those
tables when I resend the patchset and add a row for 256 bytes too.
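
For anyone who wants to check the proportions, the percentages can be recomputed
from the two tables quoted above with a short script.  This is only a sketch for
doing the arithmetic, not part of the patchset; the names LINEAR, NONLINEAR,
cycle_reduction_pct and reductions are made up for illustration, and the cycle
counts are copied verbatim from the tables.

```python
# Cycle counts copied from the microbenchmark tables quoted in this thread.
# Each row: (length in bytes, __skb_checksum cycles, skb_crc32c cycles)
LINEAR = [(64, 43, 18), (1420, 204, 161), (16384, 1735, 1642)]
NONLINEAR = [(64, 579, 22), (1420, 1506, 194), (16384, 4365, 1682)]


def cycle_reduction_pct(old, new):
    """Return the percentage of cycles saved when going from old to new."""
    return 100.0 * (old - new) / old


def reductions(rows):
    """Map each table row to (length, percentage of cycles saved)."""
    return [(length, cycle_reduction_pct(old, new)) for length, old, new in rows]


if __name__ == "__main__":
    for name, rows in (("linear", LINEAR), ("nonlinear", NONLINEAR)):
        for length, pct in reductions(rows):
            print(f"{name:9s} {length:6d} bytes: {pct:5.1f}% fewer cycles")
```

In both tables the saving shrinks as the length grows, which is the "greater on
short packets" effect; the 1420-byte linear row works out to the 21% figure
mentioned earlier.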

- Eric