On Mon, 11 Aug 2025, Phillip Wood wrote: > > That's what the 'cycles' column in the table gives (6.21/5.8 = 1.070...) > > It would be helpful to add a column with those calculations in it rather than > forcing the reader to calculate the speed up for themselves. Ok, will change it to version | speedup over (A) | cycles, bn | instructions, bn ---------------------------------------------------------- A 6.38 11.3 B 1.027 6.21 10.89 C 1.1 5.80 9.95 D 1.094 5.83 8.74 ---------------------------------------------------------- > Also what is the cycles column measuring? What is it that takes 6.21 cycles > for B and only 5.8 cycles for C? Billions of cycles, e.g. in C the entire command completes in 5.8e9 CPU cycles. > > Then you get 9% from the inlining patch and only 2% from the faster hash > > function? That's a bit surprising, which compiler and CPU you used? Is it > > with default optimization (-O2)? > > I used gcc with -O2 -march=native on an i5-8500. I saw a similar improvement > from the inlining when I was playing with xxhash. Thanks, I'll see if I can benchmark it on a Skylake in the coming days. That said, I think most users will get Git from their distro, without -march=native, right? So I'd suggest looking at plain -O2, especially for xxhash, which selects hashing primitives based on CPU-indicating predefined macros. > > I'd say under reasonable assumptions (e.g. a not too ancient CPU with > > 3-cycle integer multiplication) the new scheme is generally faster even > > without asm. > > Thanks, fwiw I don't see a measurable difference in the timings with and > without the asm on my machine - To be clear, by "without asm" you mean forcing the !__GNUC__ branch where REASSOC_FENCE macro is empty? > sometimes one is faster, sometimes the other, any difference is within the > noise. Would you mind showing your 'gcc --version'? Also, I prefer 'perf stat' for such measurements, because its measurements are not so sensitive to frequency scaling (plus, you can compare my cycles/instructions counts with yours if you run 'perf stat', but I cannot compare your seconds from hyperfine with mine because of course my CPU runs at a different frequency than yours). 'perf stat -r 5' runs the workload 5 times and prints averages and deviation. > > No, what we need to do here is outside of the abstract machine's view, > > standard functions are not going to help. > > That's a shame. I'd hoped that stopping the compiler reorder the code would do > the same thing - what is the asm doing that's different? atomic_signal_fence only blocks reordering of references to memory that can be observed from a signal handler interrupting the current thread. It has no effect on variables whose addresses do not escape (let alone never taken in the first place). Here we want to force a particular evaluation order for variables that end up on registers and are not supposed to appear in memory at all. Alexander