Re: [PATCH 2/2] xdiff: optimize xdl_hash_record_verbatim

Alexander Monakov <amonakov@xxxxxxxxx> · Mon, 11 Aug 2025 17:14:39 +0300 (MSK)

On Mon, 11 Aug 2025, Phillip Wood wrote:

> > That's what the 'cycles' column in the table gives (6.21/5.8 = 1.070...)
> 
> It would be helpful to add a column with those calculations in it rather than
> forcing the reader to calculate the speed up for themselves.

Ok, will change it to

version | speedup over (A) | cycles, bn | instructions, bn
----------------------------------------------------------
A                            6.38         11.3
B         1.027              6.21         10.89
C         1.1                5.80          9.95
D         1.094              5.83          8.74
----------------------------------------------------------

> Also what is the cycles column measuring? What is it that takes 6.21 cycles
> for B and only 5.8 cycles for C?

Billions of cycles, e.g. in C the entire command completes in 5.8e9 CPU cycles.

> > Then you get 9% from the inlining patch and only 2% from the faster hash
> > function? That's a bit surprising, which compiler and CPU you used? Is it
> > with default optimization (-O2)?
> 
> I used gcc with -O2 -march=native on an i5-8500. I saw a similar improvement
> from the inlining when I was playing with xxhash.

Thanks, I'll see if I can benchmark it on a Skylake in the coming days. That
said, I think most users will get Git from their distro, without -march=native,
right? So I'd suggest looking at plain -O2, especially for xxhash, which
selects hashing primitives based on CPU-indicating predefined macros.

> > I'd say under reasonable assumptions (e.g. a not too ancient CPU with
> > 3-cycle integer multiplication) the new scheme is generally faster even
> > without asm.
> 
> Thanks, fwiw I don't see a measurable difference in the timings with and
> without the asm on my machine -

To be clear, by "without asm" you mean forcing the !__GNUC__ branch where
REASSOC_FENCE macro is empty?

> sometimes one is faster, sometimes the other, any difference is within the
> noise.

Would you mind showing your 'gcc --version'? Also, I prefer 'perf stat' for
such measurements, because its measurements are not so sensitive to frequency
scaling (plus, you can compare my cycles/instructions counts with yours if you
run 'perf stat', but I cannot compare your seconds from hyperfine with mine
because of course my CPU runs at a different frequency than yours).

'perf stat -r 5' runs the workload 5 times and prints averages and deviation.

> > No, what we need to do here is outside of the abstract machine's view,
> > standard functions are not going to help.
> 
> That's a shame. I'd hoped that stopping the compiler reorder the code would do
> the same thing - what is the asm doing that's different?

atomic_signal_fence only blocks reordering of references to memory that can be
observed from a signal handler interrupting the current thread. It has no effect
on variables whose addresses do not escape (let alone never taken in the first
place). Here we want to force a particular evaluation order for variables that
end up on registers and are not supposed to appear in memory at all.

Alexander