Re: [PATCH 2/2] xdiff: optimize xdl_hash_record_verbatim

Phillip Wood <phillip.wood123@xxxxxxxxx> · Wed, 13 Aug 2025 14:10:56 +0100

Hi Alexander

On 11/08/2025 15:14, Alexander Monakov wrote:

On Mon, 11 Aug 2025, Phillip Wood wrote:

That's what the 'cycles' column in the table gives (6.21/5.8 = 1.070...)

It would be helpful to add a column with those calculations in it rather than
forcing the reader to calculate the speed up for themselves.

Ok, will change it to

version | speedup over (A) | cycles, bn | instructions, bn
----------------------------------------------------------
A                            6.38         11.3
B         1.027              6.21         10.89
C         1.1                5.80          9.95
D         1.094              5.83          8.74
----------------------------------------------------------

That looks good, thanks

Also what is the cycles column measuring? What is it that takes 6.21 cycles
for B and only 5.8 cycles for C?

Billions of cycles, e.g. in C the entire command completes in 5.8e9 CPU cycles.

Ah, for some reason I'd not realized than bn was short for billion
Then you get 9% from the inlining patch and only 2% from the faster hash
function? That's a bit surprising, which compiler and CPU you used? Is it
with default optimization (-O2)?

I used gcc with -O2 -march=native on an i5-8500. I saw a similar improvement
from the inlining when I was playing with xxhash.

Thanks, I'll see if I can benchmark it on a Skylake in the coming days. That
said, I think most users will get Git from their distro, without -march=native,
right? So I'd suggest looking at plain -O2, especially for xxhash, which
selects hashing primitives based on CPU-indicating predefined macros.

For xxhash I was using the system library rather than compiling it myself
I'd say under reasonable assumptions (e.g. a not too ancient CPU with
3-cycle integer multiplication) the new scheme is generally faster even
without asm.

Thanks, fwiw I don't see a measurable difference in the timings with and
without the asm on my machine -

To be clear, by "without asm" you mean forcing the !__GNUC__ branch where
REASSOC_FENCE macro is empty?

Exactly

sometimes one is faster, sometimes the other, any difference is within the
noise.

Would you mind showing your 'gcc --version'?

gcc (Debian 12.2.0-14+deb12u1) 12.2.0

Also, I prefer 'perf stat' for
such measurements, because its measurements are not so sensitive to frequency
scaling (plus, you can compare my cycles/instructions counts with yours if you
run 'perf stat', but I cannot compare your seconds from hyperfine with mine
because of course my CPU runs at a different frequency than yours).

'perf stat -r 5' runs the workload 5 times and prints averages and deviation.

I'll try and take a look at that though I'm off line next week and I'm 
not sure I'll have time before then.
No, what we need to do here is outside of the abstract machine's view,
standard functions are not going to help.

That's a shame. I'd hoped that stopping the compiler reorder the code would do
the same thing - what is the asm doing that's different?

atomic_signal_fence only blocks reordering of references to memory that can be
observed from a signal handler interrupting the current thread. It has no effect
on variables whose addresses do not escape (let alone never taken in the first
place). Here we want to force a particular evaluation order for variables that
end up on registers and are not supposed to appear in memory at all.

Ah, that makes sense

Thanks

Phillip