Re: [PATCH 2/2] xdiff: optimize xdl_hash_record_verbatim

Phillip Wood <phillip.wood123@xxxxxxxxx> · Mon, 11 Aug 2025 14:13:26 +0100

On 04/08/2025 15:39, Alexander Monakov wrote:
On Mon, 4 Aug 2025, Phillip Wood wrote:

Switch xdl_hash_record_verbatim to additive hashing and implement
an optimized loop following the scheme suggested by Noah.

Timing 'git log --oneline --shortstat v2.0.0..v2.5.0' under perf, I got

version | cycles, bn | instructions, bn
---------------------------------------
A         6.38         11.3
B         6.21         10.89
C         5.80          9.95
D         5.83          8.74
---------------------------------------

A: baseline (git master at e4ef0485fd78)
B: plus 'xdiff: refactor xdl_hash_record()'
C: and plus this patch
D: with 'xdiff: use xxhash' by Phillip Wood

I think it would be helpful to say that B is the previous patch and provide a
link for D.

Ok, reworded locally, will appear in v2.

Thanks

The resulting speedup for xdl_hash_record_verbatim itself is about 1.5x.

While that's interesting it does not tell us how much this speeds up diff
generation.

That's what the 'cycles' column in the table gives (6.21/5.8 = 1.070...)

It would be helpful to add a column with those calculations in it rather 
than forcing the reader to calculate the speed up for themselves. Also 
what is the cycles column measuring? What is it that takes 6.21 cycles 
for B and only 5.8 cycles for C?

Running the command above under hyperfine it is 1.02 ± 0.01 times
faster than the previous patch and 1.11 ± 0.01 times faster than master.

Then you get 9% from the inlining patch and only 2% from the faster hash
function? That's a bit surprising, which compiler and CPU you used? Is it
with default optimization (-O2)?

I used gcc with -O2 -march=native on an i5-8500. I saw a similar 
improvement from the inlining when I was playing with xxhash.

Using
xxhash (D above) is 1.03 ± 0.01 times faster than this patch. How do the
changes below affect compilers other than gcc and clang than do not see the
re-association barrier?

I'd say under reasonable assumptions (e.g. a not too ancient CPU with 3-cycle
integer multiplication) the new scheme is generally faster even without asm.

Thanks, fwiw I don't see a measurable difference in the timings with and 
without the asm on my machine - sometimes one is faster, sometimes the 
other, any difference is within the noise.

But Git can certainly follow Glibc's choice and employ this only on x86_64
(and only with GCC or Clang).

We'd want to make sure that it does not result in
slower diffs. Can we use atomic_signal_fence() on compilers that support C11?

No, what we need to do here is outside of the abstract machine's view, standard
functions are not going to help.

That's a shame. I'd hoped that stopping the compiler reorder the code 
would do the same thing - what is the asm doing that's different?

Thanks

Phillip