Karthik Nayak <karthik.188@xxxxxxxxx> writes:

> The 'git bundle create' command has non-linear performance with the
> number of refs in the repository. Benchmarking the command shows that
> a large portion of the time (~75%) is spent in the
> `object_array_remove_duplicates()` function.
>
> The `object_array_remove_duplicates()` function was added in
> b2a6d1c686 (bundle: allow the same ref to be given more than once,
> 2009-01-17) to skip duplicate refs provided by the user from being
> written to the bundle. Since this is an O(N^2) algorithm, in repos
> with a large number of references, this can take up a large amount
> of time.
>
> Let's instead use a 'strset' to skip duplicates inside
> `write_bundle_refs()`. This improves the performance by around 6
> times when tested against a repository with 100000 refs:
>
>   Benchmark 1: bundle (refcount = 100000, revision = master)
>     Time (mean ± σ):     14.653 s ±  0.203 s    [User: 13.940 s, System: 0.762 s]
>     Range (min … max):   14.237 s … 14.920 s    10 runs
>
>   Benchmark 2: bundle (refcount = 100000, revision = HEAD)
>     Time (mean ± σ):      2.394 s ±  0.023 s    [User: 1.684 s, System: 0.798 s]
>     Range (min … max):    2.364 s …  2.425 s    10 runs
>
>   Summary
>     bundle (refcount = 100000, revision = HEAD) ran
>       6.12 ± 0.10 times faster than bundle (refcount = 100000, revision = master)

I've done some benchmarking with a few "real life" repositories, which
only have a couple of thousand refs, and there the difference is
(expectedly) barely noticeable. It is good to know that there is no
regression in that case either.

This version looks good to me; I approve.

-- 
Toon
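
P.S. for list readers who have not opened the series itself: the shape
of the strset-based dedup is roughly the following. This is only a
sketch against the strset API declared in strmap.h, not the code from
the patch; write_refs_once() and the "write it to the bundle" step are
made-up stand-ins for what write_bundle_refs() actually does.

  #include "git-compat-util.h"
  #include "strmap.h"

  /*
   * Illustrative only: skip duplicate refnames in a single O(N) pass.
   * strset_add() returns 1 only the first time a given string is
   * added, so a repeated ref is simply skipped instead of being weeded
   * out afterwards by the O(N^2) object_array_remove_duplicates()
   * scan.
   */
  static void write_refs_once(const char **refnames, size_t nr)
  {
          struct strset seen;
          size_t i;

          strset_init(&seen);
          for (i = 0; i < nr; i++) {
                  if (!strset_add(&seen, refnames[i]))
                          continue;       /* duplicate, already written */
                  /* ... write refnames[i] to the bundle here ... */
          }
          strset_clear(&seen);
  }

If I read strmap.h correctly, the set keeps its own copies of the keys
by default, so the caller does not have to worry about the lifetime of
the refnames it feeds in.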