Toon Claes <toon@xxxxxxxxx> writes: > Karthik Nayak <karthik.188@xxxxxxxxx> writes: > >> The 'git bundle create' command has non-linear performance with the >> number of refs in the repository. Benchmarking the command shows that >> a large portion of the time (~75%) is spent in the >> `object_array_remove_duplicates()` function. >> >> The `object_array_remove_duplicates()` function was added in >> b2a6d1c686 (bundle: allow the same ref to be given more than once, >> 2009-01-17) to skip duplicate refs provided by the user from being >> written to the bundle. Since this is an O(N^2) algorithm, in repos with >> large number of references, this can take up a large amount of time. >> >> Let's instead use a 'strset' to skip duplicates inside >> `write_bundle_refs()`. This improves the performance by around 6 times >> when tested against in repository with 100000 refs: >> >> Benchmark 1: bundle (refcount = 100000, revision = master) >> Time (mean ± σ): 14.653 s ± 0.203 s [User: 13.940 s, System: 0.762 s] >> Range (min … max): 14.237 s … 14.920 s 10 runs >> >> Benchmark 2: bundle (refcount = 100000, revision = HEAD) >> Time (mean ± σ): 2.394 s ± 0.023 s [User: 1.684 s, System: 0.798 s] >> Range (min … max): 2.364 s … 2.425 s 10 runs >> >> Summary >> bundle (refcount = 100000, revision = HEAD) ran >> 6.12 ± 0.10 times faster than bundle (refcount = 100000, revision = master) > > That's a good find! > >> Previously, `object_array_remove_duplicates()` ensured that both the >> refname and the object it pointed to were checked for duplicates. The >> new approach, implemented within `write_bundle_refs()`, eliminates >> duplicate refnames without comparing the objects they reference. This >> works because, for bundle creation, we only need to prevent duplicate >> refs from being written to the bundle header. The `revs->pending` array >> can contain duplicates of multiple types. > > Makes sense to me. > >> First, references which resolve to the same refname. For e.g. "git >> bundle create out.bdl master master" or "git bundle create out.bdl >> refs/heads/master refs/heads/master" or "git bundle create out.bdl >> master refs/heads/master". In these scenarios we want to prevent writing >> "refs/heads/master" twice to the bundle header. Since both the refnames >> here would point to the same object (unless there is a race), we do not >> need to check equality of the object. > > Yeah, we can never be sure about the changes that happen while the > bundle is being created. I fixed another race[1] recently which also was > comparing equality of the object, that causes the ref to be omitted. We > can only act by "best effort" and having the ref point to /some/ object > is the best we can do. > > [1]: https://lore.kernel.org/git/20241211-fix-bundle-create-race-v3-1-0587f6f9db1b@xxxxxxxxx/ > >> Second, refnames which are duplicates but do not point to the same >> object. This can happen when we use an exclusion criteria. For e.g. "git >> bundle create out.bdl master master^!", Here `revs->pending` would >> contain two elements, both with refname set to "master". However, each >> of them would be pointing to an INTERESTING and UNINTERESTING object >> respectively. Since we only write refnames with INTERESTING objects to >> the bundle header, we perform our duplicate checks only on such >> objects. > > Thanks for that context, I didn't consider that. > I didn't at first, but luckily we have a test for such refs, which got tripped and allowed me to consider that scenario too! >> Signed-off-by: Karthik Nayak <karthik.188@xxxxxxxxx> >> --- >> bundle.c | 10 +++++++++- >> object.c | 33 --------------------------------- >> object.h | 6 ------ >> t/t6020-bundle-misc.sh | 4 ---- >> 4 files changed, 9 insertions(+), 44 deletions(-) >> >> diff --git a/bundle.c b/bundle.c >> index d7ad690843..30cfba0be2 100644 >> --- a/bundle.c >> +++ b/bundle.c >> @@ -384,6 +384,9 @@ static int write_bundle_refs(int bundle_fd, struct rev_info *revs) >> { >> int i; >> int ref_count = 0; >> + struct strset objects; >> + >> + strset_init(&objects); > > Any reason why you're not using the `STRMAP_INIT` macro? > None. This should be the ideal way to do it, will change in the next version. >> >> for (i = 0; i < revs->pending.nr; i++) { >> struct object_array_entry *e = revs->pending.objects + i; >> @@ -401,6 +404,9 @@ static int write_bundle_refs(int bundle_fd, struct rev_info *revs) >> flag = 0; >> display_ref = (flag & REF_ISSYMREF) ? e->name : ref; >> >> + if (strset_contains(&objects, display_ref)) >> + goto skip_write_ref; >> + >> if (e->item->type == OBJ_TAG && >> !is_tag_in_date_range(e->item, revs)) { >> e->item->flags |= UNINTERESTING; >> @@ -423,6 +429,7 @@ static int write_bundle_refs(int bundle_fd, struct rev_info *revs) >> } >> >> ref_count++; >> + strset_add(&objects, display_ref); >> write_or_die(bundle_fd, oid_to_hex(&e->item->oid), the_hash_algo->hexsz); >> write_or_die(bundle_fd, " ", 1); >> write_or_die(bundle_fd, display_ref, strlen(display_ref)); >> @@ -431,6 +438,8 @@ static int write_bundle_refs(int bundle_fd, struct rev_info *revs) >> free(ref); >> } >> >> + strset_clear(&objects); >> + >> /* end header */ >> write_or_die(bundle_fd, "\n", 1); >> return ref_count; >> @@ -566,7 +575,6 @@ int create_bundle(struct repository *r, const char *path, >> */ >> revs.blob_objects = revs.tree_objects = 0; >> traverse_commit_list(&revs, write_bundle_prerequisites, NULL, &bpi); >> - object_array_remove_duplicates(&revs_copy.pending); >> >> /* write bundle refs */ >> ref_count = write_bundle_refs(bundle_fd, &revs_copy); >> diff --git a/object.c b/object.c >> index 100bf9b8d1..a2c5986178 100644 >> --- a/object.c >> +++ b/object.c >> @@ -491,39 +491,6 @@ void object_array_clear(struct object_array *array) >> array->nr = array->alloc = 0; >> } >> >> -/* >> - * Return true if array already contains an entry. >> - */ >> -static int contains_object(struct object_array *array, >> - const struct object *item, const char *name) >> -{ >> - unsigned nr = array->nr, i; >> - struct object_array_entry *object = array->objects; >> - >> - for (i = 0; i < nr; i++, object++) >> - if (item == object->item && !strcmp(object->name, name)) >> - return 1; >> - return 0; >> -} >> - >> -void object_array_remove_duplicates(struct object_array *array) >> -{ >> - unsigned nr = array->nr, src; >> - struct object_array_entry *objects = array->objects; >> - >> - array->nr = 0; >> - for (src = 0; src < nr; src++) { >> - if (!contains_object(array, objects[src].item, >> - objects[src].name)) { >> - if (src != array->nr) >> - objects[array->nr] = objects[src]; >> - array->nr++; >> - } else { >> - object_array_release_entry(&objects[src]); >> - } >> - } >> -} >> - >> void clear_object_flags(unsigned flags) >> { >> int i; >> diff --git a/object.h b/object.h >> index 17f32f1103..0e12c75922 100644 >> --- a/object.h >> +++ b/object.h >> @@ -324,12 +324,6 @@ typedef int (*object_array_each_func_t)(struct object_array_entry *, void *); >> void object_array_filter(struct object_array *array, >> object_array_each_func_t want, void *cb_data); >> >> -/* >> - * Remove from array all but the first entry with a given name. >> - * Warning: this function uses an O(N^2) algorithm. > > Funny this has been here for more than 10 years. Thanks for this cleanup. > >> - */ >> -void object_array_remove_duplicates(struct object_array *array); >> - >> /* >> * Remove any objects from the array, freeing all used memory; afterwards >> * the array is ready to store more objects with add_object_array(). >> diff --git a/t/t6020-bundle-misc.sh b/t/t6020-bundle-misc.sh >> index dd09df1287..500c81b8a1 100755 >> --- a/t/t6020-bundle-misc.sh >> +++ b/t/t6020-bundle-misc.sh >> @@ -684,7 +684,6 @@ test_expect_success 'create bundle with duplicate refnames' ' >> test_cmp expect actual >> ' >> >> -# This exhibits a bug, since the same refname is now added to the bundle twice. >> test_expect_success 'create bundle with duplicate refnames and --all' ' >> git bundle create out.bdl --all "main" "main" && >> >> @@ -701,7 +700,6 @@ test_expect_success 'create bundle with duplicate refnames and --all' ' >> <TAG-2> refs/tags/v2 >> <TAG-3> refs/tags/v3 >> <COMMIT-P> HEAD >> - <COMMIT-P> refs/heads/main >> EOF >> test_cmp expect actual >> ' >> @@ -717,7 +715,6 @@ test_expect_success 'create bundle with duplicate exlusion refnames' ' >> test_cmp expect actual >> ' >> >> -# This exhibits a bug, since the same refname is now added to the bundle twice. >> test_expect_success 'create bundle with duplicate refname short-form' ' >> git bundle create out.bdl "main" "main" "refs/heads/main" "refs/heads/main" && >> >> @@ -725,7 +722,6 @@ test_expect_success 'create bundle with duplicate refname short-form' ' >> make_user_friendly_and_stable_output >actual && >> cat >expect <<-\EOF && >> <COMMIT-P> refs/heads/main >> - <COMMIT-P> refs/heads/main >> EOF >> test_cmp expect actual >> ' > > Great work on the alternative implmentation. And thanks for adding these > tests and actually fixing them. I've been manually testing a few more > edge cases, I couldn't find any other scenario that's not covered by the > current implementation. > > I approve. > > -- > Toon Appreciate you taking the time and exploring all scenarios. Thanks for the review. Karthik
Attachment:
signature.asc
Description: PGP signature