Re: [PATCH 2/2] bundle: fix non-linear performance scaling with refs

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Toon Claes <toon@xxxxxxxxx> writes:

> Karthik Nayak <karthik.188@xxxxxxxxx> writes:
>
>> The 'git bundle create' command has non-linear performance with the
>> number of refs in the repository. Benchmarking the command shows that
>> a large portion of the time (~75%) is spent in the
>> `object_array_remove_duplicates()` function.
>>
>> The `object_array_remove_duplicates()` function was added in
>> b2a6d1c686 (bundle: allow the same ref to be given more than once,
>> 2009-01-17) to skip duplicate refs provided by the user from being
>> written to the bundle. Since this is an O(N^2) algorithm, in repos with
>> large number of references, this can take up a large amount of time.
>>
>> Let's instead use a 'strset' to skip duplicates inside
>> `write_bundle_refs()`. This improves the performance by around 6 times
>> when tested against in repository with 100000 refs:
>>
>> Benchmark 1: bundle (refcount = 100000, revision = master)
>>   Time (mean ± σ):     14.653 s ±  0.203 s    [User: 13.940 s, System: 0.762 s]
>>   Range (min … max):   14.237 s … 14.920 s    10 runs
>>
>> Benchmark 2: bundle (refcount = 100000, revision = HEAD)
>>   Time (mean ± σ):      2.394 s ±  0.023 s    [User: 1.684 s, System: 0.798 s]
>>   Range (min … max):    2.364 s …  2.425 s    10 runs
>>
>> Summary
>>   bundle (refcount = 100000, revision = HEAD) ran
>>     6.12 ± 0.10 times faster than bundle (refcount = 100000, revision = master)
>
> That's a good find!
>
>> Previously, `object_array_remove_duplicates()` ensured that both the
>> refname and the object it pointed to were checked for duplicates. The
>> new approach, implemented within `write_bundle_refs()`, eliminates
>> duplicate refnames without comparing the objects they reference. This
>> works because, for bundle creation, we only need to prevent duplicate
>> refs from being written to the bundle header. The `revs->pending` array
>> can contain duplicates of multiple types.
>
> Makes sense to me.
>
>> First, references which resolve to the same refname. For e.g. "git
>> bundle create out.bdl master master" or "git bundle create out.bdl
>> refs/heads/master refs/heads/master" or "git bundle create out.bdl
>> master refs/heads/master". In these scenarios we want to prevent writing
>> "refs/heads/master" twice to the bundle header. Since both the refnames
>> here would point to the same object (unless there is a race), we do not
>> need to check equality of the object.
>
> Yeah, we can never be sure about the changes that happen while the
> bundle is being created. I fixed another race[1] recently which also was
> comparing equality of the object, that causes the ref to be omitted. We
> can only act by "best effort" and having the ref point to /some/ object
> is the best we can do.
>
> [1]: https://lore.kernel.org/git/20241211-fix-bundle-create-race-v3-1-0587f6f9db1b@xxxxxxxxx/
>
>> Second, refnames which are duplicates but do not point to the same
>> object. This can happen when we use an exclusion criteria. For e.g. "git
>> bundle create out.bdl master master^!", Here `revs->pending` would
>> contain two elements, both with refname set to "master". However, each
>> of them would be pointing to an INTERESTING and UNINTERESTING object
>> respectively. Since we only write refnames with INTERESTING objects to
>> the bundle header, we perform our duplicate checks only on such
>> objects.
>
> Thanks for that context, I didn't consider that.
>

I didn't at first, but luckily we have a test for such refs, which got
tripped and allowed me to consider that scenario too!

>> Signed-off-by: Karthik Nayak <karthik.188@xxxxxxxxx>
>> ---
>>  bundle.c               | 10 +++++++++-
>>  object.c               | 33 ---------------------------------
>>  object.h               |  6 ------
>>  t/t6020-bundle-misc.sh |  4 ----
>>  4 files changed, 9 insertions(+), 44 deletions(-)
>>
>> diff --git a/bundle.c b/bundle.c
>> index d7ad690843..30cfba0be2 100644
>> --- a/bundle.c
>> +++ b/bundle.c
>> @@ -384,6 +384,9 @@ static int write_bundle_refs(int bundle_fd, struct rev_info *revs)
>>  {
>>  	int i;
>>  	int ref_count = 0;
>> +	struct strset objects;
>> +
>> +	strset_init(&objects);
>
> Any reason why you're not using the `STRMAP_INIT` macro?
>

None. This should be the ideal way to do it, will change in the next
version.

>>
>>  	for (i = 0; i < revs->pending.nr; i++) {
>>  		struct object_array_entry *e = revs->pending.objects + i;
>> @@ -401,6 +404,9 @@ static int write_bundle_refs(int bundle_fd, struct rev_info *revs)
>>  			flag = 0;
>>  		display_ref = (flag & REF_ISSYMREF) ? e->name : ref;
>>
>> +		if (strset_contains(&objects, display_ref))
>> +			goto skip_write_ref;
>> +
>>  		if (e->item->type == OBJ_TAG &&
>>  				!is_tag_in_date_range(e->item, revs)) {
>>  			e->item->flags |= UNINTERESTING;
>> @@ -423,6 +429,7 @@ static int write_bundle_refs(int bundle_fd, struct rev_info *revs)
>>  		}
>>
>>  		ref_count++;
>> +		strset_add(&objects, display_ref);
>>  		write_or_die(bundle_fd, oid_to_hex(&e->item->oid), the_hash_algo->hexsz);
>>  		write_or_die(bundle_fd, " ", 1);
>>  		write_or_die(bundle_fd, display_ref, strlen(display_ref));
>> @@ -431,6 +438,8 @@ static int write_bundle_refs(int bundle_fd, struct rev_info *revs)
>>  		free(ref);
>>  	}
>>
>> +	strset_clear(&objects);
>> +
>>  	/* end header */
>>  	write_or_die(bundle_fd, "\n", 1);
>>  	return ref_count;
>> @@ -566,7 +575,6 @@ int create_bundle(struct repository *r, const char *path,
>>  	 */
>>  	revs.blob_objects = revs.tree_objects = 0;
>>  	traverse_commit_list(&revs, write_bundle_prerequisites, NULL, &bpi);
>> -	object_array_remove_duplicates(&revs_copy.pending);
>>
>>  	/* write bundle refs */
>>  	ref_count = write_bundle_refs(bundle_fd, &revs_copy);
>> diff --git a/object.c b/object.c
>> index 100bf9b8d1..a2c5986178 100644
>> --- a/object.c
>> +++ b/object.c
>> @@ -491,39 +491,6 @@ void object_array_clear(struct object_array *array)
>>  	array->nr = array->alloc = 0;
>>  }
>>
>> -/*
>> - * Return true if array already contains an entry.
>> - */
>> -static int contains_object(struct object_array *array,
>> -			   const struct object *item, const char *name)
>> -{
>> -	unsigned nr = array->nr, i;
>> -	struct object_array_entry *object = array->objects;
>> -
>> -	for (i = 0; i < nr; i++, object++)
>> -		if (item == object->item && !strcmp(object->name, name))
>> -			return 1;
>> -	return 0;
>> -}
>> -
>> -void object_array_remove_duplicates(struct object_array *array)
>> -{
>> -	unsigned nr = array->nr, src;
>> -	struct object_array_entry *objects = array->objects;
>> -
>> -	array->nr = 0;
>> -	for (src = 0; src < nr; src++) {
>> -		if (!contains_object(array, objects[src].item,
>> -				     objects[src].name)) {
>> -			if (src != array->nr)
>> -				objects[array->nr] = objects[src];
>> -			array->nr++;
>> -		} else {
>> -			object_array_release_entry(&objects[src]);
>> -		}
>> -	}
>> -}
>> -
>>  void clear_object_flags(unsigned flags)
>>  {
>>  	int i;
>> diff --git a/object.h b/object.h
>> index 17f32f1103..0e12c75922 100644
>> --- a/object.h
>> +++ b/object.h
>> @@ -324,12 +324,6 @@ typedef int (*object_array_each_func_t)(struct object_array_entry *, void *);
>>  void object_array_filter(struct object_array *array,
>>  			 object_array_each_func_t want, void *cb_data);
>>
>> -/*
>> - * Remove from array all but the first entry with a given name.
>> - * Warning: this function uses an O(N^2) algorithm.
>
> Funny this has been here for more than 10 years. Thanks for this cleanup.
>
>> - */
>> -void object_array_remove_duplicates(struct object_array *array);
>> -
>>  /*
>>   * Remove any objects from the array, freeing all used memory; afterwards
>>   * the array is ready to store more objects with add_object_array().
>> diff --git a/t/t6020-bundle-misc.sh b/t/t6020-bundle-misc.sh
>> index dd09df1287..500c81b8a1 100755
>> --- a/t/t6020-bundle-misc.sh
>> +++ b/t/t6020-bundle-misc.sh
>> @@ -684,7 +684,6 @@ test_expect_success 'create bundle with duplicate refnames' '
>>  	test_cmp expect actual
>>  '
>>
>> -# This exhibits a bug, since the same refname is now added to the bundle twice.
>>  test_expect_success 'create bundle with duplicate refnames and --all' '
>>  	git bundle create out.bdl --all "main" "main" &&
>>
>> @@ -701,7 +700,6 @@ test_expect_success 'create bundle with duplicate refnames and --all' '
>>  	<TAG-2> refs/tags/v2
>>  	<TAG-3> refs/tags/v3
>>  	<COMMIT-P> HEAD
>> -	<COMMIT-P> refs/heads/main
>>  	EOF
>>  	test_cmp expect actual
>>  '
>> @@ -717,7 +715,6 @@ test_expect_success 'create bundle with duplicate exlusion refnames' '
>>  	test_cmp expect actual
>>  '
>>
>> -# This exhibits a bug, since the same refname is now added to the bundle twice.
>>  test_expect_success 'create bundle with duplicate refname short-form' '
>>  	git bundle create out.bdl "main" "main" "refs/heads/main" "refs/heads/main" &&
>>
>> @@ -725,7 +722,6 @@ test_expect_success 'create bundle with duplicate refname short-form' '
>>  		make_user_friendly_and_stable_output >actual &&
>>  	cat >expect <<-\EOF &&
>>  	<COMMIT-P> refs/heads/main
>> -	<COMMIT-P> refs/heads/main
>>  	EOF
>>  	test_cmp expect actual
>>  '
>
> Great work on the alternative implmentation. And thanks for adding these
> tests and actually fixing them. I've been manually testing a few more
> edge cases, I couldn't find any other scenario that's not covered by the
> current implementation.
>
> I approve.
>
> --
> Toon

Appreciate you taking the time and exploring all scenarios. Thanks for
the review.

Karthik

Attachment: signature.asc
Description: PGP signature


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux