Re: [PATCH v2] abbrev: allow extending beyond 20 chars to disambiguate

Derrick Stolee <stolee@xxxxxxxxx> · Tue, 12 Aug 2025 09:28:43 -0400

On 8/11/2025 3:06 PM, Junio C Hamano wrote:
> When you have two or more objects with object names that share more
> than half the length of the hash algorithm in use (e.g. 10 bytes for
> SHA-1 that produces 20-byte/160-bit hash), find_unique_abbrev()
> fails to show disambiguation.
> 
> To see how many leading letters of a given full object name is
> sufficiently unambiguous, the algorithm starts from a initial
> length, guessed based on the estimated number of objects in the
> repository, and see if another object that shares the prefix, and
> keeps extending the abbreviation.  The loop stops at GIT_MAX_RAWSZ,
> which is counted as the number of bytes, since 5b20ace6 (sha1_name:
> unroll len loop in find_unique_abbrev_r(), 2017-10-08);

Wow. My first Git contribution.

Some style issues that you're opportunistically cleaning up are
due to my newness, for sure.

> before that
> change, it extended up to GIT_SHA1_HEXSZ, which was the correct
> limit because the loop is adding one output letter per iteration
> and back then SHA256 was not in the picture.
> 
> Pass the max length of the hash being in use in the current
> repository down the code path, and use it to compute the code to
> update the abbreviation length required to make it unique.
> 
> Signed-off-by: Junio C Hamano <gitster@xxxxxxxxx>
> ---
>  object-name.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/object-name.c b/object-name.c
> index 11aa0e6afc..8f9af57c0a 100644
> --- a/object-name.c
> +++ b/object-name.c
> @@ -680,6 +680,7 @@ static unsigned msb(unsigned long val)
>  struct min_abbrev_data {
>  	unsigned int init_len;
>  	unsigned int cur_len;
> +	unsigned int max_len;
>  	char *hex;
>  	struct repository *repo;
>  	const struct object_id *oid;
> @@ -699,12 +700,12 @@ static inline char get_hex_char_from_oid(const struct object_id *oid,
>  static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
>  {
>  	struct min_abbrev_data *mad = cb_data;
> -
>  	unsigned int i = mad->init_len;
> +
>  	while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i))
>  		i++;
>  
> -	if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
> +	if (mad->cur_len <= i && i < mad->max_len)
>  		mad->cur_len = i + 1;

This logic is all about not extending the abbreviation
length to beyond the length of the hex array, so your
limits make sense. Moving the comparisons here helps
with readability.  
>  	return 0;
> @@ -864,6 +865,7 @@ int repo_find_unique_abbrev_r(struct repository *r, char *hex,
>  	mad.repo = r;
>  	mad.init_len = len;
>  	mad.cur_len = len;
> +	mad.max_len = hexsz;

This new parameter required for allowing SHA256 is valuable.

I agree that we shouldn't need a test case to guarantee this
in the future. Good to clean up unnecessary uses of the
macro limits.

Thanks,
-Stolee