On 8/25/25 9:34 AM, Jeff King wrote:
> On Sun, Aug 24, 2025 at 06:32:47PM +0200, René Scharfe wrote:
>
>>>> Use the commit index value as a hash because it is unique, has the
>>>> right size and needs no computation.  We could also derive the hash
>>>> directly from the pointer value, but that requires slightly more
>>>> effort.
>>>
>>> Interesting. This is exactly what commit-slabs were created for (and
>>> are why the convenient index value is there in the first place!).
>>
>> Kinda -- they have a payload value, while a khash set only stores
>> keys.  A commit slab with an int payload would still be smaller than
>> a khash set with 64-bit pointers as keys -- IF we were to add all
>> commits.  Here we typically add just a few, but a pathological
>> history could add a lot; not sure if there's a boundary.  Hmm.  So
>> you might be able to find examples where commit-slabs win.
>
> A khash set doesn't have a "vals" array, but I think it does always keep
> the flags array, which is a uint32_t per bucket. That's how it knows
> which buckets have a value in them (since it does not otherwise assume
> there is a sentinel value like NULL). So even though its API is that of
> a set, you can think of it as a mapping of keys to flags. So a mapping
> to an int (or even a char) should cost the same.

Admittedly I *did* ignore the flags and they *are* stored in an array
of uint32_t, but they occupy only two bits per bucket (see
__ac_isempty, __ac_isdel, __ac_fsize).

>>> +define_commit_slab(commit_counter, int);
>>
>> We only need one bit, so a uint8_t or char would suffice.
>
> True, though that doesn't seem to change timings much for me (I used
> "int" because that's what your khash used, but I think the type is a
> dummy value in "set" mode).

I also saw no significant change.  Yes, the second type is not
actually used in a khash set.

> Thinking on it more, how much benefit are we getting from the use
> of commit->index in your patch?
> In other contexts where we store oids or
> objects in hash tables, we use the low bits of the oid directly. So
> it's similarly cheap. If I do this tweak on top of your patch:
>
> diff --git a/builtin/describe.c b/builtin/describe.c
> index edb4dec79d..e024feb080 100644
> --- a/builtin/describe.c
> +++ b/builtin/describe.c
> @@ -289,7 +289,7 @@ static void lazy_queue_clear(struct lazy_queue *queue)
>
>  static inline unsigned int commit_index(const struct commit *commit)
>  {
> -	return commit->index;
> +	return oidhash(&commit->object.oid);
>  }
>
>  static inline int ptr_eq(const void *a, const void *b)
>
> I get similar results (actually faster, but I think within the noise).

Sure.  I'm not comfortable with oidhash() though, because it allows
attackers to influence the hash value, cause collisions and thus
increase the cost of lookups and inserts to O(N), leading to quadratic
complexity overall.  They "just" need to construct commits with a
common hash prefix.  I guess that's easy for two bytes and hard for
four bytes.  Not sure what an attacker would get out of planting such
performance traps, but I guess some people would do it just for the
heck of it.  Letting oidhash() XOR in another word would close that
line of attack for quite a while, I assume.

> Which made me think more (always a danger). We already have an oidset
> data structure, which is backed by a khash. It's not _quite_ the same
> thing, because it's going to store an actual 36-byte oid struct as the
> hash key, rather than an 8-byte pointer to a "struct object" (which
> contains that same oid). And likewise when we do a lookup, the bucket
> search requires a larger memcmp() than the pointer equality. But how
> much difference does that make?
>
> If I instead do this on top of your patch:
>
> diff --git a/builtin/describe.c b/builtin/describe.c
> index edb4dec79d..38ce0c1978 100644
> --- a/builtin/describe.c
> +++ b/builtin/describe.c
> @@ -287,41 +287,16 @@ static void lazy_queue_clear(struct lazy_queue *queue)
>  	queue->get_pending = false;
>  }
>
> -static inline unsigned int commit_index(const struct commit *commit)
> -{
> -	return commit->index;
> -}
> -
> -static inline int ptr_eq(const void *a, const void *b)
> -{
> -	return a == b;
> -}
> -
> -KHASH_INIT(commit_set, struct commit *, int, 0, commit_index, ptr_eq)
> -
> -static void commit_set_insert(kh_commit_set_t *set, struct commit *commit)
> -{
> -	int added;
> -	kh_put_commit_set(set, commit, &added);
> -}
> -
> -static void commit_set_remove(kh_commit_set_t *set, struct commit *commit)
> -{
> -	khiter_t pos = kh_get_commit_set(set, commit);
> -	if (pos != kh_end(set))
> -		kh_del_commit_set(set, pos);
> -}
> -
>  static unsigned long finish_depth_computation(struct lazy_queue *queue,
>  					      struct possible_tag *best)
>  {
>  	unsigned long seen_commits = 0;
> -	kh_commit_set_t unflagged = { 0 };
> +	struct oidset unflagged = OIDSET_INIT;
>
>  	for (size_t i = queue->get_pending ? 1 : 0; i < queue->queue.nr; i++) {
>  		struct commit *commit = queue->queue.array[i].data;
>  		if (!(commit->object.flags & best->flag_within))
> -			commit_set_insert(&unflagged, commit);
> +			oidset_insert(&unflagged, &commit->object.oid);
>  	}
>
>  	while (!lazy_queue_empty(queue)) {
> @@ -330,10 +305,10 @@ static unsigned long finish_depth_computation(struct lazy_queue *queue,
>
>  		seen_commits++;
>  		if (c->object.flags & best->flag_within) {
> -			if (!kh_size(&unflagged))
> +			if (!oidset_size(&unflagged))
>  				break;
>  		} else {
> -			commit_set_remove(&unflagged, c);
> +			oidset_remove(&unflagged, &c->object.oid);
>  			best->depth++;
>  		}
>  		while (parents) {
> @@ -348,13 +323,13 @@ static unsigned long finish_depth_computation(struct lazy_queue *queue,
>  			p->object.flags |= c->object.flags;
>  			flag_after = p->object.flags & best->flag_within;
>  			if (!seen && !flag_after)
> -				commit_set_insert(&unflagged, p);
> +				oidset_insert(&unflagged, &p->object.oid);
>  			if (seen && !flag_before && flag_after)
> -				commit_set_remove(&unflagged, p);
> +				oidset_remove(&unflagged, &p->object.oid);
>  			parents = parents->next;
>  		}
>  	}
> -	kh_release_commit_set(&unflagged);
> +	oidset_clear(&unflagged);
>  	return seen_commits;
>  }
>
> then I likewise get very similar timings. Your version seems to be
> consistently a little bit faster, but within the run-to-run noise of
> ~1%. But the bonus here is that we didn't need to define a new hash
> type, nor do any tricks with the commit->index field.

I don't see any performance difference at all.  Using the existing
oidset structure is clearly better under these circumstances.

> Now if we really are worried about those extra bytes in storing a fresh
> oid, is there room for new data types? I.e., A "struct objset" that
> stores object pointers, hashes based on the oid, and uses pointer
> equality to find a match? And likewise a "struct objmap" that could
> perhaps compete with commit-slab (but be more pleasant to use).

Maybe.
It's a matter of measuring the performance difference, though, and
that's hard.

>>> 2. Can the hash version handle strides of different sizes? One of the
>>>    points of commit-slab is that the fixed size of the value type can
>>>    be set at runtime (so you could have a slab of 32 bits per commit,
>>>    or 132, depending on your traversal needs).
>>
>> Dynamic value sizes require an indirection via a pointer, or at least
>> I don't see any other way.  What would be a possible use case?
>> (Don't see any in-tree.)
>
> The value isn't totally dynamic; it's static for a single instantiation
> of a slab. I don't think khash supports that (it has a value type and
> lets the compiler take care of indexing into an array of that value).
> But you could imagine a world where khash (or some custom hash code)
> only knows that it's storing N bytes per item, and does the indexing
> multiplication itself.
>
> But yes, I don't think we are even using that feature of commit-slab
> currently. I used it in a series for fast --contains long ago:
>
>   https://lore.kernel.org/git/20140625233429.GA20457@xxxxxxxxxxxxxxxxxxxxx/
>
> but can't remember why I never got back to it. It would also be useful
> for show-branch, which wants to paint down for each starting tip. But
> given that we haven't used it, it may just not be that useful a feature.

Or for describe, actually, to allow an arbitrarily large
--max-candidates value.

René