On 9/2/25 2:38 PM, Jeff King wrote:
>
> I doubt that commit->index is any better in that regard. If I can
> influence the order in which Git loads the commits (e.g., by creating a
> bunch of refs which get loaded when we walk over for_each_ref), I can
> choose the index for each. They'll be unique, but I can still cause
> collisions modulo the number of buckets.

Hmm, sounds tricky, but feasible. :-O

> Likewise for oidhash(), I'd guess that colliding 4 bytes is not even
> necessary to cause trouble, since probing starts by throwing away
> everything mod n_buckets. So really you just need to collide however
> many low bits you need to make your desired N, and then get O(N^2)
> behavior.

Oh, right.

>> Letting oidhash() XOR in another word would close that line of attack
>> for quite a while, I assume.
>
> Yeah. We have at least 160 bits in an object hash, and we only bother
> with the low 32. We could XOR up to 5 times, but I agree that even a
> single extra word would probably be plenty. Might be an interesting
> experiment to time something like the patch below on a hash-heavy
> workload.
>
> diff --git a/hash.h b/hash.h
> index fae966b23c..c9d21f589e 100644
> --- a/hash.h
> +++ b/hash.h
> @@ -457,7 +457,10 @@ static inline unsigned int oidhash(const struct object_id *oid)
>  	 * platforms that don't support unaligned reads.
>  	 */
>  	unsigned int hash;
> +	unsigned int entropy;
>  	memcpy(&hash, oid->hash, sizeof(hash));
> +	memcpy(&entropy, oid->hash + sizeof(hash), sizeof(entropy));
> +	hash ^= entropy;
>  	return hash;
>  }
>
>
> I suspect it won't make a big time difference. The old code should have
> been optimized down to a single word load, and now we have two word
> loads and an xor. That probably isn't important compared to the actual
> 5-word memcmp() we have to do in order to verify that we found the right
> bucket anyway.

I see slightly worse performance, but within the noise.

However, just stacking two words won't do if only a few bits of the
resulting hash will be used to find a bucket. We could mix in more bits
and smear them all over, but if that's done by a deterministic function
then it could be applied during the construction of manipulated object
hash values as well, no?

Perhaps salting with a random value determined at runtime would help.
Not XORing it in (pointless if the other value is controlled by the
attacker, as the result would still collide), but using it as a mask to
choose the bits to take from the object hash?

René
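
For illustration, a small standalone sketch of the two variants discussed
above; it is not taken from git's hash.h and is untested. oidhash_fold()
XORs all 32-bit words of the object hash together (the "XOR up to 5
times" idea), and oidhash_salted() is one possible reading of the mask
proposal: a random 32-bit salt fixed at process start decides, bit by
bit, whether each output bit is taken from the first or the second word
of the object ID. struct demo_oid, the 20-byte size, and the
srand()-based salt are placeholders only; a real implementation would
use git's struct object_id and a proper entropy source.

/*
 * Standalone sketch, not git code: two ways to derive a bucket hash
 * from an object ID.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define DEMO_RAWSZ 20	/* SHA-1 sized, for illustration only */

struct demo_oid {
	unsigned char hash[DEMO_RAWSZ];
};

/* XOR together every aligned 32-bit word of the object hash. */
static unsigned int oidhash_fold(const struct demo_oid *oid)
{
	unsigned int hash = 0;
	size_t i;

	for (i = 0; i + sizeof(unsigned int) <= DEMO_RAWSZ;
	     i += sizeof(unsigned int)) {
		unsigned int word;
		memcpy(&word, oid->hash + i, sizeof(word));
		hash ^= word;
	}
	return hash;
}

/*
 * Random mask fixed once per process; bits set in the mask are taken
 * from the first word of the object hash, cleared bits from the second.
 */
static unsigned int hash_salt;

static void init_hash_salt(void)
{
	srand((unsigned int)time(NULL));	/* placeholder entropy source */
	hash_salt = ((unsigned int)rand() << 16) ^ (unsigned int)rand();
}

static unsigned int oidhash_salted(const struct demo_oid *oid)
{
	unsigned int w0, w1;

	memcpy(&w0, oid->hash, sizeof(w0));
	memcpy(&w1, oid->hash + sizeof(w0), sizeof(w1));
	return (w0 & hash_salt) | (w1 & ~hash_salt);
}

int main(void)
{
	struct demo_oid oid;
	size_t i;

	init_hash_salt();

	/* Fill a fake object ID with arbitrary bytes. */
	for (i = 0; i < DEMO_RAWSZ; i++)
		oid.hash[i] = (unsigned char)(i * 37 + 11);

	printf("fold:   %08x\n", oidhash_fold(&oid));
	printf("salted: %08x\n", oidhash_salted(&oid));
	return 0;
}

The salted variant is only meant to illustrate the bit-selection idea:
an attacker who cannot observe the salt cannot know in advance which
bits of the object hash end up selecting the bucket. It is a sketch of
the proposal above, not a security analysis of it.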