Re: [PATCH] describe: use khash in finish_depth_computation()

René Scharfe <l.s.r@xxxxxx> · Wed, 3 Sep 2025 17:41:49 +0200

On 9/3/25 4:31 PM, Jeff King wrote:
> On Tue, Sep 02, 2025 at 08:51:37PM +0200, René Scharfe wrote:
> 
>>> I suspect it won't make a big time difference. The old code should have
>>> been optimized down to a single word load, and now we have two word
>>> loads and an xor. That probably isn't important compared to the actual
>>> 5-word memcmp() we have to do in order to verify that we found the right
>>> bucket anyway.
>>
>> I see slightly worse performance, but within the noise.
>>
>> However, just stacking two words won't do if only a few bits of the
>> resulting hash will be used to find a bucket.  We could mix in more bits
>> and smear them all over, but if that's done by a deterministic function
>> then it could be applied during the construction of manipulated object
>> hash values as well, no?
> 
> I think the difficulty in manipulating scales as the number of bits
> increases. So yeah, if you are worried about the low 8 bits, then
> XOR-ing in another 8 bits is not going to do much. But your table is
> only 256 items long, so you don't care much either way.
> 
> At even 16 bits, it gets hard for the attacker to choose the low 16 bits
> _and_ the low 16 bits of the next word (you mentioned a project earlier
> which claims 28 bits). If you XOR in a third word, now your 16-bit hash
> is using 48 bits that the attacker has to control. And so on.
> 
>> Perhaps salting with a random value determined at runtime would help.
>> Not XORing it in (pointless if the other value is controlled by the
>> attacker, as the result would still collide), but using it as a mask to
>> choose the bits to take from the object hash?
> 
> I think that would work, but XOR-ing the higher order bits is easier to
> do and I think produces a similar effect. Let's shrink the problem for a
> second. Imagine sha1 was 16 bits, and we wanted to create an 8-bit hash
> to use in our table.  The attacker creates two objects with binary
> hashes:
> 
>   object a: 10111001 11110111
>   object b: 01001000 11110111
> 
> They collide in the lower 8 bits, but we don't want them to. In your
> scheme, as I understand it, we'd come up with a 16-bit mask that has
> exactly 8 bits set, like:
> 
>   11010110 01011000
> 
> and then picking only the bits where the mask is "1", we get:
> 
>   object a: 10111001 11110111
>       mask: 11010110 01011000
>     hash a: 10100110
> 
>   object b: 01001000 11110111
>       mask: 11010110 01011000
>     hash b: 01000110
> 
> So I agree that is hard to foil without the attacker knowing which bits
> you'll pick. You've made their job 8 bits harder, because they now have
> to control all 16 bits to get their collision.
> 
> But if we instead XOR the words of the object hashes together, we get:
> 
>   object a[hi]: 10111001
>   object a[lo]: 11110111
>         hash a: 01001110
> 
>   object b[hi]: 01001000
>   object b[lo]: 11110111
>         hash b: 10111111
> 
> So you're flipping bits "randomly". It's not truly random, but is coming
> from the rest of the hash the attacker provided. But for any bit they
> want to control, they have to control that position in both words. So
> they're back to needing to control all 16 bits to get their desired
> hash.
> 
> And as somebody who just hand-computed those answers, I can tell you
> that the XOR one is much simpler to do. ;)
No doubt. :)  Using a deterministic function is easier, but also allows
an attacker to use it for finding object hash values that yield
colliding hash table hashes.

How does an attacker control object hashes?  Hash it, check if it fits
the criteria, if it doesn't then make some inconsequential changes like
adding whitespace to a commit message and repeat.  That criteria can be
"bits 1-16 are all zero", but it can just as well be "bits 1-8 XORed
with bits 9-16 are all zero".  For the former they'd have to roll the
dice in the order of 2^16 times, for the latter just 2^8 times.

The attacker in our scenario doesn't have to care about the individual
bits of object hashes, just the resulting hash table hashes, which
reduces their search space a lot.  Making the deterministic function
more complicated or using more attacker-supplied input bits doesn't
change that.

René