Re: [PATCH] describe: use khash in finish_depth_computation()

Jeff King <peff@xxxxxxxx> · Wed, 3 Sep 2025 10:31:34 -0400

On Tue, Sep 02, 2025 at 08:51:37PM +0200, René Scharfe wrote:

> > I suspect it won't make a big time difference. The old code should have
> > been optimized down to a single word load, and now we have two word
> > loads and an xor. That probably isn't important compared to the actual
> > 5-word memcmp() we have to do in order to verify that we found the right
> > bucket anyway.
>
> I see slightly worse performance, but within the noise.
> 
> However, just stacking two words won't do if only a few bits of the
> resulting hash will be used to find a bucket.  We could mix in more bits
> and smear them all over, but if that's done by a deterministic function
> then it could be applied during the construction of manipulated object
> hash values as well, no?

I think the difficulty in manipulating scales as the number of bits
increases. So yeah, if you are worried about the low 8 bits, then
XOR-ing in another 8 bits is not going to do much. But your table is
only 256 items long, so you don't care much either way.

At even 16 bits, it gets hard for the attacker to choose the low 16 bits
_and_ the low 16 bits of the next word (you mentioned a project earlier
which claims 28 bits). If you XOR in a third word, now your 16-bit hash
is using 48 bits that the attacker has to control. And so on.

> Perhaps salting with a random value determined at runtime would help.
> Not XORing it in (pointless if the other value is controlled by the
> attacker, as the result would still collide), but using it as a mask to
> choose the bits to take from the object hash?

I think that would work, but XOR-ing the higher order bits is easier to
do and I think produces a similar effect. Let's shrink the problem for a
second. Imagine sha1 was 16 bits, and we wanted to create an 8-bit hash
to use in our table.  The attacker creates two objects with binary
hashes:

  object a: 10111001 11110111
  object b: 01001000 11110111

They collide in the lower 8 bits, but we don't want them to. In your
scheme, as I understand it, we'd come up with a 16-bit mask that has
exactly 8 bits set, like:

  11010110 01011000

and then picking only the bits where the mask is "1", we get:

  object a: 10111001 11110111
      mask: 11010110 01011000
    hash a: 10100110

  object b: 01001000 11110111
      mask: 11010110 01011000
    hash b: 01000110

So I agree that is hard to foil without the attacker knowing which bits
you'll pick. You've made their job 8 bits harder, because they now have
to control all 16 bits to get their collision.

But if we instead XOR the words of the object hashes together, we get:

  object a[hi]: 10111001
  object a[lo]: 11110111
        hash a: 01001110

  object b[hi]: 01001000
  object b[lo]: 11110111
        hash b: 10111111

So you're flipping bits "randomly". It's not truly random, but is coming
from the rest of the hash the attacker provided. But for any bit they
want to control, they have to control that position in both words. So
they're back to needing to control all 16 bits to get their desired
hash.

And as somebody who just hand-computed those answers, I can tell you
that the XOR one is much simpler to do. ;)

-Peff