Efficiently storing SHA-1 ↔ SHA-256 mappings in compatibility mode

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



TL;DR: We need a different datastore than a flat file for storing
mappings between SHA-1 and SHA-256 in compatibility mode.  Advice and
opinions sought.

As I've mentioned earlier on the list, I'm working on some of the code
for interoperability between SHA-1 and SHA-256.  The good news is that
it's relatively advanced so far.

Right now, we have pack index v3 (so we can store both loose and packed
objects) and it's possible to clone and fetch from and push to a
single-algorithm server if the client repository supports both
algorithms and there are no shallow clones, partial clones, or
submodules involved.  Those who are interested can look at my
`sha256-interop` branch[0], learn more at my talk at Git Merge (which
I'll be giving remotely), or talk to me at the Contributor Summit.

Our approach for mapping object IDs between algorithms uses data in pack
index v3 (outlined in the transition document), plus a flat file called
`loose-object-idx` for loose objects.  However, we didn't anticipate
that we'd need to handle mappings long-term for data that is neither a
loose object nor a packed object.

For instance, with shallow clones, we must store a mapping for the
shallows the server has sent us[1], since we lack the history to convert
objects otherwise.  Similarly, if there are submodules or we're using a
partial clone, we must store those mappings as well, since we cannot
convert trees without them.  We can store them in the
`loose-object-idx`, but since it's not sorted or easily searchable, it's
going to perform really terribly when we store enough of them.  Right
now, we read the entire file into two hashmaps (one in each direction)
and we sometimes need to re-read it when other processes add items, so
it won't take much to make it be slow and take a lot of memory.

For these reasons, I think we need a different datastore for this and
I'd like to solicit opinions on what that should look like.  Here are
some things that come to mind:

* The format should be fast to read and relatively fast to write.
* We need to efficiently read and map objects in both directions.  This
  is required for many reasons, including efficient fetches and pushes.
* We still require an in-memory store because we stuff entries in their
  without writing them during pack indexing and other operations, but
  that doesn't mean we need to load data from the data files into the
  in-memory structure (in fact, we probably should try to avoid it).
* We want to be able to write small updates to the data without having
  to re-write the entire thing (e.g., `git add`).  We often know that
  we'll be writing a whole batch at once, such as with shallows or
  submodules from a clone or fetch, so many places in the code will be
  able to start a batch and then write, but we shouldn't assume that
  will always be the case.  (In other words, we will write more
  frequently than we do packs or indexes.)
* It would be helpful if we can determine the type of object being
  stored.  For instance, if we've stored an object mapping because of a
  shallow, `git gc` could remove that mapping if the shallows have been
  updated and the mapping is no longer useful.
* We should try not to assume only two hash algorithms.  Pack index v3
  allows for effectively an arbitrary number and while much of the
  compatibility code assumes one main and one compatibility algorithm,
  we should try to minimize that if possible.[2]
* Being able to mmap it would be convenient, so if we can make it
  relatively small, that's nice.

Some rough ideas of what this could look like:

* We could repurpose the top-bit of the pack order value in pack index
  v3 to indicate an object that's not in the pack (this would limit us
  to 2^31 items per pack).
* We could put this in new entries in multi-pack index and require that
  (although I'm not sure that I love the idea of requiring multi-pack
  index in all repositories and I have yet to implement compatibility
  mode there).
* We could write some sort of quadratic rollup format like reftable.

I would recommend reading the pack index v3 format documentation in
`Documentation/technical/hash-function-transition.adoc`, since I think
it's helpful to understand what we have now.  I have implemented a small
variant on it (documented in my branch) and will send some documentation
updates before code, although the differences are minor and not relevant
here.

I've taken the liberty of CCing some people who have worked deeply with
our existing formats (packs, indexes, reftable, and so on), but I've
almost certainly missed some people and I'd love thoughts from anyone
about this.  Once we have an approach that we think is useful, I'm happy
to write up a document for it and send it out.

[0] Available from https://github.com/bk2204/git.git.
[1] This assumes that the server also supports both algorithms to
generate and send the mapping.  I am implementing that work now and it
has yet to be pushed to the branch (because it presently doesn't work).
[2] We might find that SHA-256 becomes weak well before SHA-1 is
completely dead and we need to deal with a third algorithm suddenly, so
we should not mortgage our future unnecessarily.  Cryptographic attacks
only ever get better.
-- 
brian m. carlson (they/them)
Toronto, Ontario, CA

Attachment: signature.asc
Description: PGP signature


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux