TL;DR: We need a different datastore than a flat file for storing mappings between SHA-1 and SHA-256 in compatibility mode. Advice and opinions sought. As I've mentioned earlier on the list, I'm working on some of the code for interoperability between SHA-1 and SHA-256. The good news is that it's relatively advanced so far. Right now, we have pack index v3 (so we can store both loose and packed objects) and it's possible to clone and fetch from and push to a single-algorithm server if the client repository supports both algorithms and there are no shallow clones, partial clones, or submodules involved. Those who are interested can look at my `sha256-interop` branch[0], learn more at my talk at Git Merge (which I'll be giving remotely), or talk to me at the Contributor Summit. Our approach for mapping object IDs between algorithms uses data in pack index v3 (outlined in the transition document), plus a flat file called `loose-object-idx` for loose objects. However, we didn't anticipate that we'd need to handle mappings long-term for data that is neither a loose object nor a packed object. For instance, with shallow clones, we must store a mapping for the shallows the server has sent us[1], since we lack the history to convert objects otherwise. Similarly, if there are submodules or we're using a partial clone, we must store those mappings as well, since we cannot convert trees without them. We can store them in the `loose-object-idx`, but since it's not sorted or easily searchable, it's going to perform really terribly when we store enough of them. Right now, we read the entire file into two hashmaps (one in each direction) and we sometimes need to re-read it when other processes add items, so it won't take much to make it be slow and take a lot of memory. For these reasons, I think we need a different datastore for this and I'd like to solicit opinions on what that should look like. Here are some things that come to mind: * The format should be fast to read and relatively fast to write. * We need to efficiently read and map objects in both directions. This is required for many reasons, including efficient fetches and pushes. * We still require an in-memory store because we stuff entries in their without writing them during pack indexing and other operations, but that doesn't mean we need to load data from the data files into the in-memory structure (in fact, we probably should try to avoid it). * We want to be able to write small updates to the data without having to re-write the entire thing (e.g., `git add`). We often know that we'll be writing a whole batch at once, such as with shallows or submodules from a clone or fetch, so many places in the code will be able to start a batch and then write, but we shouldn't assume that will always be the case. (In other words, we will write more frequently than we do packs or indexes.) * It would be helpful if we can determine the type of object being stored. For instance, if we've stored an object mapping because of a shallow, `git gc` could remove that mapping if the shallows have been updated and the mapping is no longer useful. * We should try not to assume only two hash algorithms. Pack index v3 allows for effectively an arbitrary number and while much of the compatibility code assumes one main and one compatibility algorithm, we should try to minimize that if possible.[2] * Being able to mmap it would be convenient, so if we can make it relatively small, that's nice. Some rough ideas of what this could look like: * We could repurpose the top-bit of the pack order value in pack index v3 to indicate an object that's not in the pack (this would limit us to 2^31 items per pack). * We could put this in new entries in multi-pack index and require that (although I'm not sure that I love the idea of requiring multi-pack index in all repositories and I have yet to implement compatibility mode there). * We could write some sort of quadratic rollup format like reftable. I would recommend reading the pack index v3 format documentation in `Documentation/technical/hash-function-transition.adoc`, since I think it's helpful to understand what we have now. I have implemented a small variant on it (documented in my branch) and will send some documentation updates before code, although the differences are minor and not relevant here. I've taken the liberty of CCing some people who have worked deeply with our existing formats (packs, indexes, reftable, and so on), but I've almost certainly missed some people and I'd love thoughts from anyone about this. Once we have an approach that we think is useful, I'm happy to write up a document for it and send it out. [0] Available from https://github.com/bk2204/git.git. [1] This assumes that the server also supports both algorithms to generate and send the mapping. I am implementing that work now and it has yet to be pushed to the branch (because it presently doesn't work). [2] We might find that SHA-256 becomes weak well before SHA-1 is completely dead and we need to deal with a third algorithm suddenly, so we should not mortgage our future unnecessarily. Cryptographic attacks only ever get better. -- brian m. carlson (they/them) Toronto, Ontario, CA
Attachment:
signature.asc
Description: PGP signature