Re: Efficiently storing SHA-1 ↔ SHA-256 mappings in compatibility mode

Eric Wong <e@xxxxxxxxx> · Wed, 27 Aug 2025 19:08:16 +0000

"brian m. carlson" <sandals@xxxxxxxxxxxxxxxxxxxx> wrote:
> TL;DR: We need a different datastore than a flat file for storing
> mappings between SHA-1 and SHA-256 in compatibility mode.  Advice and
> opinions sought.

<snip>

> Our approach for mapping object IDs between algorithms uses data in pack
> index v3 (outlined in the transition document), plus a flat file called
> `loose-object-idx` for loose objects.  However, we didn't anticipate
> that we'd need to handle mappings long-term for data that is neither a
> loose object nor a packed object.
> 
> For instance, with shallow clones, we must store a mapping for the
> shallows the server has sent us[1], since we lack the history to convert
> objects otherwise.  Similarly, if there are submodules or we're using a
> partial clone, we must store those mappings as well, since we cannot
> convert trees without them.  We can store them in the
> `loose-object-idx`, but since it's not sorted or easily searchable, it's
> going to perform really terribly when we store enough of them.  Right
> now, we read the entire file into two hashmaps (one in each direction)
> and we sometimes need to re-read it when other processes add items, so
> it won't take much to make it be slow and take a lot of memory.

This really seems ideal for SQLite, which has come a long way
since 2005 when git started.

I really wish git would've relied more on existing formats
(e.g. LMDB refs) rather than introducing more one-off data
formats that require more cognitive overhead to document and
learn[1], especially when SQLite is extremely portable and works
on tiny devices.

> For these reasons, I think we need a different datastore for this and
> I'd like to solicit opinions on what that should look like.  Here are
> some things that come to mind:
> 
> * The format should be fast to read and relatively fast to write.
> * We need to efficiently read and map objects in both directions.  This
>   is required for many reasons, including efficient fetches and pushes.

SQLite seems to do these well, in my experience.  It's not the
fastest possible data store, but it's no slouch, either.
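
A minimal sketch of a two-way lookup, using Python's stdlib sqlite3
for brevity.  The `oid_map` table, its column names, and the helper
functions are made up for illustration; nothing here is an existing
git format.  The UNIQUE constraints give SQLite an index on each
column, so lookups in either direction avoid a full scan:

```python
import hashlib
import sqlite3

# Hypothetical schema: one row per object, one column per algorithm.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE oid_map ("
    " sha1 BLOB NOT NULL UNIQUE,"
    " sha256 BLOB NOT NULL UNIQUE)"
)

def to_sha256(conn, sha1):
    """Map a SHA-1 object ID to its SHA-256 counterpart, or None."""
    row = conn.execute(
        "SELECT sha256 FROM oid_map WHERE sha1 = ?", (sha1,)
    ).fetchone()
    return row[0] if row else None

def to_sha1(conn, sha256):
    """Map a SHA-256 object ID to its SHA-1 counterpart, or None."""
    row = conn.execute(
        "SELECT sha1 FROM oid_map WHERE sha256 = ?", (sha256,)
    ).fetchone()
    return row[0] if row else None

# Stand-in IDs: these are plain digests of a byte string, not real
# git object IDs, which are computed over the object header and body.
data = b"example"
sha1 = hashlib.sha1(data).digest()
sha256 = hashlib.sha256(data).digest()

with conn:  # commit on success
    conn.execute(
        "INSERT INTO oid_map (sha1, sha256) VALUES (?, ?)", (sha1, sha256)
    )

print(to_sha256(conn, sha1).hex())
print(to_sha1(conn, sha256).hex())
```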

> * We still require an in-memory store because we stuff entries in there
>   without writing them during pack indexing and other operations, but
>   that doesn't mean we need to load data from the data files into the
>   in-memory structure (in fact, we probably should try to avoid it).

SQLite supports in-memory DBs, and also mmap.  I always prefer
to put larger structures in TMPDIR and rely on the page
cache, because sometimes code ends up running on machines with
too little memory/swap (but git has never been great w.r.t.
memory use :<).
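
A rough sketch of the on-disk variant: a database file in a temporary
directory (which Python creates under TMPDIR when that is set), with
memory-mapped I/O requested via `PRAGMA mmap_size`.  The file name,
the table, and the 256 MiB figure are arbitrary choices for the
example.  SQLite caps the request at its compile-time maximum, so the
granted size may be smaller, or zero if mmap is disabled in the build:

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "oid-map.sqlite")  # hypothetical name
    conn = sqlite3.connect(path)

    # Ask for up to 256 MiB of memory-mapped I/O.  The pragma reports
    # the size actually in effect after any build-time cap is applied.
    row = conn.execute("PRAGMA mmap_size = 268435456").fetchone()
    granted = row[0] if row else 0

    conn.execute(
        "CREATE TABLE oid_map ("
        " sha1 BLOB NOT NULL UNIQUE,"
        " sha256 BLOB NOT NULL UNIQUE)"
    )
    with conn:
        conn.execute(
            "INSERT INTO oid_map VALUES (?, ?)",
            (b"\x01" * 20, b"\x01" * 32),
        )

    found = conn.execute(
        "SELECT sha256 FROM oid_map WHERE sha1 = ?", (b"\x01" * 20,)
    ).fetchone()[0]
    conn.close()  # close before the directory is removed

print("mmap bytes granted:", granted)
```

Swapping the path for ":memory:" gives the purely in-memory variant,
at the cost of the data living in the process heap rather than the
page cache.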

> * We want to be able to write small updates to the data without having
>   to re-write the entire thing (e.g., `git add`).  We often know that
>   we'll be writing a whole batch at once, such as with shallows or
>   submodules from a clone or fetch, so many places in the code will be
>   able to start a batch and then write, but we shouldn't assume that
>   will always be the case.  (In other words, we will write more
>   frequently than we do packs or indexes.)

Transactions and atomicity are included, of course.
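
For illustration, a sketch of batched and single-row writes going
through the same code path, again in Python with a made-up `oid_map`
table and a hypothetical `add_mappings` helper.  Each call is one
transaction, so a batch that fails partway leaves nothing behind:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE oid_map ("
    " sha1 BLOB NOT NULL UNIQUE,"
    " sha256 BLOB NOT NULL UNIQUE)"
)

def add_mappings(conn, pairs):
    """Insert all pairs in one transaction: every row lands or none do."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.executemany(
            "INSERT INTO oid_map (sha1, sha256) VALUES (?, ?)", pairs
        )

# A batch, e.g. mappings learned during a clone or fetch.
add_mappings(conn, [(bytes([i]) * 20, bytes([i]) * 32) for i in range(3)])

# A small one-off update, e.g. a single new object.
add_mappings(conn, [(b"\xaa" * 20, b"\xaa" * 32)])

# A batch whose second row collides with an existing SHA-1: the
# UNIQUE constraint fails and the first row is rolled back with it.
bad = [(b"\xbb" * 20, b"\xbb" * 32), (b"\xaa" * 20, b"\xcc" * 32)]
try:
    add_mappings(conn, bad)
except sqlite3.IntegrityError:
    pass

count = conn.execute("SELECT COUNT(*) FROM oid_map").fetchone()[0]
print("rows stored:", count)
```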

> * It would be helpful if we can determine the type of object being
>   stored.  For instance, if we've stored an object mapping because of a
>   shallow, `git gc` could remove that mapping if the shallows have been
>   updated and the mapping is no longer useful.

Column names should be enough.
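
One possible reading of that, sketched in Python: a `kind` column
records why a mapping is stored, and a gc-like pass drops shallow
mappings that are no longer needed.  The column, its values, and
`prune_stale_shallows` are all hypothetical names for this example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE oid_map ("
    " sha1 BLOB NOT NULL UNIQUE,"
    " sha256 BLOB NOT NULL UNIQUE,"
    " kind TEXT NOT NULL)"  # hypothetical: why this mapping is stored
)
with conn:
    conn.executemany(
        "INSERT INTO oid_map VALUES (?, ?, ?)",
        [
            (b"\x01" * 20, b"\x01" * 32, "shallow"),
            (b"\x02" * 20, b"\x02" * 32, "shallow"),
            (b"\x03" * 20, b"\x03" * 32, "submodule"),
        ],
    )

def prune_stale_shallows(conn, current_shallows):
    """Delete 'shallow' mappings whose SHA-1 is not in current_shallows."""
    stale = [
        (sha1,)
        for (sha1,) in conn.execute(
            "SELECT sha1 FROM oid_map WHERE kind = 'shallow'"
        )
        if sha1 not in current_shallows
    ]
    with conn:
        conn.executemany("DELETE FROM oid_map WHERE sha1 = ?", stale)
    return len(stale)

# Pretend only the second commit is still a shallow boundary.
removed = prune_stale_shallows(conn, {b"\x02" * 20})
remaining = [
    kind for (kind,) in conn.execute("SELECT kind FROM oid_map ORDER BY sha1")
]
print(removed, remaining)
```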

> * We should try not to assume only two hash algorithms.  Pack index v3
>   allows for effectively an arbitrary number and while much of the
>   compatibility code assumes one main and one compatibility algorithm,
>   we should try to minimize that if possible.[2]

I haven't used it much, but ALTER TABLE should work well nowadays
for adding (maybe not removing) columns.
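
A sketch of what adding a third algorithm could look like, with a
made-up `newhash` column standing in for some future hash.  SQLite's
ADD COLUMN does not accept a UNIQUE constraint, so this example makes
the column nullable and adds a unique index separately; existing rows
simply have NULL there until the value is filled in:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE oid_map ("
    " sha1 BLOB NOT NULL UNIQUE,"
    " sha256 BLOB NOT NULL UNIQUE)"
)
with conn:
    conn.execute(
        "INSERT INTO oid_map VALUES (?, ?)", (b"\x01" * 20, b"\x01" * 32)
    )

# Later: add a column for a hypothetical third algorithm.
with conn:
    conn.execute("ALTER TABLE oid_map ADD COLUMN newhash BLOB")
    conn.execute(
        "CREATE UNIQUE INDEX oid_map_newhash ON oid_map (newhash)"
    )

# Existing rows start out NULL and can be filled in lazily.
before = conn.execute("SELECT newhash FROM oid_map").fetchone()[0]
with conn:
    conn.execute(
        "UPDATE oid_map SET newhash = ? WHERE sha1 = ?",
        (b"\x01" * 64, b"\x01" * 20),
    )
after = conn.execute("SELECT newhash FROM oid_map").fetchone()[0]

# PRAGMA table_info rows are (cid, name, type, notnull, default, pk).
columns = [row[1] for row in conn.execute("PRAGMA table_info(oid_map)")]
print(columns)
```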

> * Being able to mmap it would be convenient, so if we can make it
>   relatively small, that's nice.

mmap is possible, but default builds of SQLite default to a
relatively small mmap limit (2G?).  I don't know why and never
bothered to sign up for their Fossil (JS required :<, last
I checked) to ask about the small default limit.

I don't like SQLite's approach to rejecting outside
contributions; but otherwise it's served me well with various
bits of Perl code for the last 15 years or so.  Yeah, the SQLite
developer doesn't have the highest opinion of git, but we
shouldn't let that affect our decision making.

[1] Fwiw, I enjoyed working on git a lot more when it used more
    high-level scripting glue.  I'm disappointed in the overall
    movement towards AOT languages (C, now Rust) due to large
    toolchains, slow builds + linkers.  Hacking was much more
    discoverable when I could just edit installed scripts like
    config files and not have to deal with builds at all :>