On Mon, Sep 08, 2025 at 10:46:40AM -0400, Derrick Stolee wrote:
> On 9/8/2025 7:17 AM, Patrick Steinhardt wrote:
> > I (probably unsurprisingly :)) don't quite agree with this.
>
> I think I can summarize the main point you seem to be making with
> this quote:
>
> > So I would claim that the commit graph is specifically tied to the
> > actual storage format of objects, and it's not at all obvious that it
> > would need to exist if we had a different storage format.
>
> I think I agree in principle, if you are saying "different storage
> format" means "different commit object information" which then means
> we are talking about a completely different object type that is not
> at all compatible with current Git.

I don't plan to introduce a new object type alongside commit, blob and
tag, but really only the possibility of a different format for how
objects are stored. So it's not really "different commit object
information" that I want to store. It's rather "additional commit object
information" that we currently have to store out-of-band, such as
generation numbers or Bloom filters.

As an example, if the object format stores data in the cloud, it would
be sensible to also store generation numbers there, as that would ensure
we don't have to recompute them on every client. In other words, the
data that is currently stored locally in the commit-graph would become
fully distributed, too.

> You could store commit and object data in SQLite, in the cloud, or
> via plaintext files on disk. As long as the data is still representing
> commit objects as we format them today, the commit-graph is still a
> cache that can be used as a faster way to fill 'struct commit' objects
> in memory without navigating to that object database.
>
> And you also mention that the commit-graph format itself could be
> more efficient. You're right!
> I think the way we did it within Azure
> DevOps is more efficient, because most of the commit walking algorithms
> are built working directly on the integer labels within the in-memory
> data structure instead of operating on commit structs. This allows for
> less overhead when loading the graph (it's already cached in memory)
> and when walking thousands of commits (we only translate to object IDs
> if they are important for the output). But this is all the more reason
> to keep the commit-graph structures outside of "the object store" since
> a "commit-graph database" can be implemented without being tied to an
> object store.
>
> If you are saying "but our existing commit-graph format puts it in
> 'objects/info/commit-graph[s]'" then yes the storage of a commit-graph
> is tied to our storage of objects. But the way we interact with it in
> code is in some way a layer above that.

Indeed, there is no inherent reason why a new backend would not be able
to use the existing commit-graph infrastructure. But there are reasons
why specific backends may not want to do so. If objects are already
stored in a database table, then it may make much more sense to store
the metadata that currently lives in the commit-graph in a secondary
database table instead.

That raises the question of who gets to decide which caching format to
use. That is, given a repository, do we want to use a commit-graph, or
do we rather want to store that data in the database? I think the most
sensible way to decide this is via the backend of a specific object
source. The backend knows how objects are stored, so it will also know
whether there is a better way to store metadata than via commit-graphs.
If it is the "files" backend, it will decide to consult the
commit-graph. If it is a SQLite backend, it _may_ make sense to use a
commit-graph, as well.
But if it is, for example, a remote database, it may make more sense to
store the information in that database directly.

By moving this logic into the object source we can tie the decision to
the object source and also abstract it away. In the pluggable object
database world we can make this data available via a couple of function
pointers that exist per object source:

  - A function to check whether a cached representation of the graph
    exists in a specific source.

  - Functions to load Bloom filters and generation numbers,
    respectively, via that source.

  - A function to load a commit via the cached representation of its
    source.

That way, the details of the actual data format are hidden away, and
most of the code never has to care whether the data lives in a
commit-graph or in a distributed database.

Now, it's not impossible to make this work in your proposed world where
the commit-graph continues to sit at the database level. But we'd still
have to abstract away logic so that we can have different ways to store
cached data. We could for example implement logic to ask every backend
whether it has a cached representation of the commit-graph, and if so,
to provide an opaque data structure that contains the above function
pointers. We could then continue to store that structure at the
database level and use it whenever available. But I'm not sure that
design is in any way more obvious -- quite the contrary, I expect it
may be more complex.

This is roughly what I have in my head right now. And I realize that
this information really should be sitting in a design document. I'm
working on that, but still need to land two more patch series before I
want to send such a patch series to the list.

Thanks for the discussion by the way, really appreciate it!

Patrick