Re: [PATCH 0/6] odb: track commit graphs via object source

Taylor Blau <me@xxxxxxxxxxxx> · Thu, 11 Sep 2025 19:08:48 -0400

On Mon, Sep 08, 2025 at 01:17:39PM +0200, Patrick Steinhardt wrote:
> On Fri, Sep 05, 2025 at 02:29:50PM -0400, Derrick Stolee wrote:
> > On 9/4/2025 7:12 PM, Junio C Hamano wrote:
> > > Patrick Steinhardt <ps@xxxxxx> writes:
> > >
> > >> commit graphs are currently stored on the object database level. This
> > >> doesn't really make much sense conceptually, given that commit graphs
> > >> are specific to one object source. Furthermore, with the upcoming
> > >> pluggable object database effort, an object source's backend may not
> > >> evene have a commit graph in the first place but store that information
> > >> in a different format altogether.
> > >>
> > >> This patch series prepares for that by moving the commit graph from
> > >> `struct object_database` into `struct odb_source`.
> > >
> > > Hmph, I am finding the above hard to agree with at the conceptual
> > > level.  In some future, we may use multiple object stores in a
> > > single repository.  Perhaps we would be storing older parts of
> > > history in semi-online storage while newer parts are stored in
> > > readily available storage.  But the side data structure that allows
> > > us to quickly learn who are parents of one commit is without having
> > > to go to the object store in order to parse the actualy commit
> > > object can be stored for the entire history if we wanted to, or more
> > > recent part of the history but not limited to the "readily available
> > > storage" part.  IOW, where the boundary between the older and the
> > > newer parts of the history lies and which commits the commit graph
> > > covers should be pretty much independent.
> > >
> > > So moving from object_database (i.e. the whole world) to individual
> > > odb_source (i.e. where one particular subset of the history is
> > > stored) feels like totally backwards to me.  Surely, a commit graph
> > > file may be defined over a set of packfiles and remaining loose
> > > object files, but it is not like an instance of the commit-graph
> > > file is tied to packfiles in the sense that it uses the index into
> > > some packfile instead of the actual object names to refer to
> > > commits, or anything like that (this is quite different from other
> > > files that are very specific to a single object store, like midx
> > > that is tied to the packfiles it describes).
> >
> > This is an interesting aspect to things, where the commit-graph file
> > is a "structured cache" of certain commit information. It happens to
> > be located within the object stores (either local or in an alternate)
> > but is conceptually different in a few ways.
> >
> > The biggest difference is that you can only open one commit-graph
> > (or chain of commit-graphs). Having multiple files across different
> > object stores will not accumulate additional context. Instead, we
> > have a "first one wins" approach.
> >
> > This does seem to be something that you are attempting to change
> > by including the ability to load a commit-graph for each odb (and
> > closing them in sequence as we close a repo).
> >
> > So in this sense, the commit-graph lives at the repository level,
> > not an object store level. When doing I/O to write or read a graph,
> > we use a specific object store at a time.
> >
> > The other direction to consider is what context we have when we
> > interact with a commit-graph. We generally are parsing commits from
> > a repository or loading Bloom filter data during file history walks.
> > Each of these do not have a predictable nature of which object store
> > will "own" the commit we are inspecting, so it wouldn't make sense
> > to restrict things like odb_parse_commit() over repo_parse_commit().
> >
> > With these thoughts in mind, I have these big-picture thoughts:
> >
> > 1. Patches 1-5 are great. Nice cleanups.
> >
> > 2. Some of Patch 6 is great, including having the I/O methods use
> >    an odb_source to help focus the specific location of the files
> >    being read or written. However, the movement of the struct into
> >    the odb_source makes less sense and should still exist at the
> >    object_database level.
>
> I (probably unsurprisingly :)) don't quite agree with this.
>
> Let's take a step back: why does the commit-graph exist in the first
> place? It basically provides a caching mechanism to efficiently return
> information that is otherwise more expensive to obtain:
>
>   - It contains a cached representation of the graph so that we don't
>     have to parse each commit from the object database.
>
>   - It encodes generation numbers.
>
>   - It contains bloom filters.

I don't think these three things are all on the same conceptual footing.

In the current object storage format, yes, part of the commit-graph's
responsibility is to provide fast access to the repository's commits and
their ancestors. That is a pure optimization that is inherently tied to
the object storage format, which necessitates such an optimization for
certain operations to be efficient.

Generation numbers and changed-path Bloom filters are different. Yes,
they *can* be used to optimize certain operations, but they also provide
*new* information that cannot be otherwise stored in the existing object
format.

Suppose that we have some future object storage backend that stores
commit objects in such a way that querying them natively through
whatever mechanism that store provides is already as efficient as using
a commit-graph file on-disk today. That eliminates the "pure
optimization" aspect of the commit-graph.

But that object store implementation will still want to implement
additional features like generation numbers and changed-path Bloom
filters, because those provide new information that is not purely about
optimizing reads.

> All of which makes sense with the current design of our object storage
> format, because obtaining this information can be quite expensive. But
> let's consider a different world where we for example store objects in a
> proper database:
>
>   - This database may have an efficient way to compute generation
>     numbers on the fly, either when reading an object or when writing it
>     to disk. We cannot currently store that information in the packfile
>     right now, so it needs to be stored out-of-band. But with a database
>     there is no reason why we couldn't immediately compute and store the
>     generation number on each insert.

Yes, but that does not mean that other parts of the code are not
themselves interested in what generation number a commit has for their
own purposes.

>   - This database may have an efficient way to store bloom filters next
>     to a specific commit directly, without requiring a separate file.

Same thought as above.

> So I would claim that the commit graph is specifically tied to the
> actual storage format of objects, and it's not at all obvious that it
> would need to exist if we had a different storage format.

I am not sure I agree here. What I'm trying to say is that though it's
easy to imagine an object storage format that stores objects in such a
way that the commit-graph file's representation of commits is
unnecessary, there is additional information in what we currently call
the commit-graph that is useful regardless of how fast the actual object
store is.

> The goal of this patch series is thus explicitly _not_ to allow loading
> one commit graph per object source. In fact, the refactorings I did
> ensure that we still only ever load a single commit graph.

Right, though if I understand correctly I think there *is* a behavior
change here in that we will happily fall back to an alternate ODB's
commit-graph if the main object store does not have one. That is not how
it works today.

Thanks,
Taylor