On Mon, Sep 08, 2025 at 10:46:40AM -0400, Derrick Stolee wrote:
> On 9/8/2025 7:17 AM, Patrick Steinhardt wrote:
> > I (probably unsurprisingly :)) don't quite agree with this.
>
> I think I can summarize the main point you seem to be making with
> this quote:
>
> > So I would claim that the commit graph is specifically tied to the
> > actual storage format of objects, and it's not at all obvious that it
> > would need to exist if we had a different storage format.
>
> I think I agree in principle, if you are saying "different storage
> format" means "different commit object information" which then means
> we are talking about a completely different object type that is not
> at all compatible with current Git.

I don't plan to introduce a new object type alongside commit, blob and
tag, but really only the possibility of a different format for how
objects are stored. So it's not really "different commit object
information" that I want to store. It's rather "additional commit object
information" that we currently have to store out-of-band, such as
generation numbers or Bloom filters.

As an example, if the object format stores data in the cloud, it would
be sensible to also store generation numbers there, as that would ensure
we don't have to recompute them on every client. In other words, the
data that is currently stored locally in the commit-graph would become
fully distributed, too.

> You could store commit and object data in SQLite, in the cloud, or
> via plaintext files on disk. As long as the data is still representing
> commit objects as we format them today, the commit-graph is still a
> cache that can be used as a faster way to fill 'struct commit' objects
> in memory without navigating to that object database.
>
> And you also mention that the commit-graph format itself could be
> more efficient. You're right!
> I think the way we did it within Azure
> DevOps is more efficient, because most of the commit walking algorithms
> are built working directly on the integer labels within the in-memory
> data structure instead of operating on commit structs. This allows for
> less overhead when loading the graph (it's already cached in memory)
> and when walking thousands of commits (we only translate to object IDs
> if they are important for the output). But this is all the more reason
> to keep the commit-graph structures outside of "the object store" since
> a "commit-graph database" can be implemented without being tied to an
> object store.
>
> If you are saying "but our existing commit-graph format puts it in
> 'objects/info/commit-graph[s]'" then yes the storage of a commit-graph
> is tied to our storage of objects. But the way we interact with it in
> code is in some way a layer above that.

Indeed, there is no inherent reason why a new backend would not be able
to use the existing commit-graph infrastructure. But there are reasons
why specific backends may not want to do so. If objects are already
stored in a database table, then it may make much more sense to store
the metadata that currently lives in the commit-graph in a secondary
database table instead.

That raises the question of who gets to decide which caching format to
use. That is, given a repository, do we want to use a commit-graph, or
do we rather want to store that data in the database? I think the most
sensible way to decide this is via the backend of a specific object
source. The backend knows how objects are stored, so it will also know
whether there is a better way to store metadata than via commit-graphs.
If it is the "files" backend, it will decide to consult the
commit-graph. If it is a SQLite backend, it _may_ make sense to use a
commit-graph, as well.
But if it is, for example, a remote database, it may make more sense to
store the information in that database directly.

By moving this logic into the object source we can tie the decision to
the object source and also abstract it away. In the pluggable object
database world we can make this data available via a couple of function
pointers that exist per object source:

  - A function to check whether a cached representation of the graph
    exists in a specific source.

  - Functions to load Bloom filters and generation numbers,
    respectively, via that source.

  - A function to load a commit via the cached representation of its
    source.

That way, the details of the actual data format are hidden away, and
most of the code never has to care whether the data lives in a
commit-graph or in a distributed database.

Now, it's not impossible to make this work in your proposed world where
the commit-graph continues to sit at the database level. But we'd still
have to abstract away logic so that we can have different ways to store
cached data. We could for example implement logic to ask every backend
whether it has a cached representation of the commit-graph, and if so,
to provide an opaque data structure that contains the above function
pointers. We could then continue to store that structure at the
database level and use it whenever available. But I'm not sure that
design is in any way more obvious -- quite the contrary, I expect it
may be more complex.

This is roughly what I have in my head right now. And I realize that
this information really should be sitting in a design document. I'm
working on that, but still need to land two more patch series before I
want to send such a patch series to the list.

Thanks for the discussion by the way, really appreciate it!

Patrick