On Mon, Sep 08, 2025 at 01:17:39PM +0200, Patrick Steinhardt wrote: > On Fri, Sep 05, 2025 at 02:29:50PM -0400, Derrick Stolee wrote: > > On 9/4/2025 7:12 PM, Junio C Hamano wrote: > > > Patrick Steinhardt <ps@xxxxxx> writes: > > > > > >> commit graphs are currently stored on the object database level. This > > >> doesn't really make much sense conceptually, given that commit graphs > > >> are specific to one object source. Furthermore, with the upcoming > > >> pluggable object database effort, an object source's backend may not > > >> evene have a commit graph in the first place but store that information > > >> in a different format altogether. > > >> > > >> This patch series prepares for that by moving the commit graph from > > >> `struct object_database` into `struct odb_source`. > > > > > > Hmph, I am finding the above hard to agree with at the conceptual > > > level. In some future, we may use multiple object stores in a > > > single repository. Perhaps we would be storing older parts of > > > history in semi-online storage while newer parts are stored in > > > readily available storage. But the side data structure that allows > > > us to quickly learn who are parents of one commit is without having > > > to go to the object store in order to parse the actualy commit > > > object can be stored for the entire history if we wanted to, or more > > > recent part of the history but not limited to the "readily available > > > storage" part. IOW, where the boundary between the older and the > > > newer parts of the history lies and which commits the commit graph > > > covers should be pretty much independent. > > > > > > So moving from object_database (i.e. the whole world) to individual > > > odb_source (i.e. where one particular subset of the history is > > > stored) feels like totally backwards to me. Surely, a commit graph > > > file may be defined over a set of packfiles and remaining loose > > > object files, but it is not like an instance of the commit-graph > > > file is tied to packfiles in the sense that it uses the index into > > > some packfile instead of the actual object names to refer to > > > commits, or anything like that (this is quite different from other > > > files that are very specific to a single object store, like midx > > > that is tied to the packfiles it describes). > > > > This is an interesting aspect to things, where the commit-graph file > > is a "structured cache" of certain commit information. It happens to > > be located within the object stores (either local or in an alternate) > > but is conceptually different in a few ways. > > > > The biggest difference is that you can only open one commit-graph > > (or chain of commit-graphs). Having multiple files across different > > object stores will not accumulate additional context. Instead, we > > have a "first one wins" approach. > > > > This does seem to be something that you are attempting to change > > by including the ability to load a commit-graph for each odb (and > > closing them in sequence as we close a repo). > > > > So in this sense, the commit-graph lives at the repository level, > > not an object store level. When doing I/O to write or read a graph, > > we use a specific object store at a time. > > > > The other direction to consider is what context we have when we > > interact with a commit-graph. We generally are parsing commits from > > a repository or loading Bloom filter data during file history walks. > > Each of these do not have a predictable nature of which object store > > will "own" the commit we are inspecting, so it wouldn't make sense > > to restrict things like odb_parse_commit() over repo_parse_commit(). > > > > With these thoughts in mind, I have these big-picture thoughts: > > > > 1. Patches 1-5 are great. Nice cleanups. > > > > 2. Some of Patch 6 is great, including having the I/O methods use > > an odb_source to help focus the specific location of the files > > being read or written. However, the movement of the struct into > > the odb_source makes less sense and should still exist at the > > object_database level. > > I (probably unsurprisingly :)) don't quite agree with this. > > Let's take a step back: why does the commit-graph exist in the first > place? It basically provides a caching mechanism to efficiently return > information that is otherwise more expensive to obtain: > > - It contains a cached representation of the graph so that we don't > have to parse each commit from the object database. > > - It encodes generation numbers. > > - It contains bloom filters. I don't think these three things are all on the same conceptual footing. In the current object storage format, yes, part of the commit-graph's responsibility is to provide fast access to the repository's commits and their ancestors. That is a pure optimization that is inherently tied to the object storage format, which necessitates such an optimization for certain operations to be efficient. Generation numbers and changed-path Bloom filters are different. Yes, they *can* be used to optimize certain operations, but they also provide *new* information that cannot be otherwise stored in the existing object format. Suppose that we have some future object storage backend that stores commit objects in such a way that querying them natively through whatever mechanism that store provides is already as efficient as using a commit-graph file on-disk today. That eliminates the "pure optimization" aspect of the commit-graph. But that object store implementation will still want to implement additional features like generation numbers and changed-path Bloom filters, because those provide new information that is not purely about optimizing reads. > All of which makes sense with the current design of our object storage > format, because obtaining this information can be quite expensive. But > let's consider a different world where we for example store objects in a > proper database: > > - This database may have an efficient way to compute generation > numbers on the fly, either when reading an object or when writing it > to disk. We cannot currently store that information in the packfile > right now, so it needs to be stored out-of-band. But with a database > there is no reason why we couldn't immediately compute and store the > generation number on each insert. Yes, but that does not mean that other parts of the code are not themselves interested in what generation number a commit has for their own purposes. > - This database may have an efficient way to store bloom filters next > to a specific commit directly, without requiring a separate file. Same thought as above. > So I would claim that the commit graph is specifically tied to the > actual storage format of objects, and it's not at all obvious that it > would need to exist if we had a different storage format. I am not sure I agree here. What I'm trying to say is that though it's easy to imagine an object storage format that stores objects in such a way that the commit-graph file's representation of commits is unnecessary, there is additional information in what we currently call the commit-graph that is useful regardless of how fast the actual object store is. > The goal of this patch series is thus explicitly _not_ to allow loading > one commit graph per object source. In fact, the refactorings I did > ensure that we still only ever load a single commit graph. Right, though if I understand correctly I think there *is* a behavior change here in that we will happily fall back to an alternate ODB's commit-graph if the main object store does not have one. That is not how it works today. Thanks, Taylor