Re: [PATCH v2 8/9] object-store: remove global array of cached objects

Patrick Steinhardt <ps@xxxxxx> · Tue, 15 Apr 2025 11:19:13 +0200

On Fri, Apr 11, 2025 at 03:58:03PM -0700, Junio C Hamano wrote:
> Patrick Steinhardt <ps@xxxxxx> writes:
> 
> > Cached objects are virtual objects that can be set up without writing
> > anything into the object store directly. This mechanism for example
> > allows us to create fake commits in git-blame(1).
> >
> > The cached objects are stored in a global variable. Refactor the code so
> > that we instead store the array as part of the raw object store. This is
> > another step into the direction of libifying our object database.
> 
> While we do need some execution context object to hang these virtual
> objects, once we decide that it cannot be global, I am not sure if
> repository objects are good home for them.  If your application
> running in a repository needs to give one object name to a virtual
> object, and then that same application wants to access a submodule
> of that repository in the same process image, wouldn't you have one
> in-core repository object for the top-level superproject, and one
> for each submodule?  If a submodule commit bound to a path in the
> superproject's tree is a virtual "pretend" commit object or if it
> has a virtual "pretend" tree object, don't you need to expose these
> to both submodule and superproject repositories, if your application
> wants to seamlessly cross the module boundary (think "git grep
> --recurse-submodules" or something)?
> 
> For now, as long as the_repository is being used as that "execution
> context object", and not a repository instance passed along the call
> chain, then the globalness of these virtual objects is maintained,
> so this change will not cause breakage (e.g., such an application
> may want to pick up the virtual object from the repository instance
> for the superproject and it may find it, but when traversing down to
> a submodule, the same virtual object may not be found in the
> repository instance for the submodule it descended into and working
> in, if you make it per repository and pass repository instance
> around along the call chain).  But eventually somebody will start
> saying "let's remove USE_THE_REPOSITORY_VARIABLE", at which point I
> am not sure how subtle such a bug would become.

I think the answer is very much "it depends". I can think of use cases
where it might be right to pretend that objects exist globally, but
there are also use cases where I think it makes sense to treat them as
repository-specific. The thing is: we can do the former if the virtual
objects are specific to a repository (by making them known to every
repository involved), but we can't do the latter if the virtual objects
are global.
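
To illustrate that asymmetry, here is a minimal sketch. All names in it
(`struct object_store`, its `shared` pointer, `store_pretend_object()`
and `find_cached_object()`) are made up for this mail and heavily
simplified; this is not the code from the series. It assumes each store
owns its own array of virtual objects and may optionally chain to
another store whose virtual objects it also wants to see:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified types; not Git's actual structures. */
#define MAX_CACHED 8

struct cached_object {
	const char *oid;  /* object name, simplified to a string */
	const char *data; /* object payload */
};

struct object_store {
	struct cached_object cached[MAX_CACHED];
	size_t nr;
	/* Optional store whose virtual objects are visible here, too. */
	struct object_store *shared;
};

/* Register a virtual object in exactly one store. Returns 0 on success. */
static int store_pretend_object(struct object_store *store,
				const char *oid, const char *data)
{
	assert(store);
	if (store->nr == MAX_CACHED)
		return -1;
	store->cached[store->nr].oid = oid;
	store->cached[store->nr].data = data;
	store->nr++;
	return 0;
}

/*
 * Look up a virtual object, first in the given store and then in any
 * store it chains to via `shared`. Returns NULL if it is not found.
 */
static const struct cached_object *
find_cached_object(const struct object_store *store, const char *oid)
{
	for (; store; store = store->shared)
		for (size_t i = 0; i < store->nr; i++)
			if (!strcmp(store->cached[i].oid, oid))
				return &store->cached[i];
	return NULL;
}
```

With such a layout, an object registered in the superproject's store
stays invisible to a submodule's store, whereas an object registered in
a store that both of them chain to is visible from either. A single
global array can only ever express the second behaviour.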

As far as I can see we only use this mechanism in git-blame(1) right now
to create a fake working tree commit. This mechanism does not cross into
submodules at all, and if it did I think we would want to create
separate fake working tree commits anyway: one for the parent
repository, and one for each submodule. So converting this mechanism to
be local to the repository (or rather local to an object store) feels
like the right thing to do to me.

But I agree with you in principle: we will have to be a lot more mindful
going forward when it comes to handling multiple repositories in memory.
We don't do this well right now, but as we convert more and more code so
that it doesn't use `the_repository` anymore we'll have to become better
at this indeed. From my perspective that isn't only true for these fake
working tree commits, but it's a general thing that we'll have to sort
out over time. It's inherent to the whole libification process.

I think for the most part we're fine right now, as we don't make use of
any of the new capabilities that libification brings with it in theory.
But once use cases start to come up that _do_ make use of this we will
have to think about those issues a whole lot more carefully.

Patrick