Re: How GitLab does/doesn't need change IDs (was Re: Semantics of change IDs)

Nico Williams <nico@xxxxxxxxxxxxxxxx> · Wed, 23 Apr 2025 13:59:29 -0500

On Wed, Apr 23, 2025 at 02:58:49PM +0200, Toon Claes wrote:
> Nico Williams <nico@xxxxxxxxxxxxxxxx> writes:
> 
> At GitLab we keep track of the commit IDs a branch has been (maybe only
> if there is a Merge Request for that branch, I'm not sure). [...]

Do you mean "we keep track of the commit _hashes_ a branch has _seen_"?
But it can't be commit hashes, and there are no commit IDs, so GL could
be assigning synthetic, internal commit IDs based on commit similarity,
which would support Junio's and Theodore's point that similarity
checking can be enough.

> > The point is that GL demonstrates that these things can be done.  And I
> > don't see how a change ID would have helped GL much except in cases
> > where one re-does all the commits with different subject lines etc, but
> > leaves the actual patches mostly the same.  Now it does happen that I
> > split and squash commits, but it's rare that I completely redo them.
> 
> That's because GL stores history about a branch ref (outside the Git
> object/ref database). If you don't do that, you can't. Having a
> Change-Id embedded in the commit, retains that information in Git's DB.

I.e., GL has an internal reflog on the server side.  I've sometimes
wished that I could push and fetch reflogs (or subsets thereof anyways).

When doing code reviews I use [local, obv.] reflogs to see the diffs
between the version of a branch that I fetched and reviewed earlier and
the latest one that I just fetched and am reviewing.  Generally I don't
need to see any versions I never fetched.  Occasionally I've wished I
could fetch those other versions, but since there are no server-side
refs for them, I can't.  [Or maybe I'm about to learn of some feature I
didn't know about :)]
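
For concreteness, here is a rough sketch of that local-reflog workflow,
written as a small Python helper that only builds the git command lines.
The ref names "origin/topic" and "origin/main" are hypothetical, and it
assumes the remote-tracking ref has a reflog and that exactly one fetch
has moved it since the last review, so that @{1} is the tip reviewed
earlier.

```python
import subprocess


def interdiff_commands(tracking_ref, base):
    """Build git invocations comparing the previous and current fetched tips.

    tracking_ref: a remote-tracking branch, e.g. "origin/topic" (hypothetical)
    base:         the branch the topic is based on, e.g. "origin/main"
                  (hypothetical)
    """
    # <ref>@{1} is the reflog entry for the value the ref had before its
    # most recent update, i.e. before the last fetch that moved it.
    previous = f"{tracking_ref}@{{1}}"
    return [
        # Plain diff between the two tips.
        ["git", "diff", previous, tracking_ref],
        # Commit-by-commit comparison of base..previous and base..current.
        ["git", "range-diff", base, previous, tracking_ref],
    ]


def run_all(commands):
    """Run each command in the current repository, stopping on failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    for argv in interdiff_commands("origin/topic", "origin/main"):
        print(" ".join(argv))
```

This only works for versions that were actually fetched, which is the
limitation described above: a version pushed and then replaced between
two fetches never shows up in the local reflog.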

I agree that change IDs / commit IDs in commit headers can help one keep
track of versions of a branch w/o a server-side reflog, but how would
you keep track of their chronology?  I.e., how do you know which is
version 1, which is version 2, ..., and which is version N-1?  (Version N
being the head of the branch.)  If you don't index these then finding
them is a full table scan, and if you index them then you've implemented
a server-side reflog.
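
To illustrate the indexing point, a minimal in-memory sketch in Python.
The class and method names are made up for illustration, and a real
server would persist this somewhere.  The index is just an append-only
list of tips per ref, which is essentially what a reflog is.

```python
from collections import defaultdict


class RefVersionIndex:
    """Append-only record of the tips a ref has pointed at, oldest first."""

    def __init__(self):
        # ref name -> list of commit hashes, in the order they were seen
        self._tips = defaultdict(list)

    def record(self, ref, new_tip):
        """Record a ref update and return the 1-based version number."""
        tips = self._tips[ref]
        # Ignore a repeat of the current tip so versions stay distinct.
        if not tips or tips[-1] != new_tip:
            tips.append(new_tip)
        return len(tips)

    def version(self, ref, n):
        """Return the tip that was version n of ref (1-based)."""
        return self._tips[ref][n - 1]

    def versions(self, ref):
        """Return [(1, tip1), (2, tip2), ...] for ref, oldest first."""
        return list(enumerate(self._tips.get(ref, []), start=1))


if __name__ == "__main__":
    idx = RefVersionIndex()
    idx.record("refs/heads/topic", "aaa")
    idx.record("refs/heads/topic", "bbb")
    print(idx.versions("refs/heads/topic"))
```

With this, "which is version 1?" is a direct lookup rather than a scan
over every commit that carries a given change ID.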

Which makes me think that all that's needed for a good CR tool here is
a) a server-side reflog, and b) similarity checking for commits.  (a)
doesn't seem like a radical idea (it can be implemented with server-side
hooks), and (b) is also not radical given that Git already detects file
renames and copies using similarity checking.
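
A rough Python sketch of both pieces, under stated assumptions.  For (a),
parse_post_receive() reads the "<old> <new> <refname>" lines that a
post-receive hook gets on stdin.  For (b), match_commits() pairs commits
from two versions of a branch by patch text similarity.  The similarity
measure here is Python's difflib, used as a stand-in, and is not the
algorithm Git uses for rename detection.  The 0.6 threshold and the
greedy pairing are arbitrary choices for illustration.

```python
import difflib


def parse_post_receive(stdin_text):
    """Parse post-receive hook input into (ref, old, new) tuples."""
    updates = []
    for line in stdin_text.splitlines():
        if not line.strip():
            continue
        # Each line is "<old-value> <new-value> <ref-name>".
        old, new, ref = line.split(" ", 2)
        updates.append((ref, old, new))
    return updates


def similarity(patch_a, patch_b):
    """Line-based similarity ratio in [0, 1] between two patch texts."""
    matcher = difflib.SequenceMatcher(
        None, patch_a.splitlines(), patch_b.splitlines()
    )
    return matcher.ratio()


def match_commits(old_patches, new_patches, threshold=0.6):
    """Pair each old commit with its most similar unmatched new commit.

    old_patches / new_patches map a commit hash to its patch text.
    Commits with no candidate at or above the threshold stay unmatched.
    """
    pairs = {}
    taken = set()
    for old_id, old_patch in old_patches.items():
        best_id, best_score = None, threshold
        for new_id, new_patch in new_patches.items():
            if new_id in taken:
                continue
            score = similarity(old_patch, new_patch)
            if score >= best_score:
                best_id, best_score = new_id, score
        if best_id is not None:
            pairs[old_id] = best_id
            taken.add(best_id)
    return pairs


if __name__ == "__main__":
    old = {"a1": "+foo\n+bar\n+baz\n", "a2": "-x\n+y\n"}
    new = {"b1": "-x\n+y\n", "b2": "+foo\n+bar\n+qux\n"}
    print(match_commits(old, new))
```

A hook would feed each parsed update into whatever stores the per-ref
history, and the matching step would run when a reviewer asks how
version K of a branch relates to version K+1.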