Semantics of change IDs (Re: Gerrit, GitButler, and Jujutsu projects collaborating on change-id commit footer)

Nico Williams <nico@xxxxxxxxxxxxxxxx> · Wed, 9 Apr 2025 11:54:10 -0500

On Wed, Apr 09, 2025 at 08:19:24AM -0400, Theodore Ts'o wrote:
> On Tue, Apr 08, 2025 at 10:53:06AM -0500, Nico Williams wrote:
> > I'm not keen on CR tools "intuiting" from.. similarity checks.
> > [...]
> 
> I'm not keen on fields that can have essentially random semantics.
> Part of this is because today Change-ID is in the footer, and so
> humans can randomly set it to any value they like.  Sometimes they cut
> and paste footers, and so completely unrelated commits have the same
> Change-Id which show up when you do a Gerrit lookup by Change-Id.
> Admittedly, this aspect gets better if we shove it into the git commit
> header.
>
> Part of it is because some tools will edit the Change-Id when doing a
> cherry-pick.  [...]

I was only proposing to leave some details out, not to have completely
undefined semantics.  The particular details we might want to leave out
are about resolving change IDs to URIs.  In particular this editing of
change IDs on cherry-pick you mention has to not be permitted, or
perhaps a new change ID could be added -- i.e., are these headers
single-valued or multi-valued?

Let's nail down the semantics of these change ID headers.  Here is a
proposal to bang on:

 - change IDs get preserved on cherry-pick and on `pick`s in rebases

 - users can manually remove or change these change IDs, naturally,
   though generally they would not

 - the actual change IDs are either free-form or they are URIs -- pick
   one, but if they are URIs they should be URIs to CRs, and approved
   CRs should perhaps have links to integration reports etc.

 - there should be one header for a change ID for the patch series (the
   MR/PR/whateverR); patch series IDs can be shared by many commits in
   one branch, so they are not in any way unique

 - there may be one header for a change ID for each commit, which should
   be unique in any _branch_, but not unique in any repo (due to back-
   and forward-ports for example)

 - there should be another header to list change IDs from which a commit
   was derived that nonetheless has a different commit change ID

 - these headers should be multi-valued to handle squashes and merges

 - if a commit change ID is missing but a patch series change ID is
   present then similarity checks could be used to link multiple
   versions of any one such commit
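
To make the list above easier to bang on, here is a minimal sketch of
those semantics in Python.  It is only an illustration under stated
assumptions: the field names (series_ids, change_ids, derived_from)
are made up and do not correspond to any existing git header, and the
squash behaviour is just one possible reading of "multi-valued".

```python
from dataclasses import dataclass, replace
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Commit:
    oid: str
    # All three header names below are hypothetical, and all are multi-valued.
    series_ids: Tuple[str, ...] = ()    # patch series (MR/PR) IDs, shared
    change_ids: Tuple[str, ...] = ()    # per-commit IDs, unique per branch
    derived_from: Tuple[str, ...] = ()  # IDs this commit was derived from


def cherry_pick(commit: Commit, new_oid: str) -> Commit:
    """A cherry-pick (or rebase `pick`) keeps every change ID header."""
    return replace(commit, oid=new_oid)


def _union(values: Iterable[Tuple[str, ...]]) -> Tuple[str, ...]:
    """Order-preserving union of several multi-valued headers."""
    seen = {}
    for group in values:
        for value in group:
            seen.setdefault(value, None)
    return tuple(seen)


def squash(commits: List[Commit], new_oid: str) -> Commit:
    """One reading of 'multi-valued': a squash carries all inputs' IDs."""
    return Commit(
        oid=new_oid,
        series_ids=_union(c.series_ids for c in commits),
        change_ids=_union(c.change_ids for c in commits),
        derived_from=_union(c.derived_from for c in commits),
    )


def duplicated_change_ids(branch: List[Commit]) -> List[str]:
    """Commit change IDs that appear on more than one commit of a branch."""
    counts = {}
    for c in branch:
        for cid in c.change_ids:
            counts[cid] = counts.get(cid, 0) + 1
    return [cid for cid, n in counts.items() if n > 1]
```

A tool could run something like duplicated_change_ids() over a single
branch to flag the cut-and-pasted footer case, while the same change
ID showing up on different branches (back- and forward-ports) would be
fine.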

Optional:

 - a commit change ID could be used as a ref to an object that lists the
   commits that have that change ID

 - a patch series change ID could be used as a ref to an object that lists
   the head commit of that patch series in every branch that contains
   it
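
As a rough sketch of what the two optional "ref to an object that
lists commits" ideas could contain, here is some Python that builds
both listings in memory.  The input shapes and function names are
assumptions made up for illustration; nothing here is an existing git
object format or ref namespace.

```python
from typing import Dict, Iterable, List, Tuple

# (commit oid, IDs carried by that commit); a hypothetical input shape.
Tagged = Tuple[str, Tuple[str, ...]]


def commits_by_change_id(commits: Iterable[Tagged]) -> Dict[str, List[str]]:
    """Map each commit change ID to every commit that carries it."""
    index: Dict[str, List[str]] = {}
    for oid, change_ids in commits:
        for cid in change_ids:
            index.setdefault(cid, []).append(oid)
    return index


def series_heads(branches: Dict[str, List[Tagged]]) -> Dict[str, Dict[str, str]]:
    """Map each patch series ID to its head commit in every branch.

    Each branch is a list of (oid, series IDs) ordered oldest first, so
    the last commit seen for a series in a branch is that series' head.
    """
    heads: Dict[str, Dict[str, str]] = {}
    for branch, commits in branches.items():
        for oid, series_ids in commits:
            for sid in series_ids:
                heads.setdefault(sid, {})[branch] = oid  # later commit wins
    return heads
```

Because a change ID is not unique across a repo, each index entry is a
list (or a per-branch map) rather than a single commit.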

> Perhaps one approach might be that the heuristics that you hate being
> used as an automated way to sort it out, might get used to set the
> semantics at commit time, with perhaps a way for the user to override
> the heuristics, or where the user has to explicitly acknowledge that
> the heuristics correctly noticed that the patch has changed radically
> and maybe the Change-Id shouldn't be retained any more?

Yes, heuristics can be used to help the user make such decisions.  I've
no issue with that.
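
For instance, a commit-time helper might look roughly like the sketch
below.  This is an assumption-laden illustration, not a proposal for
the actual heuristic: it uses Python's difflib line similarity, and
the function name and the 0.5 threshold are invented for the example.
It only produces a suggestion; the user still decides.

```python
import difflib
from typing import Tuple


def suggest_keep_change_id(old_patch: str, new_patch: str,
                           threshold: float = 0.5) -> Tuple[float, bool]:
    """Return (similarity, suggestion) for two versions of a patch.

    The suggestion is True when the patches look similar enough that
    keeping the change ID seems reasonable.  The threshold is an
    arbitrary illustrative value, and the user can always override it.
    """
    matcher = difflib.SequenceMatcher(
        None, old_patch.splitlines(), new_patch.splitlines())
    ratio = matcher.ratio()
    return ratio, ratio >= threshold
```

A tool would show the similarity and ask the user to confirm dropping
or keeping the change ID when the suggestion is to drop it.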

> Finally, perhaps there should be some discussion about whether we
> think git should be maintaining indexes based on the Commit-Id.

If they can be refs, then they should be.  Since they can't be unique
the ref should be to an object listing the actual commits (see above).

There could also be a non-ref index for these.

> Personally, cutting and pasting a random 17 character ID is painful
> and annoying, and when I see it in my shell history, I have no idea
> what might have been going on.  So if I need to cut and paste a
> Commit-Id, I might as well cut and paste the one-line commit summary,
> and do a "git log --grep" search based on that.  But if the Commit-Id
> is indexed, then maybe it might be more useful?  I dunno....

+1

> Well, see above about some possible semantics.  I'm *still* not
> convinced even with the better-defined semantics it's worth storing
> the extra baggage in the commit header.  But that's more of a
> value/philosophical question, much like how we "could" store explicit
> file rename information in the git commit, but in the very early days
> of the git design history, although BitKeeper did track file names,
> Linus consciously decided to go down a much simpler path.  So that's
> really more of a SMTP vs X.400 preference of simplicity versus
> complexity in the protocol versus implementation, which is something
> where people of good will might disagree --- and there Junio's
> opinions matter far more than mine.  :-)

I don't find file rename heuristics to be "simple", and they're often
wrong, though I've fully internalized that copies and renames have to be
done alone in separate commits with no contents changes so as to make
incorrect rename determinations much less likely.

Nico
--