Re: Collaborative community interview for Git's 20th anniversary

On Mon, Apr 14, 2025 at 5:31 AM Kaartic Sivaraam
<kaartic.sivaraam@xxxxxxxxx> wrote:
>
> Hello all,
>
> As part of Git's 20th anniversary, we on the Git Rev News team are
> thinking of doing a community interview: we would share a list of
> questions that we've prepared, and we'd welcome answers from anyone
> in the community.  We would gather the answers up to a particular
> time (25/April or so) and then curate them into a special interview
> for this month's edition.  The questions are below.  Feel free to
> respond with your answers to this mail thread.  Let me know if I've
> missed any particularly compelling question.
>
>    - What's your favorite Git trick or workflow that you wish more people
>      knew about?

range-diff.  The ideas behind it ought to be the basis for code
review, IMO.  Commits should be the unit of review (including commit
messages as a fundamental and primary thing to be reviewed), and a
series of commits should be the unit of merging.  I dislike most code
review tools, because they get one or both of those things wrong.
Getting both of those things right naturally leads to range-diff or
something like it being a very important part of the workflow, at a
minimum for detecting which commits in a series are unmodified and
which have been updated and need to be further reviewed.
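
For anyone who hasn't tried it: given a series that was revised
between v1 and v2, something like the following (branch names made up
for illustration) shows which commits changed and how, including
commit message edits:

    $ git range-diff master series-v1 series-v2

or, spelling out the two ranges:

    $ git range-diff master..series-v1 master..series-v2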

>    - What was your worst Git disaster, and how did you recover from it?

My worst Git-related disaster wasn't with Git directly but with the
Git hosting software we used at a prior job, Gerrit.  'twas a
"startup" that was still forming good practices.  We had both a
production and a staging instance.  The staging instance was seeded
with a copy of production data so we could do scale testing...but that
seeding process was a multi-step manual thing; it hadn't been
automated.  One step was, as best I recall, "drop database gerrit",
followed by loading the production copy of the MySQL database (this
was long before NoteDb arrived).  And as many readers probably have
guessed by now, I was on the wrong host one day when I ran that
command.

The actual git repositories were still intact, but the review metadata
was toast.  Luckily, we had a backup from about 7 hours earlier, so we
could restore the older review metadata and, with some hackery,
reconcile the MySQL metadata with the newer repository contents.  And
since Gerrit emailed review comments to folks as they were posted, we
could tell people to look at their email for the pieces we couldn't
recover.

It was a really long night trying to fix things.  Some folks told me
that, just from looking at me, they thought I was going to throw up.
But I learned how wonderful it was to be at a company with blameless
post-mortems, and I appreciated the many folks who reached out to tell
me stories of mistakes they had made.  They were more interested in
whether we learned our lesson and put processes into place to prevent
repeats, and I definitely did both.

I did, of course, also get some good-natured ribbing, such as people
saying I got to play the part of little Bobby Tables once (see
https://xkcd.com/327/ if you don't know that reference).  I kindly
reminded them that I didn't drop a table -- I dropped the whole
database (plus, it wasn't injection, it was just running a command on
the wrong host).  Also, one of my colleagues helpfully modified the
prompt on production to be red and bold, "This is PROD Gerrit", and
the prompt on staging to be green, "This is staging Gerrit; it's okay
to drop database here!"  The prompts ended up not mattering, since I
automated the process and made sure it simply errored out if run on
prod instead of staging.  But the prompt persisted for many years
anyway, because I thought it was a hilarious way to poke fun at my
blunder.
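
For the curious, the safety check in the automation was conceptually
something like the sketch below; the hostname is hypothetical, as I
no longer remember the exact details:

    # Hypothetical sketch: refuse to run anywhere but the staging host.
    if [ "$(hostname -s)" != "gerrit-staging" ]; then
        echo "Refusing to run: this is not the staging host." >&2
        exit 1
    fi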

>    - If you could go back in time and change one design decision in Git,
>      what would it be?

The index.  For a few reasons.

1) Performance.

1a) The index is pervasive throughout the codebase, and while it works
great for small repositories, it means that many operations are O(size
of repository) instead of O(size of changes).  Sparse indices help,
but the code has to be carefully audited for sparse indices to work
with each codepath, and even then there tends to be a fallback of
just-load-everything-anyway, because the data structure doesn't lend
itself nicely to just expanding a little more.
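
For those who haven't tried them: in a cone-mode sparse checkout, the
sparse index can be enabled with something like the following (the
directory names are made up), though per the auditing issue above, how
much it helps depends on which commands you run:

    $ git sparse-checkout set --cone some/dir another/dir
    $ git config index.sparse true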

1b) An under-appreciated part of the performance improvements that
came from our new merge strategy, merge-ort, was dispensing with the
index as the primary data structure.  The index had two problems:
1b-1) First, it meant loading every path in the repository, which
would have prevented ort's optimization of avoiding recursion into
subtrees when unnecessary (an optimization that often made merges e.g.
50x faster).  Sparse indices didn't exist back then, but even if they
had, we would have had to complicate them significantly so that their
sparseness was determined by renames and the intersection of paths
modified on the two sides of history, rather than by user-defined
path rules; I think that would have been much more complicated than
just dispensing with the index as the primary data structure.
1b-2) Second, the use of the index in the old merge strategy,
merge-recursive, resulted in O(N^2) behavior, since entries (including
conflicted higher-order stages) had to be inserted in sorted order.
Deleting entries didn't have the same O(N^2) problem, due to some
tricks to queue the deletion for later, but attempting to do the same
for insertions was far from straightforward; I believe it would have
required making some other data structure primary and then forming the
index at the end.  (Note that the primary data structure, whatever it
is, cannot just be a list of things to insert; it also needs to be
checked for various properties intermingled with the insertions, and
those checks sometimes relied on the fact that the index was sorted,
for quick lookups.)
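
To put rough, illustrative numbers on the O(N^2): with N entries kept
in a sorted array, each insertion is an O(N) memmove, so inserting on
the order of N entries costs roughly N^2/2 entry moves.  At a million
paths that's on the order of 5*10^11 moves, versus ~N appends plus a
single O(N log N) sort had the insertions been queued and folded in
at the end.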

(Note that a tree-structured index rather than a linear index would
resolve these problems.  But retrofitting the entire codebase is
probably never going to happen...)

2) Cognitive Complexity.

The funny thing is that, although I say this, I use the index all the
time.  I use `git add -p` a lot.  I very much need to slice and dice
my changes into different commits, and I tend to have dirty changes
that I don't want pushed.

But slicing and dicing before things are committed, as opposed to
being able to slice and dice after, is a choice that adds a lot of
complexity to the user interface, and it does so even for users who
aren't interested in slicing and dicing commits.  Even today, git
doesn't have a sufficiently flexible set of tooling for slicing and
dicing commits after the fact to make a post-commit-slice-and-dice
workflow practical, but I suspect that some of the ideas from jj
would or could be much better than the methods I use in git today to
slice and dice commits.
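
For concreteness, the closest stock git comes to slicing after the
fact today is probably the fixup/autosquash dance:

    $ git commit --fixup=<commit-to-amend>
    $ git rebase -i --autosquash <base>

That folds new changes into old commits, but splitting an existing
commit still means an interactive rebase with an "edit" stop, a `git
reset HEAD^`, and re-staging the pieces with `git add -p`.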

>    - Which Git feature or improvement over the past 20 years do you think
>      had the biggest impact on your workflow?

Speed.

Being able to instantly switch branches (in smaller repos, sure, but
CVS and SVN couldn't pull it off even in small repos) was a game
changer.

>    - What Git problem that existed 10 years ago has been most
>      successfully solved?

Merging and rebasing with lots of renames (and, in general, merging
without a worktree or index).  I'm obviously a bit biased on this
point, but that doesn't mean I'm wrong.  ;-)  It used to be awful and
now works great.

Relatedly, merging without a worktree or index was problematic; you
had to either use an alternative merge strategy with limited
capabilities, or use something other than git (e.g. libgit2).  But now
git handles it well with its default merge strategy.
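
For those who haven't seen it: modern git's `git merge-tree
--write-tree` does a real merge of two branches without touching any
worktree or index, printing the OID of the resulting tree (branch
names made up):

    $ git merge-tree --write-tree topic1 topic2

A server or script can then turn that tree into a commit with `git
commit-tree`.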

>    - Which Git commands or workflows do you think are still misunderstood
>      or underutilized today?

range-diff is very under-utilized, but I already discussed that above.

>    - What's one Git based project, tool, or extension you think deserves
>      more recognition from the community?
>
>    - What Git feature or capability surprised you most when you first
>      discovered it?
>
>    - What's your boldest prediction about how version control might look
>      in another 20 years?

I'm more interested in what storms might be brewing along that path,
and what we might be able to do to avoid them.  In particular, some
questions and observations in that area:

  * With monorepos growing ever larger, do we have
    hard-to-workaround-or-fix design decisions that pose scaling
    challenges?  e.g.
    * the index data structure
    * per-directory .gitignore files, per-directory .gitattributes
      files, etc.
  * ...or do the prominent Git forges have hard-to-workaround-or-fix
    design decisions that'll give Git a reputation for not scaling?
    e.g.
    * making refs/pull/NNN/merge a public ref and updating it
      excessively and implicitly
  * Will we face a crisis of interest?  e.g.
    * git is currently written in C.  Even if that's not a liability
      already, on a timescale of decades I think it becomes one.
      Young developers probably don't want to learn C, and older ones
      who already know C may worry about C becoming the next Fortran
      or COBOL.
    * Companies employing git developers may decide "git already won"
      and redeploy those engineers on other problems.
  * Will the combination of the issues above result in folks who want
    improvements deciding that their best bet is not improving git,
    but creating/funding an alternative?  Will that snowball?

To me, the entry of new projects like jj and sapling suggests the
above are real concerns already, rather than just theoretical ones.
Both projects have compelling things that git lacks.  I like the
friendly competition, and the jj and sapling developers are awesome
to talk to at Git Merge conferences.  But there is a risk that this
friendly competition mirrors that of Git and Mercurial from years
past, with Git at some future point ending up on the other side of
that history, largely displaced by the alternatives.  I'd rather not
see that happen, but I sometimes wonder whether we're taking enough
measures to avoid marching towards such an outcome.




