First of all, thanks for the very thoughtful response.
On 5/16/25 8:46 PM, K Jayatheerth wrote:
> Correct: the index maintains `struct cache_entry` entries in sorted
> order by file name. This is essential for fast lookup, diffing, and
> pathspec-based operations.
Somehow I hadn't realized that the index is also kept in memory.
However, my concern about index lookups still applies when the entries,
in memory or on disk, aren't all the same size: how can you do a
binary search if you don't know where the middle of the index is?
Maybe I'd understand if I had studied the source code in more detail.
> Not exactly. Git keeps the index in memory as a sorted array, so adding
> a new entry doesn't require a full resort, just a binary search to find
> the right insertion point. Only when the index is written to disk does
> it serialize the in-memory array, which is already sorted.
I see. However, my comment above still applies.
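Though maybe I can answer my own question: if the in-memory array holds
pointers to the entries, the array elements are all the same size even
when the entries aren't. A minimal sketch of the lookup I'm imagining
(simplified names; I haven't checked this against Git's actual source):

#include <string.h>

struct cache_entry {
    char name[64];  /* path, abbreviated; the real struct differs */
    /* ... stat data, mode, object id, etc. ... */
};

/*
 * The index as a sorted array of *pointers* to variably sized
 * entries: the pointer array has fixed-size elements, so "the
 * middle" is well defined. Returns the position where "path" lives,
 * or where it should be inserted to keep the array sorted.
 */
static size_t index_pos(struct cache_entry **cache, size_t nr,
                        const char *path)
{
    size_t lo = 0, hi = nr;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (strcmp(cache[mid]->name, path) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}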
>> If so, this seems like a lot of work, especially since not all the
>> entries are the same size.
> That's true in principle, but in practice the memory layout of
> `cache_entry` objects and memory mapping make it quite efficient,
> especially since typical index sizes are modest.
I guess you're right since the Git universe isn't clamoring
about poor performance when accessing the index.
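Out of curiosity, I tried to convince myself of the memory-mapping
point. If I'm reading the format documentation right, the file starts
with a 12-byte header: a "DIRC" signature, a version, and an entry
count, all big-endian. A throwaway reader (my code, not Git's) would
be about this much:

#include <arpa/inet.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    struct stat st;
    int fd = open(".git/index", O_RDONLY);

    if (fd < 0 || fstat(fd, &st) < 0)
        return 1;

    /* Map the whole file; the kernel pages it in on demand. */
    unsigned char *map = mmap(NULL, st.st_size, PROT_READ,
                              MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED)
        return 1;

    /* 12-byte header: "DIRC", version, entry count (big-endian). */
    if (memcmp(map, "DIRC", 4)) {
        fprintf(stderr, "not an index file\n");
        return 1;
    }

    uint32_t version, entries;
    memcpy(&version, map + 4, 4);
    memcpy(&entries, map + 8, 4);
    printf("index v%u, %u entries\n", ntohl(version), ntohl(entries));

    munmap(map, st.st_size);
    close(fd);
    return 0;
}

Nothing variably sized has to be parsed just to get oriented, which I
suppose is part of why the mapping approach is cheap.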
> You're describing a secondary structure, like a sparse index or
> log-structured merge pattern. It has its appeal, particularly for
> large repositories with high-churn working trees.
Right. I would think it would also add some degree of safety, since
the index file would only be rewritten when garbage collection is
done. The "index index" would be much more volatile, but since it
could be easily and quickly recreated at any time, there would be no
added danger.
> Many Git operations rely on the index being sorted. Reads would
> have to scan or use your "index index", which introduces more I/O and
> complexity.
I'm not convinced it would result in more I/O. The "index index" file
would be written more often, true, but each entry in it is much smaller
than an entry in the regular index.
Maybe a better approach would be to keep the "index index" only in
memory. This would use less memory than the current sorted array of
full entries, while access to the regular index would stay fast, since
each "index index" entry contains enough data for random access to the
desired entry in the on-disk index.
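Concretely, the in-memory table I'm imagining looks something like this
(every name here is hypothetical; none of this exists in Git):

#include <stdint.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical "index index" entry: fixed-size, holding just enough
 * to locate the full, variably sized entry in the on-disk index.
 */
struct index_index_entry {
    uint32_t offset;   /* byte offset of the entry in .git/index */
    uint32_t length;   /* size of that on-disk entry in bytes */
    char name[64];     /* path, abbreviated for the sketch */
};

/*
 * Binary-search the small sorted in-memory table, then fetch the one
 * matching on-disk entry with a single positioned read. Returns the
 * number of bytes read, or -1 if the path isn't present.
 */
static ssize_t read_ondisk_entry(int index_fd,
                                 const struct index_index_entry *tbl,
                                 size_t nr, const char *path, void *buf)
{
    size_t lo = 0, hi = nr;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int cmp = strcmp(tbl[mid].name, path);

        if (!cmp)
            return pread(index_fd, buf, tbl[mid].length,
                         tbl[mid].offset);
        if (cmp < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return -1;
}

Each table entry is a fixed 72 bytes no matter how large the real
entry is, which is where I'd expect the memory savings to come from.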
On the third hand, all this might be unnecessary since the way
Git currently works is good enough for most (maybe all) people
most (maybe all) of the time.
> This assumes laziness in cleanup, which is reasonable for append-only
> systems. But Git today avoids keeping dead entries around for clarity
> and correctness (especially under concurrent access).
If Postgres can do it, so could Git.
> Agreed, but it's worth noting that sorting a relatively small
> in-memory structure (as Git does now) is often cheaper than
> maintaining two files in sync (your full index and "index index").
With my in-memory "index index", this would become easier.
> You're not missing much; in fact, Git has features like the "split
> index", "untracked cache", and "index v4" that address similar
> performance issues through other means. Your idea would likely help in
> edge cases (very large repos, massive parallelism), but the added
> complexity and I/O overhead of maintaining multiple files likely
> outweigh the benefits for the common case.
That is indeed the question. I admittedly have no data about either how
large a repo would have to be for a change like this to have any
effect, or how many such repos exist. I notice that the current Git
repo has 4638 entries, which I suppose really isn't all that large.
Thanks again,
Jon