On Thu, Feb 13, 2025 at 08:13:38AM +0100, Patrick Steinhardt wrote:

> Turns out that you're hitting quite a funny edge case: the issue comes
> from you first deleting all preexisting refs in the target repository
> before recreating them. With "packed-refs", this leads to a repository
> that has neither a "packed-refs" file nor any loose ref, except for
> HEAD of course. But with "reftables" it doesn't:
>
> total 368
> -rw-r--r-- 1 pks users 332102 Feb 13 08:00 0x000000000001-0x000000000001-d8285c7c.ref
> -rw-r--r-- 1 pks users  32941 Feb 13 08:00 0x000000000002-0x000000000003-f1a8ebf9.ref
> -rw-r--r-- 1 pks users     86 Feb 13 08:00 tables.list
>
> We end up with two tables: the first one has been created when cloning
> the repository and contains all references. The second one has been
> created when deleting all references, so it only contains ref
> deletions. Because deletions don't have to carry an object ID, the
> resulting table is also much smaller. This has the effect that
> auto-compaction does not kick in, because we see that the geometric
> sequence is still intact. And consequently, all the checks that we
> perform when recreating the refs are way more expensive now because we
> have to search for conflicts.

That makes sense. But that's only 360k of reftables. Why does it take
so long to process?

It's been a while since I looked at reftables, but I'd think for a
normal lookup we should be able to binary-search or similar in each
table, find the relevant entries, and be done.

But I guess we can't easily do that for finding write conflicts,
because writing "foo/bar" means we need to care about "foo" and
"foo/bar/baz" as well. Finding "foo" is easy; we just break apart the
proposed refname and look for each leading path. But "foo/bar/baz" is
harder; we have to merge the tables to get an authoritative sorted list
of the current state, and then look for the entries adjacent to where
our proposed ref goes.

Looking at profiling output, we're spending a lot of time in
merged_iter_next_void() and its children, which supports that. But the
run-time scales linearly with the number of refs we're adding, which
implies that we're doing this iteration independently once per ref
we're adding. Instead, if we're given a list of N refs to write, we
should be able to sort that list and walk it in parallel with the
merged_iter result, making a single pass over the lists.

So I guess we'd need a version of refs_verify_refname_available() that
takes a list of refs rather than a single name. And then you could do
that single-pass walk. And as a bonus, you'd be able to de-dup the
leading prefixes you're looking for (e.g., most of your refs will start
with "refs/heads/", so we only need to check it once).
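Here's a rough, self-contained sketch of that single-pass shape, just
to make the idea concrete. It is not git code: the existing/proposed
arrays stand in for the merged_iter output and the transaction's sorted
updates, and check() is a made-up helper. The trick is a stack of names
that are still string prefixes of the current name; since the merged
stream is sorted, each name is pushed and popped at most once, so every
D/F conflict falls out of one merged walk:

  /*
   * Toy sketch, not git code: conflict-check a sorted list of proposed
   * refnames against a sorted stream of existing refs in one pass.
   */
  #include <stdio.h>
  #include <string.h>

  enum kind { EXISTING, PROPOSED };

  struct entry {
      const char *name;
      enum kind kind;
  };

  /* Stand-in for what the merged reftable iterator would yield. */
  static const char *existing[] = {
      "refs/heads/main",
      "refs/heads/topic/a",
      "refs/tags/v1.0",
  };

  /* Stand-in for the transaction's proposed creations, sorted. */
  static const char *proposed[] = {
      "refs/heads/main/nested",   /* D/F conflict with refs/heads/main */
      "refs/heads/topic",         /* D/F conflict with refs/heads/topic/a */
      "refs/tags/v2.0",           /* fine */
  };

  /* Does "name" live underneath "parent" (name == parent + "/" + ...)? */
  static int is_ancestor(const char *parent, const char *name)
  {
      size_t len = strlen(parent);
      return !strncmp(name, parent, len) && name[len] == '/';
  }

  /*
   * Names seen so far that are still string prefixes of the current
   * name. Because the merged stream is sorted, anything popped here can
   * never be a prefix of a later name, so each entry is pushed and
   * popped at most once: the whole check is a single pass.
   */
  static struct entry stack[64];
  static int top;

  static void check(struct entry cur)
  {
      int i;

      while (top && strncmp(cur.name, stack[top - 1].name,
                            strlen(stack[top - 1].name)))
          top--;

      for (i = 0; i < top; i++)
          if (stack[i].kind != cur.kind &&
              is_ancestor(stack[i].name, cur.name))
              printf("conflict: %s vs %s\n", stack[i].name, cur.name);

      if (top < (int)(sizeof(stack) / sizeof(stack[0])))
          stack[top++] = cur;
  }

  int main(void)
  {
      size_t ne = sizeof(existing) / sizeof(existing[0]);
      size_t np = sizeof(proposed) / sizeof(proposed[0]);
      size_t i = 0, j = 0;

      /* Plain two-way merge of the two sorted inputs. */
      while (i < ne || j < np) {
          struct entry cur;

          if (j >= np ||
              (i < ne && strcmp(existing[i], proposed[j]) < 0)) {
              cur.name = existing[i++];
              cur.kind = EXISTING;
          } else {
              cur.name = proposed[j++];
              cur.kind = PROPOSED;
          }
          check(cur);
      }
      return 0;
  }

In the real thing you'd of course feed this from the merged_iter and
the transaction's updates rather than static arrays, and the same walk
is where you'd de-dup the shared leading prefixes like "refs/heads/".
But the shape is the point: one pass over the merged list for all N
proposed refs, rather than one full iteration per ref.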
> That being said, I found an optimization in how we parse ref updates
> in git-update-ref(1): when we see an exact object ID, we can skip the
> call to `repo_get_oid()`. This function is quite expensive because it
> doesn't only parse object IDs, but revisions in general. This didn't
> have much of an impact on "packed-refs", because there are no
> references in the first place. But it did have a significant impact on
> the "reftable" backend, where we do have deleted references.

Yes, we similarly spend a lot of time there. But the problem isn't
quite that repo_get_oid() also parses revisions. When we see a full
object id we return it quickly. But you can fall afoul of 798c35fcd8
(get_sha1: warn about full or short object names that look like refs,
2013-05-29), which does a full dwim_ref() lookup for each one! Try:

  git -c core.warnAmbiguousRefs=false update-ref --stdin

to disable that.

Internally there's a warn_on_object_refname_ambiguity flag that some
code (like cat-file) sets when it knows it may be asked to resolve a
lot of entries that are likely to be oids.

I kind of wonder if we should ditch that warning. But if we wanted to
keep it, maybe adding a flag to get_oid_with_context() would be a less
hacky way of disabling that warning on a per-call basis.

-Peff