[PATCH] fetch-pack: re-scan when double-checking graph objects

Jeff King <peff@xxxxxxxx> · Sun, 24 Aug 2025 01:00:40 -0400

The fetch code tries to avoid asking the remote side for an object we
already have. It does this by traversing recent commits reachable from
our refs looking for matches. Commit 5d4cc78f72 (fetch-pack: die if in
commit graph but not obj db, 2024-11-05) introduced an extra check
there: if we think we have an object because it's in the commit graph,
we double-check that we actually have it in our object database with a
call to odb_has_object().

But that call does not pass any flags, and so the function won't call
reprepared_packed_git() if it does not find the object. That opens us up
to the usual race against some other process repacking the odb:

  1. We scan the list of packs in objects/pack but haven't yet opened them.

  2. Somebody else packs the object into a new pack (which we don't know
     about), and deletes the old pack it was in.

  3. Our odb_has_object() calls tries to open that old pack, but finds it
     is gone. We declare that we don't have the object.

And this causes us to erroneously complain and abort the fetch, thinking
our commit-graph and object database are out of sync. Instead, we should
pass HAS_OBJECT_RECHECK_PACKED, which will add a new step:

  4. We re-scan the pack directory again, find the new pack, and locate
     the object.

Often the fetch code tries to avoid these kinds of re-scans if it's
likely that we won't have the object. If the other side has told us
about object X and we want to know if we have it, we'll skip the re-scan
(to avoid spending a lot of effort when there are many such objects). We
can accept the racy false negative in that case because the worst case
is that we ask the other side to send us the object.

But this is not one of those cases. These are objects which are
accessible from _our_ refs, and which we already found in the commit
graph file. We should have them, and if we don't, we'll die()
immediately. So the performance impact is negligible, and getting the
right answer is important.

There's no test here because it's inherently racy. In fact, I had
trouble even developing a minimal test. The problem seen in the wild can
be produced like this:

  # Any git.git mirror which supports partial clones; I think this
  # should work with any repo that contains submodules, but note that
  # $obj below is specific to this repo
  url=https://github.com/git/git.git

  # This is a commit that is not at the tip of any branches (so after
  # we have it, we'll still have some commits to fetch).
  obj=cf6f63ea6bf35173e02e18bdc6a4ba41288acff9

  git init
  git fetch --filter=tree:0 $url $obj:refs/heads/foo
  git checkout foo
  git commit-graph write --reachable
  git fetch $url

What happens here is that the initial fetch grabs that older commit (and
its ancestors) but no trees or blobs, and the subsequent checkout grabs
the necessary trees and blobs just for that commit. The final fetch
spawns a long sequence of child fetches due to fetch_submodules(), which
wants to check whether there have been any gitlink modifications which
should trigger a fetch of the related submodule (we'll leave aside the
irony that we did not even check out any submodules yet).

That series of fetches causes us to accumulate packs, which eventually
triggers background maintenance to run. That repacks all-into-one, and
the pack containing $obj goes away in favor of a new pack. And then the
fetch eventually fails with:

  fatal: You are attempting to fetch cf6f63ea6bf35173e02e18bdc6a4ba41288acff9, which is in the commit graph file but
not in the object database.

In the scenario above, the race becomes likely because of the long
series of quick fetches. But I _think_ the bug is independent of partial
clones entirely, and you could run into the same thing with a single
fetch, some other process running "git repack" simultaneously, and a bit
of bad luck. I haven't been able to reproduce, though. I'm not sure if
that's because there's some mis-analysis above, or if the race window is
just small enough that it's hard to trigger.

At any rate, re-scanning here seems like an obviously correct thing to
do with no downside, and it does fix the partial-clone case shown above.

Reported-by: Дилян Палаузов <dilyan.palauzov@xxxxxxxxx>
Signed-off-by: Jeff King <peff@xxxxxxxx>
---
Originally discussed in:

  https://lore.kernel.org/git/e7a2fdff63d9a90ef4dc1341fa642fff5197b64a.camel@xxxxxxxxx/

 fetch-pack.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fetch-pack.c b/fetch-pack.c
index 20e5533b21..6ed5662951 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -143,7 +143,8 @@ static struct commit *deref_without_lazy_fetch(const struct object_id *oid,
 	commit = lookup_commit_in_graph(the_repository, oid);
 	if (commit) {
 		if (mark_tags_complete_and_check_obj_db) {
-			if (!odb_has_object(the_repository->objects, oid, 0))
+			if (!odb_has_object(the_repository->objects, oid,
+					    HAS_OBJECT_RECHECK_PACKED))
 				die_in_commit_graph_only(oid);
 		}
 		return commit;
-- 
2.51.0.364.gbc6ae1dd20