[PATCH v2 12/13] bulk-checkin: don't fetch promised objects on write

Patrick Steinhardt <ps@xxxxxx> · Fri, 25 Apr 2025 09:09:03 +0200

When writing objects via the bulk-checkin subsystem we first try to
figure out whether an object already exists in the repository before we
append it to the packfile. This check uses `repo_has_object_file()`,
which knows to fetch promised objects by default. As such, if we were
about to write an object that is promised, we'd fetch the object via the
promisor and then skip writing it. This behaviour doesn't seem sensible:
it should be significantly faster to take the locally-written object
instead of faulting in objects from the promisor remote.

There is one counter-argument here: it could be that the bulk-checkin
mechanism will end up writing an object to disk whose content collides
with the object in the promisor remote. The local repository and its
promisor remote would now have two objects with different contents but
the same name. But the resulting behaviour would be wrong both when we
prefer the fetched object, and also when prefering the written object:

  - When we prefer the written object we will now see a different world
    compared to everyone else who has the promised object.

  - When we prefer the fetched object we will end up with an object that
    is different compared to what the user just asked us to write. This
    seems even worse compared to the first scenario.

In an ideal world, we would protect against this by fetching the
promised object and then performing a collision check. But this feels
exceedingly expensive and ultimately rather pointless, as more common
writing paths like `write_loose_object()` don't protect against this
scenario either. And in any case we're talking about a local user that
has write access to the repository anyway, so if they want to do any
kind of mischieve they already can.

Change the behaviour so that we don't fault in the object via the
promisor remote. We shouldn't have to worry about hash collisions too
much (yet) as the mechanism is only used during local writes anyway. And
even if there was a collision, prefering local data that we were just
asked to write over data controlled by a potentially untrusted remote
feels like the better failure mode.

Signed-off-by: Patrick Steinhardt <ps@xxxxxx>
---
 bulk-checkin.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/bulk-checkin.c b/bulk-checkin.c
index c31c31b18d8..b182c456d69 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -130,7 +130,7 @@ static void flush_batch_fsync(void)
 static int already_written(struct bulk_checkin_packfile *state, struct object_id *oid)
 {
 	/* The object may already exist in the repository */
-	if (repo_has_object_file(the_repository, oid))
+	if (has_object(the_repository, oid, HAS_OBJECT_RECHECK_PACKED))
 		return 1;
 
 	/* Might want to keep the list sorted */

-- 
2.49.0.901.g37484f566f.dirty