Re: [PATCH] packfile: avoid access(3p) calls for missing packs

Junio C Hamano <gitster@xxxxxxxxx> · Fri, 16 May 2025 11:34:10 -0700

Patrick Steinhardt <ps@xxxxxx> writes:

> The function `add_packed_git()` adds a packfile to the known list of
> packs in case it exists. In case the packfile doesn't look like a pack
> though, or in case it doesn't exist, the function will simply return a
> `NULL` pointer.
>
> The ordering in which those checks are done is somewhat curious though:
> we first perform a couple of access(3p) syscalls for auxiliary files
> like ".keep" before we determine whether the ".pack" file itself exists.
> And if we see that the ".pack" file does not exist, we bail out and thus
> effectively discard all the information. This means that the access(3p)
> calls were a complete waste of time.
>
> The reason why we do this is likely because we reuse `p->pack_name` to
> derive the other files' names, as well, so that we only have to allocate
> this buffer once. As such, we have to compute the packfile name last so
> that it doesn't get overwritten by the other names.

I vaguely recall that the order of checks between .idx and .pack
were deliberately chosen, as we do not want to add and attempt to
use .pack file before its associated .idx file is ready since
without the latter the former is unusable at runtime.  

    Side note: It may have been more important back when the name of
    the packfile was a hash of names of the objects in the pack,
    which meant that when you repack a quiescent and fully repacked
    repository twice with different parameters, you could have ended
    up with a packfile whose name is the same but the contents as a
    bytestream (hence the object offset) are different, making a
    stale .idx not pointing into the correct position in the new
    .pack file.  These days the name of the packfile is based on the
    contents of the pack as a bytestream, so we no longer suffer
    in such a scenario.

But I wouldn't be surprised if checks for other auxiliary files were
added later to come next to .idx not .pack, exactly because .pack is
the primary thing.

> All of this likely doesn't matter in the general case: Git shouldn't end
> up looking too many nonexistent packfiles anyway.

In any case, we shouldn't attempt to use .pack when its associated
files are missing.  It is not about nonexistent but about incomplete
(including "in the process of being written").

> But there are edge
> cases where it _does_ matter. One such edge case that we have hit in our
> production systems was a stale "multi-pack-index" file that contained
> references to nonexistent packfiles. As there is no negative lookup
> cache this causes us to repeatedly look up the same packfiles via the
> multi-pack index, only to realize that they don't exist. This translated
> into hundreds of thousands of syscalls to access(3p) and stat(3p).

Ouch.  Multi-pack index is a more recent invention, and I am not
surprised if such bugs still lurk.  And as you hinted, a solution
may be a negative "known-not-to-exist" list, but given one such
stale midx, would we try to open the same missing packfiles over and
over within the same process?

> while the issue was entirely self-made because the multi-pack index
> should have been regenerated, we can still reduce the number of syscalls
> by 75% in the case of nonexistent packfiles by reordering these calls.

That sounds more like a band-aid, if we still do the remaining 25%
that we somehow know would be unnecessary.

Note that I do not know if the ordering that I think was originally
introduced for correctness still matters for correctness.  The first
paragraph of this message was written without consulting "git log",
lore archive, or even the current state of the source, but from
vague memory.