On Fri, May 16, 2025 at 11:34:10AM -0700, Junio C Hamano wrote: > > The reason why we do this is likely because we reuse `p->pack_name` to > > derive the other files' names, as well, so that we only have to allocate > > this buffer once. As such, we have to compute the packfile name last so > > that it doesn't get overwritten by the other names. > > I vaguely recall that the order of checks between .idx and .pack > were deliberately chosen, as we do not want to add and attempt to > use .pack file before its associated .idx file is ready since > without the latter the former is unusable at runtime. It is definitely intentional and important that we discover packs by their .idx file first. The writing side uses the creation of the .idx file (via atomic rename) as the final step in making a pack accessible. Before that, a reader could see a .pack without its various accessory files (and so not realize it has a .keep file, for example). > Side note: It may have been more important back when the name of > the packfile was a hash of names of the objects in the pack, > which meant that when you repack a quiescent and fully repacked > repository twice with different parameters, you could have ended > up with a packfile whose name is the same but the contents as a > bytestream (hence the object offset) are different, making a > stale .idx not pointing into the correct position in the new > .pack file. These days the name of the packfile is based on the > contents of the pack as a bytestream, so we no longer suffer > in such a scenario. I don't think reading the .idx file first was sufficient to protect us there, since the .pack could be replaced racily after the reader opened the .idx. I don't recall if we were subject to that race back then (and it just didn't come up enough to worry about), or if repack would throw away an identically-named file. Anyway, as you note, it's no longer an issue. But I think that is mostly orthogonal to how the auxiliary files are handled. I think Patrick's guess is correct that the order we have is simply because it was most convenient to end up with ".pack" in the name. > > All of this likely doesn't matter in the general case: Git shouldn't end > > up looking too many nonexistent packfiles anyway. > > In any case, we shouldn't attempt to use .pack when its associated > files are missing. It is not about nonexistent but about incomplete > (including "in the process of being written"). This shouldn't be a race for files being written. The .idx is written last, so the .pack should always be there. And the midx is written after that, so again, the pack should always be there. At least for newly written packs. We might see the opposite: packs going away because they were repacked (and perhaps even a new midx generated). In that case you might see a race like: - process R opens the existing midx, which mentions pack P (but does not yet open P) - process W repacks, deleting P and generating a new midx with P' - process R wants object X; the midx claims it is in pack P, so it tries to open that but fails What should happen here is that R will fail to find the object at all, should hit reprepare_packed_git(), and will then pick up the new pack. I don't recall offhand whether reprepare_packed_git() will open the new midx. If it doesn't, we'd still find the object via the actual pack .idx, but we may end up consulting the now-stale midx over and over (so we'll get the right answer, but there may be some room for optimization). > > while the issue was entirely self-made because the multi-pack index > > should have been regenerated, we can still reduce the number of syscalls > > by 75% in the case of nonexistent packfiles by reordering these calls. > > That sounds more like a band-aid, if we still do the remaining 25% > that we somehow know would be unnecessary. Yeah, that was my gut reaction, too. add_packed_git() is not optimized[1] at all. But it shouldn't have to be, because it's meant to be called once per process. Even with the re-ordering, we'd still make a bunch of pointless stat() calls for the .pack file we know is not there. The code in prepare_midx_pack() converts a numeric pack id (referenced inside the midx) into a "struct packed_git" pointer, caching the results in multi_pack_index->packs. That field holds NULL for "we have not looked it up yet" or a valid pointer to a packed_git. It probably needs to hold a third state: "we tried and failed". Something like this (large untested) patch: diff --git a/midx.c b/midx.c index 3d0015f782..354b1f886c 100644 --- a/midx.c +++ b/midx.c @@ -405,7 +405,7 @@ void close_midx(struct multi_pack_index *m) munmap((unsigned char *)m->data, m->data_len); for (i = 0; i < m->num_packs; i++) { - if (m->packs[i]) + if (m->packs[i] && m->packs[i] != (void *)(intptr_t)-1) m->packs[i]->multi_pack_index = 0; } FREE_AND_NULL(m->packs); @@ -458,6 +458,8 @@ int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, pack_int_id = midx_for_pack(&m, pack_int_id); + if (m->packs[pack_int_id] == (void *)(intptr_t)-1) + return 1; if (m->packs[pack_int_id]) return 0; @@ -482,8 +484,10 @@ int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, strbuf_release(&pack_name); strbuf_release(&key); - if (!p) + if (!p) { + m->packs[pack_int_id] = (void *)(intptr_t)-1; return 1; + } p->multi_pack_index = 1; m->packs[pack_int_id] = p; -Peff [1] There probably are optimization opportunities in add_packed_git(). I don't think re-ordering will help much in the common case that we actually do have the pack. But really, most callers do not care about these auxiliary files at all! We could simply skip them during the initial setup and lazy-load them via accessor functions. I _think_ that should be OK with respect to races. For a newly added pack, we know they will always be in place before the matching .idx file (per the logic I outlined above). For a pack that goes away, we might racily fail to see its auxiliary file. But that is mostly true now (we might see its .idx, and then the .promisor file is deleted before we call access()). It does increase the size of that window, though (and in particular lets it happen even if the pack has actually been opened). I'm not sure how much that would matter in practice. OTOH, I'm not sure that saving a few access() calls would be all that meaningful, unless you have a zillion packs. And if you have a zillion packs, you're likely to get much bigger speedups in other areas by repacking them anyway. So it hasn't really been an area I've pursued before.