Re: [PATCH v2 8/8] repack: exclude cruft pack(s) from the MIDX where possible

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Apr 14, 2025 at 08:11:22PM -0700, Elijah Newren wrote:
> On Mon, Apr 14, 2025 at 1:06 PM Taylor Blau <me@xxxxxxxxxxxx> wrote:
> >
> > In ddee3703b3 (builtin/repack.c: add cruft packs to MIDX during
> > geometric repack, 2022-05-20), repack began adding cruft pack(s) to the
> > MIDX with '--write-midx' to ensure that the resulting MIDX was always
> > closed under reachability in order to generate reachability bitmaps.
> >
> > Suppose you have a once-unreachable object packed in a cruft pack, which
> > later on becomes reachable from one or more objects in a geometrically
> > repacked pack. That once-unreachable object *won't* appear in the new
> > pack, since the cruft pack was specified as neither included nor
> > excluded to 'pack-objects --stdin-packs'.
>
> I believe you are talking about the state before your series (i.e.,
> this is carrying on from the previous paragraph), but it reads as
> though you are talking about the state after the first seven patches
> of this series.  Some kind of connection wording to clarify would
> really help here.

Sure.

> > If the bitmap selection
> > process picks one or more commits which reach the once-unreachable
> > objects, commit ddee3703b3 ensures that the MIDX will be closed under
> > reachability. Without it, we would fail to generate a MIDX bitmap.
>
> After reading this part, I had to go back and re-read and figure out
> what point in time everything was referring to.

Yeah, this is confusing to me too after reading it back. I made some
tweaks that I think clarify things.

> > ddee3703b3 alludes to the fact that this is sub-optimal by saying
> >
> >     [...] it's desirable to avoid including cruft packs in the MIDX
> >     because it causes the MIDX to store a bunch of objects which are
> >     likely to get thrown away.
> >
> > , which is true, but hides an even larger problem. If repositories
> > rarely prune their unreachable objects and/or have many of them, the
> > MIDX must keep track of a large number of objects which bloats the MIDX
> > and slows down object lookup.
> >
> > This is doubly unfortunate because the vast majority of objects in cruft
> > pack(s) are unlikely to be read, but object reads that go through the
> > MIDX have to search through them anyway.
>
> "have to search through them"?  That could be read to suggest those
> individual objects are read, rather than just traversed over.  Maybe
> "...unlikely to be read, so the enlarged MIDX is for mostly tracking
> known-to-likely-be-irrelevant objects", or something like that?

Thanks for pointing out... I clarified this one as well.

> > This patch causes geometrically-repacked packs to contain a copy of any
> > once-unreachable object(s) with 'git pack-objects --stdin-packs=follow',
> > allowing us to avoid including any cruft packs in the MIDX. This is
> > because a sequence of geometrically-repacked packs that were all
> > generated with '--stdin-packs=follow' are guaranteed to have their union
> > be closed under reachability.
> >
> > Note that you cannot guarantee that a collection of packs is closed
> > under reachability if not all of them were generated with following as
>
> maybe: ...with "follow" as above.  "follow" or "following" feels like
> it needs quotes so the reader understands its meant as the name of a
> mode, rather than a verb in the sentence.
>
> > above. One tell-tale sign that not all geometrically-repacked packs in
> > the MIDX were generated with following is to see if there is a pack in
>
> same here with "following"...and below.

Great calls on both, thanks.

> > @@ -808,26 +886,55 @@ static void midx_included_packs(struct string_list *include,
> >                 }
> >         }
> >
> > -       for_each_string_list_item(item, &existing->cruft_packs) {
> > +       if (midx_must_contain_cruft ||
> > +           midx_has_unknown_packs(midx_pack_names, midx_pack_names_nr,
> > +                                  include, geometry, existing)) {
> >                 /*
> > -                * When doing a --geometric repack, there is no need to check
> > -                * for deleted packs, since we're by definition not doing an
> > -                * ALL_INTO_ONE repack (hence no packs will be deleted).
> > -                * Otherwise we must check for and exclude any packs which are
> > -                * enqueued for deletion.
> > +                * If there are one or more unknown pack(s) present (see
> > +                * midx_has_unknown_packs() for what makes a pack
> > +                * "unknown") in the MIDX before the repack, keep them
> > +                * as they may be required to form a reachability
> > +                * closure if the MIDX is bitmapped.
> >                  *
> > -                * So we could omit the conditional below in the --geometric
> > -                * case, but doing so is unnecessary since no packs are marked
> > -                * as pending deletion (since we only call
> > -                * `mark_packs_for_deletion()` when doing an all-into-one
> > -                * repack).
> > +                * For example, a cruft pack can be required to form a
> > +                * reachability closure if the MIDX is bitmapped and one
> > +                * or more of its selected commits reaches a once-cruft
> > +                * object that was later made reachable.
>
> The antecedent of "its" is unclear here; just spell it out to reduce
> how much thinking the reader needs to do?

Eek, good suggestion again. Thanks, I fixed it up and made the
antecedent explicit.

> >                  */
> > -               if (pack_is_marked_for_deletion(item))
> > -                       continue;
> > +               for_each_string_list_item(item, &existing->cruft_packs) {
> > +                       /*
> > +                        * When doing a --geometric repack, there is no
> > +                        * need to check for deleted packs, since we're
> > +                        * by definition not doing an ALL_INTO_ONE
> > +                        * repack (hence no packs will be deleted).
> > +                        * Otherwise we must check for and exclude any
> > +                        * packs which are enqueued for deletion.
> > +                        *
> > +                        * So we could omit the conditional below in the
> > +                        * --geometric case, but doing so is unnecessary
> > +                        *  since no packs are marked as pending
> > +                        *  deletion (since we only call
> > +                        *  `mark_packs_for_deletion()` when doing an
> > +                        *  all-into-one repack).
> > +                        */
> > +                       if (pack_is_marked_for_deletion(item))
> > +                               continue;
> >
> > -               strbuf_reset(&buf);
> > -               strbuf_addf(&buf, "%s.idx", item->string);
> > -               string_list_insert(include, buf.buf);
> > +                       strbuf_reset(&buf);
> > +                       strbuf_addf(&buf, "%s.idx", item->string);
> > +                       string_list_insert(include, buf.buf);
> > +               }
> > +       } else {
> > +               /*
> > +                * Modern versions of Git will write new copies of
> > +                * once-cruft objects when doing a --geometric repack.
>
> "Modern versions of Git" -> "Modern versions of Git with the
> appropriate config setting" ?

Heh. Great catch. Can you tell this part was written before I added the
configuration option? ;-)

Thanks,
Taylor




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux