[PATCH v5 0/9] repack: avoid MIDX'ing cruft pack(s) where possible

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



(Junio: I rebased this onto the tip of current 'master', which is
f9aa0eedb3 (Start 2.51 cycle, the first batch, 2025-06-17) at the time
of writing.)

Here is another small reroll of my series to create MIDXs that do not
include a repository's cruft pack(s).

The bulk of the series is unchanged, save for a couple of points I'll
discuss below. As usual, a complete-range diff is included below for
convenience. There are two tiny but important changes that I included
here as a result of rolling this out to GitHub's infrastructure. They
are:

 - Tolerating missing objects in the follow-on traversal. When missing
   an object in a non "follow" `--stdin-packs` traversal, we didn't
   notice it because that traversal does not add discovered objects to
   the packing list. That is not the case for `--stdin-packs=follow`
   traversals which do. This has been corrected, and is explained in
   detail in the penultimate commit.

 - The behavior when writing a MIDX after a noop repack. There are
   some cases here where we may need to include the cruft pack in the
   MIDX depending on how the existing packs are structured. This is
   corrected and explained in more detail in the final commit.

There is an additional change that Carlo Marcelo Arenas Belón pointed
out[1] that drops the non-portable "grep -o" invocations added by the
previous round.

I promised in the last round that I'd report any interesting findings
after deploying this to GitHub's infrastructure. Here they are:

 - On unreachable object-heavy repositories, I was able to measure
   about a ~5% performance improvement across many operations.

 - GitHub's average time to repack a repository went down by about
   ~20% as a result of rolling this change out everywhere.

Thanks in advance for any review :-).

[1]: <lfrgmt2ukanevmcctzsnc422iv2l2nb3qmiddpsj6jnyvz4m4s@5eohhsm6knw3>

Taylor Blau (9):
  pack-objects: use standard option incompatibility functions
  pack-objects: limit scope in 'add_object_entry_from_pack()'
  pack-objects: factor out handling '--stdin-packs'
  pack-objects: declare 'rev_info' for '--stdin-packs' earlier
  pack-objects: perform name-hash traversal for unpacked objects
  pack-objects: fix typo in 'show_object_pack_hint()'
  pack-objects: swap 'show_{object,commit}_pack_hint'
  pack-objects: introduce '--stdin-packs=follow'
  repack: exclude cruft pack(s) from the MIDX where possible

 Documentation/config/repack.adoc    |   7 +
 Documentation/git-pack-objects.adoc |  10 +-
 builtin/pack-objects.c              | 193 ++++++++++++++++++----------
 builtin/repack.c                    | 184 +++++++++++++++++++++++---
 t/t5331-pack-objects-stdin.sh       | 122 +++++++++++++++++-
 t/t7704-repack-cruft.sh             | 145 +++++++++++++++++++++
 6 files changed, 570 insertions(+), 91 deletions(-)

Range-diff against v4:
 1:  f8b31c6a8d =  1:  ebaf47262a pack-objects: use standard option incompatibility functions
 2:  2753e29648 =  2:  eaa1f41b25 pack-objects: limit scope in 'add_object_entry_from_pack()'
 3:  32b49d9073 =  3:  8d0492a80d pack-objects: factor out handling '--stdin-packs'
 4:  a797ff3a83 =  4:  3d5d3b78b2 pack-objects: declare 'rev_info' for '--stdin-packs' earlier
 5:  29bf05633a =  5:  2a3676cb86 pack-objects: perform name-hash traversal for unpacked objects
 6:  0696fa1736 =  6:  bcbce75695 pack-objects: fix typo in 'show_object_pack_hint()'
 7:  1cc45b4472 =  7:  c8cf316c50 pack-objects: swap 'show_{object,commit}_pack_hint'
 8:  3e3d929bd0 !  8:  b81b6213e8 pack-objects: introduce '--stdin-packs=follow'
    @@ Commit message
         copy in the cruft pack, otherwise we cannot generate reachability
         bitmaps for any commits which reach that object.
     
    +    Note that the traversal here is best-effort, similar to the existing
    +    traversal which provides name-hash hints. This means that the object
    +    traversal may hand us back a blob that does not actually exist. We
    +    *won't* see missing trees/commits with 'ignore_missing_links' because:
    +
    +     - missing commit parents are discarded at the commit traversal stage by
    +       revision.c::process_parents()
    +
    +     - missing tag objects are discarded by revision.c::handle_commit()
    +
    +     - missing tree objects are discarded by the list-objects code in
    +       list-objects.c::process_tree()
    +
    +    But we have to handle potentially-missing blobs specially by making a
    +    separate check to ensure they exist in the repository. Failing to do so
    +    would mean that we'd add an object to the packing list which doesn't
    +    actually exist, rendering us unable to write out the pack.
    +
         This prepares us for new repacking behavior which will "resurrect"
         objects found in cruft or otherwise unspecified packs when generating
         new packs. In the context of geometric repacking, this may be used to
    @@ builtin/pack-objects.c: static int add_object_entry_from_pack(const struct objec
     -		return;
     +	enum stdin_packs_mode mode = *(enum stdin_packs_mode *)data;
     +	if (mode == STDIN_PACKS_MODE_FOLLOW) {
    ++		if (object->type == OBJ_BLOB && !has_object(the_repository,
    ++							    &object->oid, 0))
    ++			return;
     +		add_object_entry(&object->oid, object->type, name, 0);
     +	} else {
     +		struct object_entry *oe = packlist_find(&to_pack, &object->oid);
    @@ t/t5331-pack-objects-stdin.sh: test_expect_success 'pack-objects --stdin with pa
      	test_cmp expected-objects actual-objects
      '
      
    -+packdir=.git/objects/pack
    ++objdir=.git/objects
    ++packdir=$objdir/pack
     +
     +objects_in_packs () {
     +	for p in "$@"
    @@ t/t5331-pack-objects-stdin.sh: test_expect_success 'pack-objects --stdin with pa
     +		test_cmp expect actual
     +	)
     +'
    ++
    ++stdin_packs__follow_with_only () {
    ++	rm -fr stdin_packs__follow_with_only &&
    ++	git init stdin_packs__follow_with_only &&
    ++	(
    ++		cd stdin_packs__follow_with_only &&
    ++
    ++		test_commit A &&
    ++		test_commit B &&
    ++
    ++		git rev-parse "$@" >B.objects &&
    ++
    ++		echo A | git pack-objects --revs $packdir/pack &&
    ++		B="$(git pack-objects $packdir/pack <B.objects)" &&
    ++
    ++		git cat-file --batch-check="%(objectname)" --batch-all-objects >objs &&
    ++		for obj in $(cat objs)
    ++		do
    ++			rm -f $objdir/$(test_oid_to_path $obj) || return 1
    ++		done &&
    ++
    ++		( cd $packdir && ls pack-*.pack ) >in &&
    ++		git pack-objects --stdin-packs=follow --stdout >/dev/null <in
    ++	)
    ++}
    ++
    ++test_expect_success '--stdin-packs=follow tolerates missing blobs' '
    ++	stdin_packs__follow_with_only HEAD HEAD^{tree}
    ++'
    ++
    ++test_expect_success '--stdin-packs=follow tolerates missing trees' '
    ++	stdin_packs__follow_with_only HEAD HEAD:B.t
    ++'
    ++
    ++test_expect_success '--stdin-packs=follow tolerates missing commits' '
    ++	stdin_packs__follow_with_only HEAD HEAD^{tree}
    ++'
     +
      test_done
 9:  52a069ef48 !  9:  6487001f64 repack: exclude cruft pack(s) from the MIDX where possible
    @@ Commit message
         of subsequently generated packs from geometric repacking *is* closed
         under reachability.
     
    +    (One exception here is when "starting from scratch" results in a noop
    +    repack, e.g., because the non-cruft pack(s) in a repository already form
    +    a geometric progression. Since we can't tell whether or not those were
    +    generated with '--stdin-packs=follow', they may depend on
    +    once-unreachable objects, so we have to include the cruft pack in the
    +    MIDX in this case.)
    +
         Detect when this is the case and avoid including cruft packs in the MIDX
         where possible. The existing behavior remains the default, and the new
         behavior is available with the config 'repack.midxMustIncludeCruft' set
    @@ builtin/repack.c: int cmd_repack(int argc,
      	} else {
      		strvec_push(&cmd.args, "--unpacked");
     @@ builtin/repack.c: int cmd_repack(int argc,
    + 	if (ret)
    + 		goto cleanup;
    + 
    +-	if (!names.nr && !po_args.quiet)
    +-		printf_ln(_("Nothing new to pack."));
    ++	if (!names.nr) {
    ++		if (!po_args.quiet)
    ++			printf_ln(_("Nothing new to pack."));
    ++		/*
    ++		 * If we didn't write any new packs, the non-cruft packs
    ++		 * may refer to once-unreachable objects in the cruft
    ++		 * pack(s).
    ++		 *
    ++		 * If there isn't already a MIDX, the one we write
    ++		 * must include the cruft pack(s), in case the
    ++		 * non-cruft pack(s) refer to once-cruft objects.
    ++		 *
    ++		 * If there is already a MIDX, we can punt here, since
    ++		 * midx_has_unknown_packs() will make the decision for
    ++		 * us.
    ++		 */
    ++		if (!get_local_multi_pack_index(the_repository))
    ++			midx_must_contain_cruft = 1;
    ++	}
    + 
    + 	if (pack_everything & PACK_CRUFT) {
    + 		const char *pack_prefix = find_pack_prefix(packdir, packtmp);
    +@@ builtin/repack.c: int cmd_repack(int argc,
      
      	string_list_sort(&names);
      
    @@ t/t7704-repack-cruft.sh: test_expect_success 'cruft repack respects --quiet' '
     +		cd exclude-cruft-when-necessary &&
     +
     +		test_path_is_file $(ls $packdir/pack-*.mtimes) &&
    -+		ls $packdir/pack-*.idx | sort >packs.all &&
    -+		grep -o "pack-.*\.idx$" packs.all >in &&
    -+
    -+		git multi-pack-index write --stdin-packs --bitmap <in &&
    ++		( cd $packdir && ls pack-*.idx ) | sort >packs.all &&
    ++		git multi-pack-index write --stdin-packs --bitmap <packs.all &&
     +
     +		test_commit five &&
     +		GIT_TEST_MULTI_PACK_INDEX=0 \
    @@ t/t7704-repack-cruft.sh: test_expect_success 'cruft repack respects --quiet' '
     +		test_line_count = "$(($(wc -l <packs.all) + 1))" midx.packs
     +	)
     +'
    ++
    ++test_expect_success 'repack --write-midx includes cruft when already geometric' '
    ++	git init repack--write-midx-geometric-noop &&
    ++	(
    ++		cd repack--write-midx-geometric-noop &&
    ++
    ++		git branch -M main &&
    ++		test_commit A &&
    ++		test_commit B &&
    ++
    ++		git checkout -B side &&
    ++		test_commit --no-tag C &&
    ++		C="$(git rev-parse HEAD)" &&
    ++
    ++		git checkout main &&
    ++		git branch -D side &&
    ++		git reflog expire --all --expire=all &&
    ++
    ++		# At this point we have two packs: one containing the
    ++		# objects belonging to commits A and B, and another
    ++		# (cruft) pack containing the objects belonging to
    ++		# commit C.
    ++		git repack --cruft -d &&
    ++
    ++		# Create a third pack which contains a merge commit
    ++		# making commit C reachable again.
    ++		#
    ++		# --no-ff is important here, as it ensures that we
    ++		# actually write a new object and subsequently a new
    ++		# pack to contain it.
    ++		git merge --no-ff $C &&
    ++		git repack -d &&
    ++
    ++		ls $packdir/pack-*.idx | sort >packs.all &&
    ++		cruft="$(ls $packdir/pack-*.mtimes)" &&
    ++		cruft="${cruft%.mtimes}.idx" &&
    ++
    ++		for idx in $(grep -v $cruft <packs.all)
    ++		do
    ++			git show-index <$idx >out &&
    ++			wc -l <out || return 1
    ++		done >sizes.raw &&
    ++
    ++		# Make sure that there are two non-cruft packs, and
    ++		# that one of them contains at least twice as many
    ++		# objects as the other, ensuring that they are already
    ++		# in a geometric progression.
    ++		sort -n sizes.raw >sizes &&
    ++		test_line_count = 2 sizes &&
    ++		s1=$(head -n 1 sizes) &&
    ++		s2=$(tail -n 1 sizes) &&
    ++		test "$s2" -gt "$((2 * $s1))" &&
    ++
    ++		git -c repack.midxMustContainCruft=false repack --geometric=2 \
    ++			--write-midx --write-bitmap-index
    ++	)
    ++'
     +
      test_done

base-commit: f9aa0eedb37eb94d9d3711ef0d565fd7cb3b6148
-- 
2.50.0.61.gf819b10624.dirty




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux