[PATCH v2 0/8] repack: avoid MIDX'ing cruft pack(s) where possible

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Here is a non-RFC version of my series to explore creating MIDXs while
repacking that don't include the cruft pack.

The core idea behind this approach is to ensure that packs generated via
geometric repacking traverse through objects that appear in packs which
are neither included nor excluded. Then if some commit (for example) in
a pack reaches some once-unreachable object stored in a cruft pack, the
pack generated via geometric repacking will pick up and write a copy of
that object during its traversal.

If you repack consistently using this strategy, you can guarantee that
the union of geometrically-repacked packs are closed under reachability
without having to keep track of any cruft pack(s) in the MIDX.

This version has a couple of minor changes from the RFC:

  - Before using a designated initializer to setup a 'struct
    object_info', add a new preparatory commit to explain that such
    designated initializers rely on the default value for
    non-initialized fields to be zero'd.

  - Less cruft-pack specific reasoning for when repack can use this new
    mode (thanks to a helpful discussion with Peff while thinking
    through and talking about these changes).

  - A new 'repack.midxMustContainCruft' configuration knob to opt-in to
    this new behavior.

  - More readable (IMHO) test scripts.

I think this version is sufficiently ready for review. I'm going to
deploy a copy of this within GitHub's infrastructure and see how it
behaves on a single replica of an internal repository over a ~week and
report back.

Thanks in advance for any review in the meantime :-).

Taylor Blau (8):
  pack-objects: use standard option incompatibility functions
  object-store-ll.h: add note about designated initializers
  pack-objects: limit scope in 'add_object_entry_from_pack()'
  pack-objects: factor out handling '--stdin-packs'
  pack-objects: declare 'rev_info' for '--stdin-packs' earlier
  pack-objects: perform name-hash traversal for unpacked objects
  pack-objects: introduce '--stdin-packs=follow'
  repack: exclude cruft pack(s) from the MIDX where possible

 Documentation/config/repack.adoc    |   7 +
 Documentation/git-pack-objects.adoc |   8 +-
 builtin/pack-objects.c              | 193 +++++++++++++++++-----------
 builtin/repack.c                    | 162 ++++++++++++++++++++---
 object-store-ll.h                   |   8 ++
 t/t5331-pack-objects-stdin.sh       | 103 ++++++++++++++-
 t/t7704-repack-cruft.sh             |  90 +++++++++++++
 7 files changed, 478 insertions(+), 93 deletions(-)

Range-diff against v1:
1:  63fb4dab30 = 1:  65bc7e4630 pack-objects: use standard option incompatibility functions
-:  ---------- > 2:  920c91eb1e object-store-ll.h: add note about designated initializers
2:  6357633f6d = 3:  f8ac36b110 pack-objects: limit scope in 'add_object_entry_from_pack()'
3:  43e889b157 = 4:  5e03b482ba pack-objects: factor out handling '--stdin-packs'
4:  07a91be3ec = 5:  bccbac2ec5 pack-objects: declare 'rev_info' for '--stdin-packs' earlier
5:  241f7c87e5 = 6:  0bc2183dc3 pack-objects: perform name-hash traversal for unpacked objects
6:  a0318321ec = 7:  697a337cb1 pack-objects: introduce '--stdin-packs=follow'
7:  ef0bc38cf0 < -:  ---------- repack: keep track of existing MIDX'd packs
8:  19b69c1246 ! 8:  a2ec1b826c repack: exclude cruft pack(s) from the MIDX where possible
    @@ Commit message
         Note that you cannot guarantee that a collection of packs is closed
         under reachability if not all of them were generated with following as
         above. One tell-tale sign that not all geometrically-repacked packs in
    -    the MIDX were generated with following is to see if there is a cruft
    -    pack already in the MIDX.
    +    the MIDX were generated with following is to see if there is a pack in
    +    the existing MIDX that is not going to be somehow represented (either
    +    verbatim or as part of a geometric rollup) in the new MIDX.
     
         If there is, then starting to generate packs with following during
         geometric repacking won't work, since it's open to the same race as
    @@ Commit message
         under reachability.
     
         Detect when this is the case and avoid including cruft packs in the MIDX
    -    where possible.
    +    where possible. The existing behavior remains the default, and the new
    +    behavior is available with the config 'repack.midxMustIncludeCruft' set
    +    to 'false'.
     
         Signed-off-by: Taylor Blau <me@xxxxxxxxxxxx>
     
    + ## Documentation/config/repack.adoc ##
    +@@ Documentation/config/repack.adoc: repack.cruftThreads::
    + 	a cruft pack and the respective parameters are not given over
    + 	the command line. See similarly named `pack.*` configuration
    + 	variables for defaults and meaning.
    ++
    ++repack.midxMustContainCruft::
    ++	When set to true, linkgit:git-repack[1] will unconditionally include
    ++	cruft pack(s), if any, in the multi-pack index when invoked with
    ++	`--write-midx`. When false, cruft packs are only included in the MIDX
    ++	when necessary (e.g., because they might be required to form a
    ++	reachability closure with MIDX bitmaps). Defaults to true.
    +
      ## builtin/repack.c ##
    -@@ builtin/repack.c: static void pack_mark_in_midx(struct string_list_item *item)
    - 	item->util = (void*)((uintptr_t)item->util | PACK_IN_MIDX);
    +@@ builtin/repack.c: static int write_bitmaps = -1;
    + static int use_delta_islands;
    + static int run_update_server_info = 1;
    + static char *packdir, *packtmp_name, *packtmp;
    ++static int midx_must_contain_cruft = 1;
    + 
    + static const char *const git_repack_usage[] = {
    + 	N_("git repack [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]\n"
    +@@ builtin/repack.c: static int repack_config(const char *var, const char *value,
    + 		free(cruft_po_args->threads);
    + 		return git_config_string(&cruft_po_args->threads, var, value);
    + 	}
    ++	if (!strcmp(var, "repack.midxmustcontaincruft")) {
    ++		midx_must_contain_cruft = git_config_bool(var, value);
    ++		return 0;
    ++	}
    + 	return git_default_config(var, value, ctx, cb);
    + }
    + 
    +@@ builtin/repack.c: static void free_pack_geometry(struct pack_geometry *geometry)
    + 	free(geometry->pack);
      }
      
    -+static int pack_is_in_midx(struct string_list_item *item)
    ++static int midx_has_unknown_packs(char **midx_pack_names,
    ++				  size_t midx_pack_names_nr,
    ++				  struct string_list *include,
    ++				  struct pack_geometry *geometry,
    ++				  struct existing_packs *existing)
     +{
    -+	return (uintptr_t)item->util & PACK_IN_MIDX;
    -+}
    ++	size_t i;
     +
    -+static int existing_has_cruft_in_midx(struct existing_packs *existing)
    -+{
    -+	struct string_list_item *item;
    -+	for_each_string_list_item(item, &existing->cruft_packs) {
    -+		if (pack_is_in_midx(item))
    -+			return 1;
    ++	string_list_sort(include);
    ++
    ++	for (i = 0; i < midx_pack_names_nr; i++) {
    ++		const char *pack_name = midx_pack_names[i];
    ++
    ++		/*
    ++		 * Determine whether or not each MIDX'd pack from the existing
    ++		 * MIDX (if any) is represented in the new MIDX. For each pack
    ++		 * in the MIDX, it must either be:
    ++		 *
    ++		 *  - In the "include" list of packs to be included in the new
    ++		 *    MIDX. Note this function is called before the include
    ++		 *    list is populated with any cruft pack(s).
    ++		 *
    ++		 *  - Below the geometric split line (if using pack geometry),
    ++		 *    indicating that the pack won't be included in the new
    ++		 *    MIDX, but its contents were rolled up as part of the
    ++		 *    geometric repack.
    ++		 *
    ++		 *  - In the existing non-kept packs list (if not using pack
    ++		 *    geometry), and marked as non-deleted.
    ++		 */
    ++		if (string_list_has_string(include, pack_name)) {
    ++			continue;
    ++		} else if (geometry) {
    ++			struct strbuf buf = STRBUF_INIT;
    ++			uint32_t j;
    ++
    ++			for (j = 0; j < geometry->split; j++) {
    ++				strbuf_reset(&buf);
    ++				strbuf_addstr(&buf, pack_basename(geometry->pack[j]));
    ++				strbuf_strip_suffix(&buf, ".pack");
    ++				strbuf_addstr(&buf, ".idx");
    ++
    ++				if (!strcmp(pack_name, buf.buf)) {
    ++					strbuf_release(&buf);
    ++					break;
    ++				}
    ++			}
    ++
    ++			strbuf_release(&buf);
    ++
    ++			if (j < geometry->split)
    ++				continue;
    ++		} else {
    ++			struct string_list_item *item;
    ++
    ++			item = string_list_lookup(&existing->non_kept_packs,
    ++						  pack_name);
    ++			if (item && !pack_is_marked_for_deletion(item))
    ++				continue;
    ++		}
    ++
    ++		/*
    ++		 * If we got to this point, the MIDX includes some pack that we
    ++		 * don't know about.
    ++		 */
    ++		return 1;
     +	}
    ++
     +	return 0;
     +}
     +
    - static void mark_packs_for_deletion_1(struct string_list *names,
    - 				      struct string_list *list)
    + struct midx_snapshot_ref_data {
    + 	struct tempfile *f;
    + 	struct oidset seen;
    +@@ builtin/repack.c: static void midx_snapshot_refs(struct tempfile *f)
    + 
    + static void midx_included_packs(struct string_list *include,
    + 				struct existing_packs *existing,
    ++				char **midx_pack_names,
    ++				size_t midx_pack_names_nr,
    + 				struct string_list *names,
    + 				struct pack_geometry *geometry)
      {
     @@ builtin/repack.c: static void midx_included_packs(struct string_list *include,
      		}
      	}
      
     -	for_each_string_list_item(item, &existing->cruft_packs) {
    -+	if (existing_has_cruft_in_midx(existing)) {
    ++	if (midx_must_contain_cruft ||
    ++	    midx_has_unknown_packs(midx_pack_names, midx_pack_names_nr,
    ++				   include, geometry, existing)) {
      		/*
     -		 * When doing a --geometric repack, there is no need to check
     -		 * for deleted packs, since we're by definition not doing an
     -		 * ALL_INTO_ONE repack (hence no packs will be deleted).
     -		 * Otherwise we must check for and exclude any packs which are
     -		 * enqueued for deletion.
    -+		 * If we had one or more cruft pack(s) present in the
    -+		 * MIDX before the repack, keep them as they may be
    -+		 * required to form a reachability closure if the MIDX
    -+		 * is bitmapped.
    ++		 * If there are one or more unknown pack(s) present (see
    ++		 * midx_has_unknown_packs() for what makes a pack
    ++		 * "unknown") in the MIDX before the repack, keep them
    ++		 * as they may be required to form a reachability
    ++		 * closure if the MIDX is bitmapped.
      		 *
     -		 * So we could omit the conditional below in the --geometric
     -		 * case, but doing so is unnecessary since no packs are marked
     -		 * as pending deletion (since we only call
     -		 * `mark_packs_for_deletion()` when doing an all-into-one
     -		 * repack).
    -+		 * A cruft pack can be required to form a reachability
    -+		 * closure if the MIDX is bitmapped and one or more of
    -+		 * its selected commits reaches a once-cruft object that
    -+		 * was later made reachable.
    ++		 * For example, a cruft pack can be required to form a
    ++		 * reachability closure if the MIDX is bitmapped and one
    ++		 * or more of its selected commits reaches a once-cruft
    ++		 * object that was later made reachable.
      		 */
     -		if (pack_is_marked_for_deletion(item))
     -			continue;
    @@ builtin/repack.c: static void midx_included_packs(struct string_list *include,
      	}
      
      	strbuf_release(&buf);
    +@@ builtin/repack.c: int cmd_repack(int argc,
    + 	struct tempfile *refs_snapshot = NULL;
    + 	int i, ext, ret;
    + 	int show_progress;
    ++	char **midx_pack_names = NULL;
    ++	size_t midx_pack_names_nr = 0;
    + 
    + 	/* variables to be filled by option parsing */
    + 	int delete_redundant = 0;
     @@ builtin/repack.c: int cmd_repack(int argc,
      		    !(pack_everything & PACK_CRUFT))
      			strvec_push(&cmd.args, "--pack-loose-unreachable");
      	} else if (geometry.split_factor) {
     -		strvec_push(&cmd.args, "--stdin-packs");
    -+		if (existing_has_cruft_in_midx(&existing))
    ++		if (midx_must_contain_cruft)
     +			strvec_push(&cmd.args, "--stdin-packs");
     +		else
     +			strvec_push(&cmd.args, "--stdin-packs=follow");
      		strvec_push(&cmd.args, "--unpacked");
      	} else {
      		strvec_push(&cmd.args, "--unpacked");
    +@@ builtin/repack.c: int cmd_repack(int argc,
    + 
    + 	string_list_sort(&names);
    + 
    ++	if (get_local_multi_pack_index(the_repository)) {
    ++		uint32_t i;
    ++		struct multi_pack_index *m =
    ++			get_local_multi_pack_index(the_repository);
    ++
    ++		ALLOC_ARRAY(midx_pack_names, m->num_packs);
    ++		for (i = 0; i < m->num_packs; i++)
    ++			midx_pack_names[midx_pack_names_nr++] = xstrdup(m->pack_names[i]);
    ++	}
    ++
    + 	close_object_store(the_repository->objects);
    + 
    + 	/*
    +@@ builtin/repack.c: int cmd_repack(int argc,
    + 
    + 	if (write_midx) {
    + 		struct string_list include = STRING_LIST_INIT_DUP;
    +-		midx_included_packs(&include, &existing, &names, &geometry);
    ++		midx_included_packs(&include, &existing, midx_pack_names,
    ++				    midx_pack_names_nr, &names, &geometry);
    + 
    + 		ret = write_midx_included_packs(&include, &geometry, &names,
    + 						refs_snapshot ? get_tempfile_path(refs_snapshot) : NULL,
    +@@ builtin/repack.c: int cmd_repack(int argc,
    + 	string_list_clear(&names, 1);
    + 	existing_packs_release(&existing);
    + 	free_pack_geometry(&geometry);
    ++	for (size_t i = 0; i < midx_pack_names_nr; i++)
    ++		free(midx_pack_names[i]);
    ++	free(midx_pack_names);
    + 	pack_objects_args_release(&po_args);
    + 	pack_objects_args_release(&cruft_po_args);
    + 
     
      ## t/t7704-repack-cruft.sh ##
     @@ t/t7704-repack-cruft.sh: test_expect_success 'cruft repack respects --quiet' '
      	)
      '
      
    -+test_expect_success 'repack --write-midx excludes cruft where possible' '
    -+	git init exclude-cruft-when-possible &&
    ++setup_cruft_exclude_tests() {
    ++	git init "$1" &&
     +	(
    -+		cd exclude-cruft-when-possible &&
    ++		cd "$1" &&
    ++
    ++		git config repack.midxMustContainCruft false &&
     +
     +		test_commit one &&
     +
    @@ t/t7704-repack-cruft.sh: test_expect_success 'cruft repack respects --quiet' '
     +		test_commit --no-tag three &&
     +		three="$(git rev-parse HEAD)" &&
     +		git reset --hard one &&
    -+
     +		git reflog expire --all --expire=all &&
     +
    -+		git repack --cruft -d &&
    -+		ls $packdir/pack-*.idx | sort >packs.before &&
    ++		GIT_TEST_MULTI_PACK_INDEX=0 git repack --cruft -d &&
     +
     +		git merge $two &&
    -+		test_commit four &&
    ++		test_commit four
    ++	)
    ++}
    ++
    ++test_expect_success 'repack --write-midx excludes cruft where possible' '
    ++	setup_cruft_exclude_tests exclude-cruft-when-possible &&
    ++	(
    ++		cd exclude-cruft-when-possible &&
    ++
    ++		GIT_TEST_MULTI_PACK_INDEX=0 \
     +		git repack -d --geometric=2 --write-midx --write-bitmap-index &&
    -+		ls $packdir/pack-*.idx | sort >packs.after &&
     +
    -+		comm -13 packs.before packs.after >packs.new &&
    -+		test_line_count = 1 packs.new &&
    ++		test-tool read-midx --show-objects $objdir >midx &&
    ++		cruft="$(ls $packdir/*.mtimes)" &&
    ++		test_grep ! "$(basename "$cruft" .mtimes).idx" midx &&
     +
    -+		git rev-list --objects --no-object-names one..four >expect.raw &&
    -+		sort expect.raw >expect &&
    ++		git rev-list --all --objects --no-object-names >reachable.raw &&
    ++		sort reachable.raw >reachable.objects &&
    ++		awk "/\.pack$/ { print \$1 }" <midx | sort >midx.objects &&
     +
    -+		git show-index <$(cat packs.new) >actual.raw &&
    -+		cut -d" " -f2 actual.raw | sort >actual &&
    ++		test_cmp reachable.objects midx.objects
    ++	)
    ++'
     +
    -+		test_cmp expect actual &&
    ++test_expect_success 'repack --write-midx includes cruft when instructed' '
    ++	setup_cruft_exclude_tests exclude-cruft-when-instructed &&
    ++	(
    ++		cd exclude-cruft-when-instructed &&
     +
    -+		test-tool read-midx --show-objects $objdir >actual.raw &&
    -+		grep "\.pack$" actual.raw | cut -d" " -f1 | sort >actual.objects &&
    -+		git rev-list --objects --no-object-names HEAD >expect.raw &&
    -+		sort expect.raw >expect.objects &&
    ++		GIT_TEST_MULTI_PACK_INDEX=0 \
    ++		git -c repack.midxMustContainCruft=true repack \
    ++			-d --geometric=2 --write-midx --write-bitmap-index &&
     +
    -+		test_cmp expect.objects actual.objects &&
    ++		test-tool read-midx --show-objects $objdir >midx &&
    ++		cruft="$(ls $packdir/*.mtimes)" &&
    ++		test_grep "$(basename "$cruft" .mtimes).idx" midx &&
     +
    -+		cruft="$(basename $(ls $packdir/*.mtimes))" &&
    -+		grep "^pack-" actual.raw >actual.packs &&
    -+		! test_grep "${cruft%.mtimes}.idx" actual.packs
    ++		git cat-file --batch-check="%(objectname)" --batch-all-objects \
    ++			>all.objects &&
    ++		awk "/\.pack$/ { print \$1 }" <midx | sort >midx.objects &&
    ++
    ++		test_cmp all.objects midx.objects
     +	)
     +'
     +
     +test_expect_success 'repack --write-midx includes cruft when necessary' '
    ++	setup_cruft_exclude_tests exclude-cruft-when-necessary &&
     +	(
    -+		cd exclude-cruft-when-possible &&
    ++		cd exclude-cruft-when-necessary &&
     +
    ++		test_path_is_file $(ls $packdir/pack-*.mtimes) &&
     +		ls $packdir/pack-*.idx | sort >packs.all &&
     +		grep -o "pack-.*\.idx$" packs.all >in &&
     +
     +		git multi-pack-index write --stdin-packs --bitmap <in &&
     +
     +		test_commit five &&
    ++		GIT_TEST_MULTI_PACK_INDEX=0 \
     +		git repack -d --geometric=2 --write-midx --write-bitmap-index &&
     +
    -+		test-tool read-midx --show-objects $objdir >actual.raw &&
    -+		grep "\.pack$" actual.raw | cut -d" " -f1 | sort >actual.objects &&
    ++		test-tool read-midx --show-objects $objdir >midx &&
    ++		awk "/\.pack$/ { print \$1 }" <midx | sort >midx.objects &&
     +		git cat-file --batch-all-objects --batch-check="%(objectname)" \
     +			>expect.objects &&
    -+		test_cmp expect.objects actual.objects &&
    ++		test_cmp expect.objects midx.objects &&
     +
    -+		grep "^pack-" actual.raw >actual.packs &&
    -+		test_line_count = "$(($(wc -l <packs.all) + 1))" actual.packs
    ++		grep "^pack-" midx >midx.packs &&
    ++		test_line_count = "$(($(wc -l <packs.all) + 1))" midx.packs
     +	)
     +'
     +

base-commit: 485f5f863615e670fd97ae40af744e14072cfe18
-- 
2.49.0.229.gc267761125.dirty




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux