Re: [PATCH v3 00/11] builtin/cat-file: allow filtering objects in batch mode

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Patrick Steinhardt <ps@xxxxxx> writes:

> Hi,
>
> at GitLab, we sometimes have the need to list all objects regardless of
> their reachability. We use git-cat-file(1) with `--batch-all-objects` to
> do this, and typically this is quite a good fit. In some cases though,
> we only want to list objects of a specific type, where we then basically
> have the following pipeline:
>
>     git cat-file --batch-all-objects --batch-check='%(objecttype) %(objectname)' |
>     grep '^commit ' |
>     cut -d' ' -f2 |
>     git cat-file --batch
>
> This works okayish in medium-sized repositories, but once you reach a
> certain size this isn't really an option anymore. In the Chromium
> repository for example [1] simply listing all objects in the first
> invocation of git-cat-file(1) takes around 80 to 100 seconds. The
> workload is completely I/O-bottlenecked: my machine reads at ~500MB/s,
> and the packfile is 50GB in size, which matches the 100 seconds that I
> observe.
>
> This series addresses the issue by introducing object filters into
> git-cat-file(1). These object filters use the exact same syntax as the
> filters we have in git-rev-list(1), but only a subset of them is
> supported because not all filters can be computed by git-cat-file(1).
> Supported are "blob:none", "blob:limit=" as well as "object:type=".
>
> The filters alone don't really help though: we still have to scan
> through the whole packfile in order to compute the packfiles. While we
> are able to shed a bit of CPU time because we can stop emitting some of
> the objects, we're still I/O-bottlenecked.
>
> The second part of the series thus expands the filters so that they can
> make use of bitmap indices for some of the filters, if available. This
> allows us to efficiently answer the question where to find all objects
> of a specific type, and thus we can avoid scanning through the packfile
> and instead directly look up relevant objects, leading to a significant
> speedup:
>
>     Benchmark 1: cat-file with filter=object:type=commit (revision = HEAD~)
>       Time (mean ± σ):     86.444 s ±  4.081 s    [User: 36.830 s, System: 11.312 s]
>       Range (min … max):   80.305 s … 93.104 s    10 runs
>
>     Benchmark 2: cat-file with filter=object:type=commit (revision = HEAD)
>       Time (mean ± σ):      2.089 s ±  0.015 s    [User: 1.872 s, System: 0.207 s]
>       Range (min … max):    2.073 s …  2.119 s    10 runs
>
>     Summary
>       cat-file with filter=object:type=commit (revision = HEAD) ran
>        41.38 ± 1.98 times faster than cat-file with filter=object:type=commit (revision = HEAD~)
>
> We now directly scale with the number of objects of a specific type
> contained in the packfile instead of scaling with the overall number of
> objects. It's quite fun to see how the math plays out: if you sum up the
> times for each of the types you arrive at the time for the unfiltered
> case.
>
> Changes in v2:
>   - The series is now built on top of "master" at 683c54c999c (Git 2.49,
>     2025-03-14) with "tb/incremental-midx-part-2" at 27afc272c49 (midx:
>     implement writing incremental MIDX bitmaps, 2025-03-20) merged into
>     it.
>   - Rename the filter options to "--filter=" to match
>     git-pack-objects(1).
>   - The bitmap-filtering is now reusing existing mechanisms that we
>     already have in "pack-bitmap.c", as proposed by Taylor.
>   - Link to v1: https://lore.kernel.org/r/20250221-pks-cat-file-object-type-filter-v1-0-0852530888e2@xxxxxx
>
> Changes in v3:
>   - Wrap some overly long lines.
>   - Better describe how filters interact with the different batch modes.
>   - Adapt the format with `--batch` and `--batch-check` so that we tell
>     the user that the object has been excluded.
>   - Add a test for "--no-filter".
>   - Use `OPT_PARSE_LIST_OBJECTS_FILTER()`.
>   - Link to v2: https://lore.kernel.org/r/20250327-pks-cat-file-object-type-filter-v2-0-4bbc7085d7c5@xxxxxx
>
> Thanks!
>
> Patrick
>
> [1]: https://github.com/chromium/chromium.git
>
> ---
> Patrick Steinhardt (11):
>       builtin/cat-file: rename variable that tracks usage
>       builtin/cat-file: introduce function to report object status
>       builtin/cat-file: wire up an option to filter objects
>       builtin/cat-file: support "blob:none" objects filter
>       builtin/cat-file: support "blob:limit=" objects filter
>       builtin/cat-file: support "object:type=" objects filter
>       pack-bitmap: allow passing payloads to `show_reachable_fn()`
>       pack-bitmap: add function to iterate over filtered bitmapped objects
>       pack-bitmap: introduce function to check whether a pack is bitmapped
>       builtin/cat-file: deduplicate logic to iterate over all objects
>       builtin/cat-file: use bitmaps to efficiently filter by object type
>
>  Documentation/git-cat-file.adoc |  26 ++++
>  builtin/cat-file.c              | 256 +++++++++++++++++++++++++++++-----------
>  builtin/pack-objects.c          |   3 +-
>  builtin/rev-list.c              |   3 +-
>  pack-bitmap.c                   |  81 +++++++++++--
>  pack-bitmap.h                   |  22 +++-
>  reachable.c                     |   3 +-
>  t/t1006-cat-file.sh             |  99 ++++++++++++++++
>  8 files changed, 411 insertions(+), 82 deletions(-)
>
> Range-diff versus v2:
>
>  1:  a75888e0bf4 !  1:  b0642b6c495 builtin/cat-file: rename variable that tracks usage
>     @@ builtin/cat-file.c: int cmd_cat_file(int argc,
>       		;
>       	else if (batch.follow_symlinks)
>      -		usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
>     -+		usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
>     - 			       "--follow-symlinks");
>     +-			       "--follow-symlinks");
>     ++		usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage,
>     ++			       options, "--follow-symlinks");
>       	else if (batch.buffer_output >= 0)
>      -		usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
>     -+		usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
>     - 			       "--buffer");
>     +-			       "--buffer");
>     ++		usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage,
>     ++			       options, "--buffer");
>       	else if (batch.all_objects)
>      -		usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
>     -+		usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
>     - 			       "--batch-all-objects");
>     +-			       "--batch-all-objects");
>     ++		usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage,
>     ++			       options, "--batch-all-objects");
>       	else if (input_nul_terminated)
>      -		usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
>     -+		usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
>     - 			       "-z");
>     +-			       "-z");
>     ++		usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage,
>     ++			       options, "-z");
>       	else if (nul_terminated)
>      -		usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
>     -+		usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
>     - 			       "-Z");
>     +-			       "-Z");
>     ++		usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage,
>     ++			       options, "-Z");
>
>       	batch.input_delim = batch.output_delim = '\n';
>     + 	if (input_nul_terminated)
>      @@ builtin/cat-file.c: int cmd_cat_file(int argc,
>       			batch.transform_mode = opt;
>       		else if (opt && opt != 'b')
>     @@ builtin/cat-file.c: int cmd_cat_file(int argc,
>      +				       builtin_catfile_usage, options, opt);
>       		else if (argc)
>      -			usage_msg_opt(_("batch modes take no arguments"), usage,
>     -+			usage_msg_opt(_("batch modes take no arguments"), builtin_catfile_usage,
>     - 				      options);
>     +-				      options);
>     ++			usage_msg_opt(_("batch modes take no arguments"),
>     ++				      builtin_catfile_usage, options);
>
>       		return batch_objects(&batch);
>     + 	}
>      @@ builtin/cat-file.c: int cmd_cat_file(int argc,
>       	if (opt) {
>       		if (!argc && opt == 'c')
>       			usage_msg_optf(_("<rev> required with '%s'"),
>      -				       usage, options, "--textconv");
>     -+				       builtin_catfile_usage, options, "--textconv");
>     ++				       builtin_catfile_usage, options,
>     ++				       "--textconv");
>       		else if (!argc && opt == 'w')
>       			usage_msg_optf(_("<rev> required with '%s'"),
>      -				       usage, options, "--filters");
>     -+				       builtin_catfile_usage, options, "--filters");
>     ++				       builtin_catfile_usage, options,
>     ++				       "--filters");
>       		else if (!argc && opt_epts)
>       			usage_msg_optf(_("<object> required with '-%c'"),
>      -				       usage, options, opt);
>     @@ builtin/cat-file.c: int cmd_cat_file(int argc,
>       			obj_name = argv[0];
>       		else
>      -			usage_msg_opt(_("too many arguments"), usage, options);
>     -+			usage_msg_opt(_("too many arguments"), builtin_catfile_usage, options);
>     ++			usage_msg_opt(_("too many arguments"), builtin_catfile_usage,
>     ++				      options);
>       	} else if (!argc) {
>      -		usage_with_options(usage, options);
>      +		usage_with_options(builtin_catfile_usage, options);
>  -:  ----------- >  2:  18353ba706d builtin/cat-file: introduce function to report object status
>  2:  bee9407c1a9 !  3:  1e46af5d07b builtin/cat-file: wire up an option to filter objects
>     @@ Documentation/git-cat-file.adoc: OPTIONS
>      +--filter=<filter-spec>::
>      +--no-filter::
>      +	Omit objects from the list of printed objects. This can only be used in
>     -+	combination with one of the batched modes. The '<filter-spec>' may be
>     -+	one of the following:
>     ++	combination with one of the batched modes. Excluded objects that have
>     ++	been explicitly requested via any of the batch modes that read objects
>     ++	via standard input (`--batch`, `--batch-check`) will be reported as
>     ++	"filtered". Excluded objects in `--batch-all-objects` mode will not be
>     ++	printed at all. No filters are supported yet.
>      +
>       --path=<path>::
>       	For use with `--textconv` or `--filters`, to allow specifying an object
>       	name and a path separately, e.g. when it is difficult to figure out
>     +@@ Documentation/git-cat-file.adoc: the repository, then `cat-file` will ignore any custom format and print:
>     + <object> SP missing LF
>     + ------------
>     +
>     ++If a name is specified on stdin that is filtered out via `--filter=`,
>     ++then `cat-file` will ignore any custom format and print:
>     ++
>     ++------------
>     ++<object> SP excluded LF
>     ++------------
>     ++
>     + If a name is specified that might refer to more than one object (an ambiguous short sha), then `cat-file` will ignore any custom format and print:
>     +
>     + ------------
>
>       ## builtin/cat-file.c ##
>      @@
>     @@ builtin/cat-file.c: int cmd_cat_file(int argc,
>       			    N_("run filters on object's content"), 'w'),
>       		OPT_STRING(0, "path", &force_path, N_("blob|tree"),
>       			   N_("use a <path> for (--textconv | --filters); Not with 'batch'")),
>     -+		OPT_CALLBACK(0, "filter", &batch.objects_filter, N_("args"),
>     -+			     N_("object filtering"), opt_parse_list_objects_filter),
>     ++		OPT_PARSE_LIST_OBJECTS_FILTER(&batch.objects_filter),
>       		OPT_END()
>       	};
>
>     @@ builtin/cat-file.c: int cmd_cat_file(int argc,
>       	if (opt == 'b')
>       		batch.all_objects = 1;
>      @@ builtin/cat-file.c: int cmd_cat_file(int argc,
>     - 			usage_msg_opt(_("batch modes take no arguments"), builtin_catfile_usage,
>     - 				      options);
>     + 			usage_msg_opt(_("batch modes take no arguments"),
>     + 				      builtin_catfile_usage, options);
>
>      -		return batch_objects(&batch);
>      +		ret = batch_objects(&batch);
>     @@ t/t1006-cat-file.sh: test_expect_success PERL '--batch-command info is unbuffere
>      +		test_cmp expect err
>      +	'
>      +done
>     ++
>     ++test_expect_success 'objects filter: disabled' '
>     ++	git -C repo cat-file --batch-check="%(objectname)" --batch-all-objects --no-filter >actual &&
>     ++	sort actual >actual.sorted &&
>     ++	git -C repo rev-list --objects --no-object-names --all >expect &&
>     ++	sort expect >expect.sorted &&
>     ++	test_cmp expect.sorted actual.sorted
>     ++'
>      +
>       test_done
>  3:  ec1d0c63de6 !  4:  878ae8e2a76 builtin/cat-file: support "blob:none" objects filter
>     @@ Commit message
>          Implement support for the "blob:none" filter in git-cat-file(1), which
>          causes us to omit all blobs.
>
>     +    Note that this new filter requires us to read the object type via
>     +    `oid_object_info_extended()` in `batch_object_write()`. But as we try to
>     +    optimize away reading objects from the database the `data->info.typep`
>     +    pointer may not be set. We thus have to adapt the logic to conditionally
>     +    set the pointer in cases where the filter is given.
>     +
>          Signed-off-by: Patrick Steinhardt <ps@xxxxxx>
>
>       ## Documentation/git-cat-file.adoc ##
>      @@ Documentation/git-cat-file.adoc: OPTIONS
>     - 	Omit objects from the list of printed objects. This can only be used in
>     - 	combination with one of the batched modes. The '<filter-spec>' may be
>     - 	one of the following:
>     + 	been explicitly requested via any of the batch modes that read objects
>     + 	via standard input (`--batch`, `--batch-check`) will be reported as
>     + 	"filtered". Excluded objects in `--batch-all-objects` mode will not be
>     +-	printed at all. No filters are supported yet.
>     ++	printed at all. The '<filter-spec>' may be one of the following:
>      ++
>      +The form '--filter=blob:none' omits all blobs.
>
>     @@ builtin/cat-file.c: static void batch_object_write(const char *obj_name,
>       		case LOFC_DISABLED:
>       			break;
>      +		case LOFC_BLOB_NONE:
>     -+			if (data->type == OBJ_BLOB)
>     ++			if (data->type == OBJ_BLOB) {
>     ++				if (!opt->all_objects)
>     ++					report_object_status(opt, obj_name,
>     ++							     &data->oid, "excluded");
>      +				return;
>     ++			}
>      +			break;
>       		default:
>       			BUG("unsupported objects filter");
>     @@ t/t1006-cat-file.sh: test_expect_success 'objects filter with unknown option' '
>       do
>       	test_expect_success "objects filter with unsupported option $option" '
>       		case "$option" in
>     -@@ t/t1006-cat-file.sh: do
>     - 	'
>     - done
>     +@@ t/t1006-cat-file.sh: test_expect_success 'objects filter: disabled' '
>     + 	test_cmp expect.sorted actual.sorted
>     + '
>
>      +test_objects_filter () {
>      +	filter="$1"
>     @@ t/t1006-cat-file.sh: do
>      +		sort expect >expect.sorted &&
>      +		test_cmp expect.sorted actual.sorted
>      +	'
>     ++
>     ++	test_expect_success "objects filter prints excluded objects: $filter" '
>     ++		# Find all objects that would be excluded by the current filter.
>     ++		git -C repo rev-list --objects --no-object-names --all >all &&
>     ++		git -C repo rev-list --objects --no-object-names --all --filter="$filter" --filter-provided-objects >filtered &&
>     ++		sort all >all.sorted &&
>     ++		sort filtered >filtered.sorted &&
>     ++		comm -23 all.sorted filtered.sorted >expected.excluded &&
>     ++		test_line_count -gt 0 expected.excluded &&
>     ++
>     ++		git -C repo cat-file --batch-check="%(objectname)" --filter="$filter" <expected.excluded >actual &&
>     ++		awk "/excluded/{ print \$1 }" actual | sort >actual.excluded &&
>     ++		test_cmp expected.excluded actual.excluded
>     ++	'
>      +}
>      +
>      +test_objects_filter "blob:none"
>  4:  a3ed054994d !  5:  a88d5d4b60a builtin/cat-file: support "blob:limit=" objects filter
>     @@ Commit message
>
>       ## Documentation/git-cat-file.adoc ##
>      @@ Documentation/git-cat-file.adoc: OPTIONS
>     - 	one of the following:
>     + 	printed at all. The '<filter-spec>' may be one of the following:
>       +
>       The form '--filter=blob:none' omits all blobs.
>      ++
>     @@ builtin/cat-file.c: static void batch_object_write(const char *obj_name,
>       		if (pack)
>       			ret = packed_object_info(the_repository, pack, offset,
>      @@ builtin/cat-file.c: static void batch_object_write(const char *obj_name,
>     - 			if (data->type == OBJ_BLOB)
>       				return;
>     + 			}
>       			break;
>      +		case LOFC_BLOB_LIMIT:
>      +			if (data->type == OBJ_BLOB &&
>     -+			    data->size >= opt->objects_filter.blob_limit_value)
>     ++			    data->size >= opt->objects_filter.blob_limit_value) {
>     ++				if (!opt->all_objects)
>     ++					report_object_status(opt, obj_name,
>     ++							     &data->oid, "excluded");
>      +				return;
>     ++			}
>      +			break;
>       		default:
>       			BUG("unsupported objects filter");
>     @@ t/t1006-cat-file.sh: test_objects_filter () {
>      +test_objects_filter "blob:limit=1"
>      +test_objects_filter "blob:limit=500"
>      +test_objects_filter "blob:limit=1000"
>     -+test_objects_filter "blob:limit=1g"
>     ++test_objects_filter "blob:limit=1k"
>
>       test_done
>  5:  8e39cd218c2 !  6:  13be54300c9 builtin/cat-file: support "object:type=" objects filter
>     @@ builtin/cat-file.c: static void batch_object_write(const char *obj_name,
>       		if (opt->objects_filter.choice == LOFC_BLOB_LIMIT)
>       			data->info.sizep = &data->size;
>      @@ builtin/cat-file.c: static void batch_object_write(const char *obj_name,
>     - 			    data->size >= opt->objects_filter.blob_limit_value)
>       				return;
>     + 			}
>       			break;
>      +		case LOFC_OBJECT_TYPE:
>     -+			if (data->type != opt->objects_filter.object_type)
>     ++			if (data->type != opt->objects_filter.object_type) {
>     ++				if (!opt->all_objects)
>     ++					report_object_status(opt, obj_name,
>     ++							     &data->oid, "excluded");
>      +				return;
>     ++			}
>      +			break;
>       		default:
>       			BUG("unsupported objects filter");
>     @@ t/t1006-cat-file.sh: test_expect_success 'objects filter with unknown option' '
>      @@ t/t1006-cat-file.sh: test_objects_filter "blob:limit=1"
>       test_objects_filter "blob:limit=500"
>       test_objects_filter "blob:limit=1000"
>     - test_objects_filter "blob:limit=1g"
>     + test_objects_filter "blob:limit=1k"
>      +test_objects_filter "object:type=blob"
>      +test_objects_filter "object:type=commit"
>      +test_objects_filter "object:type=tag"
>  6:  a0655de3ace =  7:  d525a5bc2ef pack-bitmap: allow passing payloads to `show_reachable_fn()`
>  7:  e1e44303dac =  8:  e3cc1ae3a87 pack-bitmap: add function to iterate over filtered bitmapped objects
>  8:  23bc040bb15 =  9:  c0fc0e4ce0c pack-bitmap: introduce function to check whether a pack is bitmapped
>  9:  4eba2a70619 = 10:  28ef93dceec builtin/cat-file: deduplicate logic to iterate over all objects
> 10:  d40f1924ef5 = 11:  842a6002c50 builtin/cat-file: use bitmaps to efficiently filter by object type
>

Thanks for the new version, the range-diff looks good. Good that you
also added a test for "excluded" message too.

> ---
> base-commit: 003c5f45b8447877015b2a23ceab2297638fe1f1
> change-id: 20250220-pks-cat-file-object-type-filter-9140c0ed5ee1

Attachment: signature.asc
Description: PGP signature


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux