[PATCH v3 00/11] builtin/cat-file: allow filtering objects in batch mode

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi,

at GitLab, we sometimes have the need to list all objects regardless of
their reachability. We use git-cat-file(1) with `--batch-all-objects` to
do this, and typically this is quite a good fit. In some cases though,
we only want to list objects of a specific type, where we then basically
have the following pipeline:

    git cat-file --batch-all-objects --batch-check='%(objecttype) %(objectname)' |
    grep '^commit ' |
    cut -d' ' -f2 |
    git cat-file --batch

This works okayish in medium-sized repositories, but once you reach a
certain size this isn't really an option anymore. In the Chromium
repository for example [1] simply listing all objects in the first
invocation of git-cat-file(1) takes around 80 to 100 seconds. The
workload is completely I/O-bottlenecked: my machine reads at ~500MB/s,
and the packfile is 50GB in size, which matches the 100 seconds that I
observe.

This series addresses the issue by introducing object filters into
git-cat-file(1). These object filters use the exact same syntax as the
filters we have in git-rev-list(1), but only a subset of them is
supported because not all filters can be computed by git-cat-file(1).
Supported are "blob:none", "blob:limit=" as well as "object:type=".

The filters alone don't really help though: we still have to scan
through the whole packfile in order to compute the packfiles. While we
are able to shed a bit of CPU time because we can stop emitting some of
the objects, we're still I/O-bottlenecked.

The second part of the series thus expands the filters so that they can
make use of bitmap indices for some of the filters, if available. This
allows us to efficiently answer the question where to find all objects
of a specific type, and thus we can avoid scanning through the packfile
and instead directly look up relevant objects, leading to a significant
speedup:

    Benchmark 1: cat-file with filter=object:type=commit (revision = HEAD~)
      Time (mean ± σ):     86.444 s ±  4.081 s    [User: 36.830 s, System: 11.312 s]
      Range (min … max):   80.305 s … 93.104 s    10 runs

    Benchmark 2: cat-file with filter=object:type=commit (revision = HEAD)
      Time (mean ± σ):      2.089 s ±  0.015 s    [User: 1.872 s, System: 0.207 s]
      Range (min … max):    2.073 s …  2.119 s    10 runs

    Summary
      cat-file with filter=object:type=commit (revision = HEAD) ran
       41.38 ± 1.98 times faster than cat-file with filter=object:type=commit (revision = HEAD~)

We now directly scale with the number of objects of a specific type
contained in the packfile instead of scaling with the overall number of
objects. It's quite fun to see how the math plays out: if you sum up the
times for each of the types you arrive at the time for the unfiltered
case.

Changes in v2:
  - The series is now built on top of "master" at 683c54c999c (Git 2.49,
    2025-03-14) with "tb/incremental-midx-part-2" at 27afc272c49 (midx:
    implement writing incremental MIDX bitmaps, 2025-03-20) merged into
    it.
  - Rename the filter options to "--filter=" to match
    git-pack-objects(1).
  - The bitmap-filtering is now reusing existing mechanisms that we
    already have in "pack-bitmap.c", as proposed by Taylor.
  - Link to v1: https://lore.kernel.org/r/20250221-pks-cat-file-object-type-filter-v1-0-0852530888e2@xxxxxx

Changes in v3:
  - Wrap some overly long lines.
  - Better describe how filters interact with the different batch modes.
  - Adapt the format with `--batch` and `--batch-check` so that we tell
    the user that the object has been excluded.
  - Add a test for "--no-filter".
  - Use `OPT_PARSE_LIST_OBJECTS_FILTER()`.
  - Link to v2: https://lore.kernel.org/r/20250327-pks-cat-file-object-type-filter-v2-0-4bbc7085d7c5@xxxxxx

Thanks!

Patrick

[1]: https://github.com/chromium/chromium.git

---
Patrick Steinhardt (11):
      builtin/cat-file: rename variable that tracks usage
      builtin/cat-file: introduce function to report object status
      builtin/cat-file: wire up an option to filter objects
      builtin/cat-file: support "blob:none" objects filter
      builtin/cat-file: support "blob:limit=" objects filter
      builtin/cat-file: support "object:type=" objects filter
      pack-bitmap: allow passing payloads to `show_reachable_fn()`
      pack-bitmap: add function to iterate over filtered bitmapped objects
      pack-bitmap: introduce function to check whether a pack is bitmapped
      builtin/cat-file: deduplicate logic to iterate over all objects
      builtin/cat-file: use bitmaps to efficiently filter by object type

 Documentation/git-cat-file.adoc |  26 ++++
 builtin/cat-file.c              | 256 +++++++++++++++++++++++++++++-----------
 builtin/pack-objects.c          |   3 +-
 builtin/rev-list.c              |   3 +-
 pack-bitmap.c                   |  81 +++++++++++--
 pack-bitmap.h                   |  22 +++-
 reachable.c                     |   3 +-
 t/t1006-cat-file.sh             |  99 ++++++++++++++++
 8 files changed, 411 insertions(+), 82 deletions(-)

Range-diff versus v2:

 1:  a75888e0bf4 !  1:  b0642b6c495 builtin/cat-file: rename variable that tracks usage
    @@ builtin/cat-file.c: int cmd_cat_file(int argc,
      		;
      	else if (batch.follow_symlinks)
     -		usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
    -+		usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
    - 			       "--follow-symlinks");
    +-			       "--follow-symlinks");
    ++		usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage,
    ++			       options, "--follow-symlinks");
      	else if (batch.buffer_output >= 0)
     -		usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
    -+		usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
    - 			       "--buffer");
    +-			       "--buffer");
    ++		usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage,
    ++			       options, "--buffer");
      	else if (batch.all_objects)
     -		usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
    -+		usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
    - 			       "--batch-all-objects");
    +-			       "--batch-all-objects");
    ++		usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage,
    ++			       options, "--batch-all-objects");
      	else if (input_nul_terminated)
     -		usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
    -+		usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
    - 			       "-z");
    +-			       "-z");
    ++		usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage,
    ++			       options, "-z");
      	else if (nul_terminated)
     -		usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
    -+		usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
    - 			       "-Z");
    +-			       "-Z");
    ++		usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage,
    ++			       options, "-Z");
      
      	batch.input_delim = batch.output_delim = '\n';
    + 	if (input_nul_terminated)
     @@ builtin/cat-file.c: int cmd_cat_file(int argc,
      			batch.transform_mode = opt;
      		else if (opt && opt != 'b')
    @@ builtin/cat-file.c: int cmd_cat_file(int argc,
     +				       builtin_catfile_usage, options, opt);
      		else if (argc)
     -			usage_msg_opt(_("batch modes take no arguments"), usage,
    -+			usage_msg_opt(_("batch modes take no arguments"), builtin_catfile_usage,
    - 				      options);
    +-				      options);
    ++			usage_msg_opt(_("batch modes take no arguments"),
    ++				      builtin_catfile_usage, options);
      
      		return batch_objects(&batch);
    + 	}
     @@ builtin/cat-file.c: int cmd_cat_file(int argc,
      	if (opt) {
      		if (!argc && opt == 'c')
      			usage_msg_optf(_("<rev> required with '%s'"),
     -				       usage, options, "--textconv");
    -+				       builtin_catfile_usage, options, "--textconv");
    ++				       builtin_catfile_usage, options,
    ++				       "--textconv");
      		else if (!argc && opt == 'w')
      			usage_msg_optf(_("<rev> required with '%s'"),
     -				       usage, options, "--filters");
    -+				       builtin_catfile_usage, options, "--filters");
    ++				       builtin_catfile_usage, options,
    ++				       "--filters");
      		else if (!argc && opt_epts)
      			usage_msg_optf(_("<object> required with '-%c'"),
     -				       usage, options, opt);
    @@ builtin/cat-file.c: int cmd_cat_file(int argc,
      			obj_name = argv[0];
      		else
     -			usage_msg_opt(_("too many arguments"), usage, options);
    -+			usage_msg_opt(_("too many arguments"), builtin_catfile_usage, options);
    ++			usage_msg_opt(_("too many arguments"), builtin_catfile_usage,
    ++				      options);
      	} else if (!argc) {
     -		usage_with_options(usage, options);
     +		usage_with_options(builtin_catfile_usage, options);
 -:  ----------- >  2:  18353ba706d builtin/cat-file: introduce function to report object status
 2:  bee9407c1a9 !  3:  1e46af5d07b builtin/cat-file: wire up an option to filter objects
    @@ Documentation/git-cat-file.adoc: OPTIONS
     +--filter=<filter-spec>::
     +--no-filter::
     +	Omit objects from the list of printed objects. This can only be used in
    -+	combination with one of the batched modes. The '<filter-spec>' may be
    -+	one of the following:
    ++	combination with one of the batched modes. Excluded objects that have
    ++	been explicitly requested via any of the batch modes that read objects
    ++	via standard input (`--batch`, `--batch-check`) will be reported as
    ++	"filtered". Excluded objects in `--batch-all-objects` mode will not be
    ++	printed at all. No filters are supported yet.
     +
      --path=<path>::
      	For use with `--textconv` or `--filters`, to allow specifying an object
      	name and a path separately, e.g. when it is difficult to figure out
    +@@ Documentation/git-cat-file.adoc: the repository, then `cat-file` will ignore any custom format and print:
    + <object> SP missing LF
    + ------------
    + 
    ++If a name is specified on stdin that is filtered out via `--filter=`,
    ++then `cat-file` will ignore any custom format and print:
    ++
    ++------------
    ++<object> SP excluded LF
    ++------------
    ++
    + If a name is specified that might refer to more than one object (an ambiguous short sha), then `cat-file` will ignore any custom format and print:
    + 
    + ------------
     
      ## builtin/cat-file.c ##
     @@
    @@ builtin/cat-file.c: int cmd_cat_file(int argc,
      			    N_("run filters on object's content"), 'w'),
      		OPT_STRING(0, "path", &force_path, N_("blob|tree"),
      			   N_("use a <path> for (--textconv | --filters); Not with 'batch'")),
    -+		OPT_CALLBACK(0, "filter", &batch.objects_filter, N_("args"),
    -+			     N_("object filtering"), opt_parse_list_objects_filter),
    ++		OPT_PARSE_LIST_OBJECTS_FILTER(&batch.objects_filter),
      		OPT_END()
      	};
      
    @@ builtin/cat-file.c: int cmd_cat_file(int argc,
      	if (opt == 'b')
      		batch.all_objects = 1;
     @@ builtin/cat-file.c: int cmd_cat_file(int argc,
    - 			usage_msg_opt(_("batch modes take no arguments"), builtin_catfile_usage,
    - 				      options);
    + 			usage_msg_opt(_("batch modes take no arguments"),
    + 				      builtin_catfile_usage, options);
      
     -		return batch_objects(&batch);
     +		ret = batch_objects(&batch);
    @@ t/t1006-cat-file.sh: test_expect_success PERL '--batch-command info is unbuffere
     +		test_cmp expect err
     +	'
     +done
    ++
    ++test_expect_success 'objects filter: disabled' '
    ++	git -C repo cat-file --batch-check="%(objectname)" --batch-all-objects --no-filter >actual &&
    ++	sort actual >actual.sorted &&
    ++	git -C repo rev-list --objects --no-object-names --all >expect &&
    ++	sort expect >expect.sorted &&
    ++	test_cmp expect.sorted actual.sorted
    ++'
     +
      test_done
 3:  ec1d0c63de6 !  4:  878ae8e2a76 builtin/cat-file: support "blob:none" objects filter
    @@ Commit message
         Implement support for the "blob:none" filter in git-cat-file(1), which
         causes us to omit all blobs.
     
    +    Note that this new filter requires us to read the object type via
    +    `oid_object_info_extended()` in `batch_object_write()`. But as we try to
    +    optimize away reading objects from the database the `data->info.typep`
    +    pointer may not be set. We thus have to adapt the logic to conditionally
    +    set the pointer in cases where the filter is given.
    +
         Signed-off-by: Patrick Steinhardt <ps@xxxxxx>
     
      ## Documentation/git-cat-file.adoc ##
     @@ Documentation/git-cat-file.adoc: OPTIONS
    - 	Omit objects from the list of printed objects. This can only be used in
    - 	combination with one of the batched modes. The '<filter-spec>' may be
    - 	one of the following:
    + 	been explicitly requested via any of the batch modes that read objects
    + 	via standard input (`--batch`, `--batch-check`) will be reported as
    + 	"filtered". Excluded objects in `--batch-all-objects` mode will not be
    +-	printed at all. No filters are supported yet.
    ++	printed at all. The '<filter-spec>' may be one of the following:
     ++
     +The form '--filter=blob:none' omits all blobs.
      
    @@ builtin/cat-file.c: static void batch_object_write(const char *obj_name,
      		case LOFC_DISABLED:
      			break;
     +		case LOFC_BLOB_NONE:
    -+			if (data->type == OBJ_BLOB)
    ++			if (data->type == OBJ_BLOB) {
    ++				if (!opt->all_objects)
    ++					report_object_status(opt, obj_name,
    ++							     &data->oid, "excluded");
     +				return;
    ++			}
     +			break;
      		default:
      			BUG("unsupported objects filter");
    @@ t/t1006-cat-file.sh: test_expect_success 'objects filter with unknown option' '
      do
      	test_expect_success "objects filter with unsupported option $option" '
      		case "$option" in
    -@@ t/t1006-cat-file.sh: do
    - 	'
    - done
    +@@ t/t1006-cat-file.sh: test_expect_success 'objects filter: disabled' '
    + 	test_cmp expect.sorted actual.sorted
    + '
      
     +test_objects_filter () {
     +	filter="$1"
    @@ t/t1006-cat-file.sh: do
     +		sort expect >expect.sorted &&
     +		test_cmp expect.sorted actual.sorted
     +	'
    ++
    ++	test_expect_success "objects filter prints excluded objects: $filter" '
    ++		# Find all objects that would be excluded by the current filter.
    ++		git -C repo rev-list --objects --no-object-names --all >all &&
    ++		git -C repo rev-list --objects --no-object-names --all --filter="$filter" --filter-provided-objects >filtered &&
    ++		sort all >all.sorted &&
    ++		sort filtered >filtered.sorted &&
    ++		comm -23 all.sorted filtered.sorted >expected.excluded &&
    ++		test_line_count -gt 0 expected.excluded &&
    ++
    ++		git -C repo cat-file --batch-check="%(objectname)" --filter="$filter" <expected.excluded >actual &&
    ++		awk "/excluded/{ print \$1 }" actual | sort >actual.excluded &&
    ++		test_cmp expected.excluded actual.excluded
    ++	'
     +}
     +
     +test_objects_filter "blob:none"
 4:  a3ed054994d !  5:  a88d5d4b60a builtin/cat-file: support "blob:limit=" objects filter
    @@ Commit message
     
      ## Documentation/git-cat-file.adoc ##
     @@ Documentation/git-cat-file.adoc: OPTIONS
    - 	one of the following:
    + 	printed at all. The '<filter-spec>' may be one of the following:
      +
      The form '--filter=blob:none' omits all blobs.
     ++
    @@ builtin/cat-file.c: static void batch_object_write(const char *obj_name,
      		if (pack)
      			ret = packed_object_info(the_repository, pack, offset,
     @@ builtin/cat-file.c: static void batch_object_write(const char *obj_name,
    - 			if (data->type == OBJ_BLOB)
      				return;
    + 			}
      			break;
     +		case LOFC_BLOB_LIMIT:
     +			if (data->type == OBJ_BLOB &&
    -+			    data->size >= opt->objects_filter.blob_limit_value)
    ++			    data->size >= opt->objects_filter.blob_limit_value) {
    ++				if (!opt->all_objects)
    ++					report_object_status(opt, obj_name,
    ++							     &data->oid, "excluded");
     +				return;
    ++			}
     +			break;
      		default:
      			BUG("unsupported objects filter");
    @@ t/t1006-cat-file.sh: test_objects_filter () {
     +test_objects_filter "blob:limit=1"
     +test_objects_filter "blob:limit=500"
     +test_objects_filter "blob:limit=1000"
    -+test_objects_filter "blob:limit=1g"
    ++test_objects_filter "blob:limit=1k"
      
      test_done
 5:  8e39cd218c2 !  6:  13be54300c9 builtin/cat-file: support "object:type=" objects filter
    @@ builtin/cat-file.c: static void batch_object_write(const char *obj_name,
      		if (opt->objects_filter.choice == LOFC_BLOB_LIMIT)
      			data->info.sizep = &data->size;
     @@ builtin/cat-file.c: static void batch_object_write(const char *obj_name,
    - 			    data->size >= opt->objects_filter.blob_limit_value)
      				return;
    + 			}
      			break;
     +		case LOFC_OBJECT_TYPE:
    -+			if (data->type != opt->objects_filter.object_type)
    ++			if (data->type != opt->objects_filter.object_type) {
    ++				if (!opt->all_objects)
    ++					report_object_status(opt, obj_name,
    ++							     &data->oid, "excluded");
     +				return;
    ++			}
     +			break;
      		default:
      			BUG("unsupported objects filter");
    @@ t/t1006-cat-file.sh: test_expect_success 'objects filter with unknown option' '
     @@ t/t1006-cat-file.sh: test_objects_filter "blob:limit=1"
      test_objects_filter "blob:limit=500"
      test_objects_filter "blob:limit=1000"
    - test_objects_filter "blob:limit=1g"
    + test_objects_filter "blob:limit=1k"
     +test_objects_filter "object:type=blob"
     +test_objects_filter "object:type=commit"
     +test_objects_filter "object:type=tag"
 6:  a0655de3ace =  7:  d525a5bc2ef pack-bitmap: allow passing payloads to `show_reachable_fn()`
 7:  e1e44303dac =  8:  e3cc1ae3a87 pack-bitmap: add function to iterate over filtered bitmapped objects
 8:  23bc040bb15 =  9:  c0fc0e4ce0c pack-bitmap: introduce function to check whether a pack is bitmapped
 9:  4eba2a70619 = 10:  28ef93dceec builtin/cat-file: deduplicate logic to iterate over all objects
10:  d40f1924ef5 = 11:  842a6002c50 builtin/cat-file: use bitmaps to efficiently filter by object type

---
base-commit: 003c5f45b8447877015b2a23ceab2297638fe1f1
change-id: 20250220-pks-cat-file-object-type-filter-9140c0ed5ee1





[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux