Patrick Steinhardt <ps@xxxxxx> writes: > Hi, > > at GitLab, we sometimes have the need to list all objects regardless of > their reachability. We use git-cat-file(1) with `--batch-all-objects` to > do this, and typically this is quite a good fit. In some cases though, > we only want to list objects of a specific type, where we then basically > have the following pipeline: > > git cat-file --batch-all-objects --batch-check='%(objecttype) %(objectname)' | > grep '^commit ' | > cut -d' ' -f2 | > git cat-file --batch > > This works okayish in medium-sized repositories, but once you reach a > certain size this isn't really an option anymore. In the Chromium > repository for example [1] simply listing all objects in the first > invocation of git-cat-file(1) takes around 80 to 100 seconds. The > workload is completely I/O-bottlenecked: my machine reads at ~500MB/s, > and the packfile is 50GB in size, which matches the 100 seconds that I > observe. > > This series addresses the issue by introducing object filters into > git-cat-file(1). These object filters use the exact same syntax as the > filters we have in git-rev-list(1), but only a subset of them is > supported because not all filters can be computed by git-cat-file(1). > Supported are "blob:none", "blob:limit=" as well as "object:type=". > > The filters alone don't really help though: we still have to scan > through the whole packfile in order to compute the packfiles. While we > are able to shed a bit of CPU time because we can stop emitting some of > the objects, we're still I/O-bottlenecked. > > The second part of the series thus expands the filters so that they can > make use of bitmap indices for some of the filters, if available. This > allows us to efficiently answer the question where to find all objects > of a specific type, and thus we can avoid scanning through the packfile > and instead directly look up relevant objects, leading to a significant > speedup: > > Benchmark 1: cat-file with filter=object:type=commit (revision = HEAD~) > Time (mean ± σ): 86.444 s ± 4.081 s [User: 36.830 s, System: 11.312 s] > Range (min … max): 80.305 s … 93.104 s 10 runs > > Benchmark 2: cat-file with filter=object:type=commit (revision = HEAD) > Time (mean ± σ): 2.089 s ± 0.015 s [User: 1.872 s, System: 0.207 s] > Range (min … max): 2.073 s … 2.119 s 10 runs > > Summary > cat-file with filter=object:type=commit (revision = HEAD) ran > 41.38 ± 1.98 times faster than cat-file with filter=object:type=commit (revision = HEAD~) > > We now directly scale with the number of objects of a specific type > contained in the packfile instead of scaling with the overall number of > objects. It's quite fun to see how the math plays out: if you sum up the > times for each of the types you arrive at the time for the unfiltered > case. > > Changes in v2: > - The series is now built on top of "master" at 683c54c999c (Git 2.49, > 2025-03-14) with "tb/incremental-midx-part-2" at 27afc272c49 (midx: > implement writing incremental MIDX bitmaps, 2025-03-20) merged into > it. > - Rename the filter options to "--filter=" to match > git-pack-objects(1). > - The bitmap-filtering is now reusing existing mechanisms that we > already have in "pack-bitmap.c", as proposed by Taylor. > - Link to v1: https://lore.kernel.org/r/20250221-pks-cat-file-object-type-filter-v1-0-0852530888e2@xxxxxx > > Changes in v3: > - Wrap some overly long lines. > - Better describe how filters interact with the different batch modes. > - Adapt the format with `--batch` and `--batch-check` so that we tell > the user that the object has been excluded. > - Add a test for "--no-filter". > - Use `OPT_PARSE_LIST_OBJECTS_FILTER()`. > - Link to v2: https://lore.kernel.org/r/20250327-pks-cat-file-object-type-filter-v2-0-4bbc7085d7c5@xxxxxx > > Thanks! > > Patrick > > [1]: https://github.com/chromium/chromium.git > > --- > Patrick Steinhardt (11): > builtin/cat-file: rename variable that tracks usage > builtin/cat-file: introduce function to report object status > builtin/cat-file: wire up an option to filter objects > builtin/cat-file: support "blob:none" objects filter > builtin/cat-file: support "blob:limit=" objects filter > builtin/cat-file: support "object:type=" objects filter > pack-bitmap: allow passing payloads to `show_reachable_fn()` > pack-bitmap: add function to iterate over filtered bitmapped objects > pack-bitmap: introduce function to check whether a pack is bitmapped > builtin/cat-file: deduplicate logic to iterate over all objects > builtin/cat-file: use bitmaps to efficiently filter by object type > > Documentation/git-cat-file.adoc | 26 ++++ > builtin/cat-file.c | 256 +++++++++++++++++++++++++++++----------- > builtin/pack-objects.c | 3 +- > builtin/rev-list.c | 3 +- > pack-bitmap.c | 81 +++++++++++-- > pack-bitmap.h | 22 +++- > reachable.c | 3 +- > t/t1006-cat-file.sh | 99 ++++++++++++++++ > 8 files changed, 411 insertions(+), 82 deletions(-) > > Range-diff versus v2: > > 1: a75888e0bf4 ! 1: b0642b6c495 builtin/cat-file: rename variable that tracks usage > @@ builtin/cat-file.c: int cmd_cat_file(int argc, > ; > else if (batch.follow_symlinks) > - usage_msg_optf(_("'%s' requires a batch mode"), usage, options, > -+ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options, > - "--follow-symlinks"); > +- "--follow-symlinks"); > ++ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, > ++ options, "--follow-symlinks"); > else if (batch.buffer_output >= 0) > - usage_msg_optf(_("'%s' requires a batch mode"), usage, options, > -+ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options, > - "--buffer"); > +- "--buffer"); > ++ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, > ++ options, "--buffer"); > else if (batch.all_objects) > - usage_msg_optf(_("'%s' requires a batch mode"), usage, options, > -+ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options, > - "--batch-all-objects"); > +- "--batch-all-objects"); > ++ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, > ++ options, "--batch-all-objects"); > else if (input_nul_terminated) > - usage_msg_optf(_("'%s' requires a batch mode"), usage, options, > -+ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options, > - "-z"); > +- "-z"); > ++ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, > ++ options, "-z"); > else if (nul_terminated) > - usage_msg_optf(_("'%s' requires a batch mode"), usage, options, > -+ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options, > - "-Z"); > +- "-Z"); > ++ usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, > ++ options, "-Z"); > > batch.input_delim = batch.output_delim = '\n'; > + if (input_nul_terminated) > @@ builtin/cat-file.c: int cmd_cat_file(int argc, > batch.transform_mode = opt; > else if (opt && opt != 'b') > @@ builtin/cat-file.c: int cmd_cat_file(int argc, > + builtin_catfile_usage, options, opt); > else if (argc) > - usage_msg_opt(_("batch modes take no arguments"), usage, > -+ usage_msg_opt(_("batch modes take no arguments"), builtin_catfile_usage, > - options); > +- options); > ++ usage_msg_opt(_("batch modes take no arguments"), > ++ builtin_catfile_usage, options); > > return batch_objects(&batch); > + } > @@ builtin/cat-file.c: int cmd_cat_file(int argc, > if (opt) { > if (!argc && opt == 'c') > usage_msg_optf(_("<rev> required with '%s'"), > - usage, options, "--textconv"); > -+ builtin_catfile_usage, options, "--textconv"); > ++ builtin_catfile_usage, options, > ++ "--textconv"); > else if (!argc && opt == 'w') > usage_msg_optf(_("<rev> required with '%s'"), > - usage, options, "--filters"); > -+ builtin_catfile_usage, options, "--filters"); > ++ builtin_catfile_usage, options, > ++ "--filters"); > else if (!argc && opt_epts) > usage_msg_optf(_("<object> required with '-%c'"), > - usage, options, opt); > @@ builtin/cat-file.c: int cmd_cat_file(int argc, > obj_name = argv[0]; > else > - usage_msg_opt(_("too many arguments"), usage, options); > -+ usage_msg_opt(_("too many arguments"), builtin_catfile_usage, options); > ++ usage_msg_opt(_("too many arguments"), builtin_catfile_usage, > ++ options); > } else if (!argc) { > - usage_with_options(usage, options); > + usage_with_options(builtin_catfile_usage, options); > -: ----------- > 2: 18353ba706d builtin/cat-file: introduce function to report object status > 2: bee9407c1a9 ! 3: 1e46af5d07b builtin/cat-file: wire up an option to filter objects > @@ Documentation/git-cat-file.adoc: OPTIONS > +--filter=<filter-spec>:: > +--no-filter:: > + Omit objects from the list of printed objects. This can only be used in > -+ combination with one of the batched modes. The '<filter-spec>' may be > -+ one of the following: > ++ combination with one of the batched modes. Excluded objects that have > ++ been explicitly requested via any of the batch modes that read objects > ++ via standard input (`--batch`, `--batch-check`) will be reported as > ++ "filtered". Excluded objects in `--batch-all-objects` mode will not be > ++ printed at all. No filters are supported yet. > + > --path=<path>:: > For use with `--textconv` or `--filters`, to allow specifying an object > name and a path separately, e.g. when it is difficult to figure out > +@@ Documentation/git-cat-file.adoc: the repository, then `cat-file` will ignore any custom format and print: > + <object> SP missing LF > + ------------ > + > ++If a name is specified on stdin that is filtered out via `--filter=`, > ++then `cat-file` will ignore any custom format and print: > ++ > ++------------ > ++<object> SP excluded LF > ++------------ > ++ > + If a name is specified that might refer to more than one object (an ambiguous short sha), then `cat-file` will ignore any custom format and print: > + > + ------------ > > ## builtin/cat-file.c ## > @@ > @@ builtin/cat-file.c: int cmd_cat_file(int argc, > N_("run filters on object's content"), 'w'), > OPT_STRING(0, "path", &force_path, N_("blob|tree"), > N_("use a <path> for (--textconv | --filters); Not with 'batch'")), > -+ OPT_CALLBACK(0, "filter", &batch.objects_filter, N_("args"), > -+ N_("object filtering"), opt_parse_list_objects_filter), > ++ OPT_PARSE_LIST_OBJECTS_FILTER(&batch.objects_filter), > OPT_END() > }; > > @@ builtin/cat-file.c: int cmd_cat_file(int argc, > if (opt == 'b') > batch.all_objects = 1; > @@ builtin/cat-file.c: int cmd_cat_file(int argc, > - usage_msg_opt(_("batch modes take no arguments"), builtin_catfile_usage, > - options); > + usage_msg_opt(_("batch modes take no arguments"), > + builtin_catfile_usage, options); > > - return batch_objects(&batch); > + ret = batch_objects(&batch); > @@ t/t1006-cat-file.sh: test_expect_success PERL '--batch-command info is unbuffere > + test_cmp expect err > + ' > +done > ++ > ++test_expect_success 'objects filter: disabled' ' > ++ git -C repo cat-file --batch-check="%(objectname)" --batch-all-objects --no-filter >actual && > ++ sort actual >actual.sorted && > ++ git -C repo rev-list --objects --no-object-names --all >expect && > ++ sort expect >expect.sorted && > ++ test_cmp expect.sorted actual.sorted > ++' > + > test_done > 3: ec1d0c63de6 ! 4: 878ae8e2a76 builtin/cat-file: support "blob:none" objects filter > @@ Commit message > Implement support for the "blob:none" filter in git-cat-file(1), which > causes us to omit all blobs. > > + Note that this new filter requires us to read the object type via > + `oid_object_info_extended()` in `batch_object_write()`. But as we try to > + optimize away reading objects from the database the `data->info.typep` > + pointer may not be set. We thus have to adapt the logic to conditionally > + set the pointer in cases where the filter is given. > + > Signed-off-by: Patrick Steinhardt <ps@xxxxxx> > > ## Documentation/git-cat-file.adoc ## > @@ Documentation/git-cat-file.adoc: OPTIONS > - Omit objects from the list of printed objects. This can only be used in > - combination with one of the batched modes. The '<filter-spec>' may be > - one of the following: > + been explicitly requested via any of the batch modes that read objects > + via standard input (`--batch`, `--batch-check`) will be reported as > + "filtered". Excluded objects in `--batch-all-objects` mode will not be > +- printed at all. No filters are supported yet. > ++ printed at all. The '<filter-spec>' may be one of the following: > ++ > +The form '--filter=blob:none' omits all blobs. > > @@ builtin/cat-file.c: static void batch_object_write(const char *obj_name, > case LOFC_DISABLED: > break; > + case LOFC_BLOB_NONE: > -+ if (data->type == OBJ_BLOB) > ++ if (data->type == OBJ_BLOB) { > ++ if (!opt->all_objects) > ++ report_object_status(opt, obj_name, > ++ &data->oid, "excluded"); > + return; > ++ } > + break; > default: > BUG("unsupported objects filter"); > @@ t/t1006-cat-file.sh: test_expect_success 'objects filter with unknown option' ' > do > test_expect_success "objects filter with unsupported option $option" ' > case "$option" in > -@@ t/t1006-cat-file.sh: do > - ' > - done > +@@ t/t1006-cat-file.sh: test_expect_success 'objects filter: disabled' ' > + test_cmp expect.sorted actual.sorted > + ' > > +test_objects_filter () { > + filter="$1" > @@ t/t1006-cat-file.sh: do > + sort expect >expect.sorted && > + test_cmp expect.sorted actual.sorted > + ' > ++ > ++ test_expect_success "objects filter prints excluded objects: $filter" ' > ++ # Find all objects that would be excluded by the current filter. > ++ git -C repo rev-list --objects --no-object-names --all >all && > ++ git -C repo rev-list --objects --no-object-names --all --filter="$filter" --filter-provided-objects >filtered && > ++ sort all >all.sorted && > ++ sort filtered >filtered.sorted && > ++ comm -23 all.sorted filtered.sorted >expected.excluded && > ++ test_line_count -gt 0 expected.excluded && > ++ > ++ git -C repo cat-file --batch-check="%(objectname)" --filter="$filter" <expected.excluded >actual && > ++ awk "/excluded/{ print \$1 }" actual | sort >actual.excluded && > ++ test_cmp expected.excluded actual.excluded > ++ ' > +} > + > +test_objects_filter "blob:none" > 4: a3ed054994d ! 5: a88d5d4b60a builtin/cat-file: support "blob:limit=" objects filter > @@ Commit message > > ## Documentation/git-cat-file.adoc ## > @@ Documentation/git-cat-file.adoc: OPTIONS > - one of the following: > + printed at all. The '<filter-spec>' may be one of the following: > + > The form '--filter=blob:none' omits all blobs. > ++ > @@ builtin/cat-file.c: static void batch_object_write(const char *obj_name, > if (pack) > ret = packed_object_info(the_repository, pack, offset, > @@ builtin/cat-file.c: static void batch_object_write(const char *obj_name, > - if (data->type == OBJ_BLOB) > return; > + } > break; > + case LOFC_BLOB_LIMIT: > + if (data->type == OBJ_BLOB && > -+ data->size >= opt->objects_filter.blob_limit_value) > ++ data->size >= opt->objects_filter.blob_limit_value) { > ++ if (!opt->all_objects) > ++ report_object_status(opt, obj_name, > ++ &data->oid, "excluded"); > + return; > ++ } > + break; > default: > BUG("unsupported objects filter"); > @@ t/t1006-cat-file.sh: test_objects_filter () { > +test_objects_filter "blob:limit=1" > +test_objects_filter "blob:limit=500" > +test_objects_filter "blob:limit=1000" > -+test_objects_filter "blob:limit=1g" > ++test_objects_filter "blob:limit=1k" > > test_done > 5: 8e39cd218c2 ! 6: 13be54300c9 builtin/cat-file: support "object:type=" objects filter > @@ builtin/cat-file.c: static void batch_object_write(const char *obj_name, > if (opt->objects_filter.choice == LOFC_BLOB_LIMIT) > data->info.sizep = &data->size; > @@ builtin/cat-file.c: static void batch_object_write(const char *obj_name, > - data->size >= opt->objects_filter.blob_limit_value) > return; > + } > break; > + case LOFC_OBJECT_TYPE: > -+ if (data->type != opt->objects_filter.object_type) > ++ if (data->type != opt->objects_filter.object_type) { > ++ if (!opt->all_objects) > ++ report_object_status(opt, obj_name, > ++ &data->oid, "excluded"); > + return; > ++ } > + break; > default: > BUG("unsupported objects filter"); > @@ t/t1006-cat-file.sh: test_expect_success 'objects filter with unknown option' ' > @@ t/t1006-cat-file.sh: test_objects_filter "blob:limit=1" > test_objects_filter "blob:limit=500" > test_objects_filter "blob:limit=1000" > - test_objects_filter "blob:limit=1g" > + test_objects_filter "blob:limit=1k" > +test_objects_filter "object:type=blob" > +test_objects_filter "object:type=commit" > +test_objects_filter "object:type=tag" > 6: a0655de3ace = 7: d525a5bc2ef pack-bitmap: allow passing payloads to `show_reachable_fn()` > 7: e1e44303dac = 8: e3cc1ae3a87 pack-bitmap: add function to iterate over filtered bitmapped objects > 8: 23bc040bb15 = 9: c0fc0e4ce0c pack-bitmap: introduce function to check whether a pack is bitmapped > 9: 4eba2a70619 = 10: 28ef93dceec builtin/cat-file: deduplicate logic to iterate over all objects > 10: d40f1924ef5 = 11: 842a6002c50 builtin/cat-file: use bitmaps to efficiently filter by object type > Thanks for the new version, the range-diff looks good. Good that you also added a test for "excluded" message too. > --- > base-commit: 003c5f45b8447877015b2a23ceab2297638fe1f1 > change-id: 20250220-pks-cat-file-object-type-filter-9140c0ed5ee1
Attachment:
signature.asc
Description: PGP signature