Hi,

at GitLab, we sometimes have the need to list all objects regardless of
their reachability. We use git-cat-file(1) with `--batch-all-objects` to
do this, and typically this is quite a good fit. In some cases though,
we only want to list objects of a specific type, where we then basically
have the following pipeline:

    git cat-file --batch-all-objects --batch-check='%(objecttype) %(objectname)' |
        grep '^commit ' |
        cut -d' ' -f2 |
        git cat-file --batch

This works okayish in medium-sized repositories, but once you reach a
certain size this isn't really an option anymore. In the Chromium
repository for example [1], simply listing all objects in the first
invocation of git-cat-file(1) takes around 80 to 100 seconds. The
workload is completely I/O-bottlenecked: my machine reads at ~500MB/s,
and the packfile is 50GB in size, which matches the 100 seconds that I
observe.

This series addresses the issue by introducing object filters into
git-cat-file(1). These object filters use the exact same syntax as the
filters we have in git-rev-list(1), but only a subset of them is
supported because not all filters can be computed by git-cat-file(1).
Supported are "blob:none", "blob:limit=" as well as "object:type=".

The filters alone don't really help though: we still have to scan
through the whole packfile in order to compute the set of matching
objects. While we are able to shed a bit of CPU time because we can stop
emitting some of the objects, we're still I/O-bottlenecked.

The second part of the series thus expands the filters so that they can
make use of bitmap indices for some of the filters, if available.

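For illustration, the pipeline described above and its filtered
counterpart can be sketched as two small shell helpers. This is only a
sketch: the function names are hypothetical and exist purely for this
example, and the second helper assumes a git build that already contains
this series, because `--filter=` is not accepted by git-cat-file(1)
otherwise.

```shell
# Hypothetical helper: list all commit IDs in the repository at "$1",
# reachable or not, using the grep/cut pipeline from the cover letter.
# This has to enumerate every object in the object database.
list_commits_unfiltered () {
	git -C "$1" cat-file --batch-all-objects \
		--batch-check='%(objecttype) %(objectname)' |
	grep '^commit ' |
	cut -d' ' -f2
}

# Hypothetical helper: the same listing expressed with an object filter.
# Assumption: git was built with this series applied; older versions do
# not know "--filter=" in git-cat-file(1) and will error out.
list_commits_filtered () {
	git -C "$1" cat-file --batch-all-objects \
		--batch-check='%(objectname)' \
		--filter=object:type=commit
}
```

Both helpers are meant to print the same set of object IDs. Replacing
`--batch-check=...` with `--batch` in the second helper would print the
object contents as well, which corresponds to the last stage of the
original pipeline.
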
Bitmap indices allow us to efficiently answer the question of where to
find all objects of a specific type, and thus we can avoid scanning
through the packfile and instead directly look up relevant objects,
leading to a significant speedup:

    Benchmark 1: cat-file with filter=object:type=commit (revision = HEAD~)
      Time (mean ± σ):     86.444 s ±  4.081 s    [User: 36.830 s, System: 11.312 s]
      Range (min … max):   80.305 s … 93.104 s    10 runs

    Benchmark 2: cat-file with filter=object:type=commit (revision = HEAD)
      Time (mean ± σ):      2.089 s ±  0.015 s    [User: 1.872 s, System: 0.207 s]
      Range (min … max):    2.073 s …  2.119 s    10 runs

    Summary
      cat-file with filter=object:type=commit (revision = HEAD) ran
       41.38 ± 1.98 times faster than cat-file with filter=object:type=commit (revision = HEAD~)

We now directly scale with the number of objects of a specific type
contained in the packfile instead of scaling with the overall number of
objects. It's quite fun to see how the math plays out: if you sum up the
times for each of the types you arrive at the time for the unfiltered
case.

Changes in v2:

  - The series is now built on top of "master" at 683c54c999c (Git 2.49,
    2025-03-14) with "tb/incremental-midx-part-2" at 27afc272c49 (midx:
    implement writing incremental MIDX bitmaps, 2025-03-20) merged into
    it.
  - Rename the filter options to "--filter=" to match
    git-pack-objects(1).
  - The bitmap-filtering is now reusing existing mechanisms that we
    already have in "pack-bitmap.c", as proposed by Taylor.
  - Link to v1: https://lore.kernel.org/r/20250221-pks-cat-file-object-type-filter-v1-0-0852530888e2@xxxxxx

Changes in v3:

  - Wrap some overly long lines.
  - Better describe how filters interact with the different batch modes.
  - Adapt the format with `--batch` and `--batch-check` so that we tell
    the user that the object has been excluded.
  - Add a test for "--no-filter".
  - Use `OPT_PARSE_LIST_OBJECTS_FILTER()`.
  - Link to v2: https://lore.kernel.org/r/20250327-pks-cat-file-object-type-filter-v2-0-4bbc7085d7c5@xxxxxx

Thanks!

Patrick

[1]: https://github.com/chromium/chromium.git

---
Patrick Steinhardt (11):
      builtin/cat-file: rename variable that tracks usage
      builtin/cat-file: introduce function to report object status
      builtin/cat-file: wire up an option to filter objects
      builtin/cat-file: support "blob:none" objects filter
      builtin/cat-file: support "blob:limit=" objects filter
      builtin/cat-file: support "object:type=" objects filter
      pack-bitmap: allow passing payloads to `show_reachable_fn()`
      pack-bitmap: add function to iterate over filtered bitmapped objects
      pack-bitmap: introduce function to check whether a pack is bitmapped
      builtin/cat-file: deduplicate logic to iterate over all objects
      builtin/cat-file: use bitmaps to efficiently filter by object type

 Documentation/git-cat-file.adoc |  26 ++++
 builtin/cat-file.c              | 256 +++++++++++++++++++++++++++++-----------
 builtin/pack-objects.c          |   3 +-
 builtin/rev-list.c              |   3 +-
 pack-bitmap.c                   |  81 +++++++++++--
 pack-bitmap.h                   |  22 +++-
 reachable.c                     |   3 +-
 t/t1006-cat-file.sh             |  99 ++++++++++++++++
 8 files changed, 411 insertions(+), 82 deletions(-)

Range-diff versus v2:

 1:  a75888e0bf4 !  1:  b0642b6c495 builtin/cat-file: rename variable that tracks usage
    @@ builtin/cat-file.c: int cmd_cat_file(int argc,
                      ;
              else if (batch.follow_symlinks)
     -                usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
    -+                usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
    -                                "--follow-symlinks");
    +-                               "--follow-symlinks");
    ++                usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage,
    ++                               options, "--follow-symlinks");
              else if (batch.buffer_output >= 0)
     -                usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
    -+                usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
    -                                "--buffer");
    +-                               "--buffer");
    ++                usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage,
    ++                               options, "--buffer");
              else if (batch.all_objects)
     -                usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
    -+                usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
    -                                "--batch-all-objects");
    +-                               "--batch-all-objects");
    ++                usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage,
    ++                               options, "--batch-all-objects");
              else if (input_nul_terminated)
     -                usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
    -+                usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
    -                                "-z");
    +-                               "-z");
    ++                usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage,
    ++                               options, "-z");
              else if (nul_terminated)
     -                usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
    -+                usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage, options,
    -                                "-Z");
    +-                               "-Z");
    ++                usage_msg_optf(_("'%s' requires a batch mode"), builtin_catfile_usage,
    ++                               options, "-Z");

              batch.input_delim = batch.output_delim = '\n';
    +         if (input_nul_terminated)
     @@ builtin/cat-file.c: int cmd_cat_file(int argc,
                              batch.transform_mode = opt;
                      else if (opt && opt != 'b')
    @@ builtin/cat-file.c: int cmd_cat_file(int argc,
     +                                       builtin_catfile_usage, options, opt);
                      else if (argc)
     -                        usage_msg_opt(_("batch modes take no arguments"), usage,
    -+                        usage_msg_opt(_("batch modes take no arguments"), builtin_catfile_usage,
    -                                       options);
    +-                                      options);
    ++                        usage_msg_opt(_("batch modes take no arguments"),
    ++                                      builtin_catfile_usage, options);

                      return batch_objects(&batch);
    +         }
     @@ builtin/cat-file.c: int cmd_cat_file(int argc,
              if (opt) {
                      if (!argc && opt == 'c')
                              usage_msg_optf(_("<rev> required with '%s'"),
     -                                       usage, options, "--textconv");
    -+                                       builtin_catfile_usage, options, "--textconv");
    ++                                       builtin_catfile_usage, options,
    ++                                       "--textconv");
                      else if (!argc && opt == 'w')
                              usage_msg_optf(_("<rev> required with '%s'"),
     -                                       usage, options, "--filters");
    -+                                       builtin_catfile_usage, options, "--filters");
    ++                                       builtin_catfile_usage, options,
    ++                                       "--filters");
                      else if (!argc && opt_epts)
                              usage_msg_optf(_("<object> required with '-%c'"),
     -                                       usage, options, opt);
    @@ builtin/cat-file.c: int cmd_cat_file(int argc,
                              obj_name = argv[0];
                      else
     -                        usage_msg_opt(_("too many arguments"), usage, options);
    -+                        usage_msg_opt(_("too many arguments"), builtin_catfile_usage, options);
    ++                        usage_msg_opt(_("too many arguments"), builtin_catfile_usage,
    ++                                      options);
              } else if (!argc) {
     -                usage_with_options(usage, options);
     +                usage_with_options(builtin_catfile_usage, options);
 -:  ----------- >  2:  18353ba706d builtin/cat-file: introduce function to report object status
 2:  bee9407c1a9 !  3:  1e46af5d07b builtin/cat-file: wire up an option to filter objects
    @@ Documentation/git-cat-file.adoc: OPTIONS
     +--filter=<filter-spec>::
     +--no-filter::
     +        Omit objects from the list of printed objects. This can only be used in
    -+        combination with one of the batched modes. The '<filter-spec>' may be
    -+        one of the following:
    ++        combination with one of the batched modes. Excluded objects that have
    ++        been explicitly requested via any of the batch modes that read objects
    ++        via standard input (`--batch`, `--batch-check`) will be reported as
    ++        "filtered". Excluded objects in `--batch-all-objects` mode will not be
    ++        printed at all. No filters are supported yet.
     +
      --path=<path>::
              For use with `--textconv` or `--filters`, to allow specifying an object
              name and a path separately, e.g. when it is difficult to figure out
    +@@ Documentation/git-cat-file.adoc: the repository, then `cat-file` will ignore any custom format and print:
    + <object> SP missing LF
    + ------------
    +
    ++If a name is specified on stdin that is filtered out via `--filter=`,
    ++then `cat-file` will ignore any custom format and print:
    ++
    ++------------
    ++<object> SP excluded LF
    ++------------
    ++
    + If a name is specified that might refer to more than one object (an ambiguous short sha), then `cat-file` will ignore any custom format and print:
    +
    + ------------

      ## builtin/cat-file.c ##
     @@
    @@ builtin/cat-file.c: int cmd_cat_file(int argc,
                                  N_("run filters on object's content"), 'w'),
                      OPT_STRING(0, "path", &force_path, N_("blob|tree"),
                                 N_("use a <path> for (--textconv | --filters); Not with 'batch'")),
    -+                OPT_CALLBACK(0, "filter", &batch.objects_filter, N_("args"),
    -+                             N_("object filtering"), opt_parse_list_objects_filter),
    ++                OPT_PARSE_LIST_OBJECTS_FILTER(&batch.objects_filter),
                      OPT_END()
              };

    @@ builtin/cat-file.c: int cmd_cat_file(int argc,
                      if (opt == 'b')
                              batch.all_objects = 1;
     @@ builtin/cat-file.c: int cmd_cat_file(int argc,
    -                         usage_msg_opt(_("batch modes take no arguments"), builtin_catfile_usage,
    -                                       options);
    +                         usage_msg_opt(_("batch modes take no arguments"),
    +                                       builtin_catfile_usage, options);

     -                return batch_objects(&batch);
     +                ret = batch_objects(&batch);
    @@ t/t1006-cat-file.sh: test_expect_success PERL '--batch-command info is unbuffere
     +                test_cmp expect err
     +        '
     +done
    ++
    ++test_expect_success 'objects filter: disabled' '
    ++        git -C repo cat-file --batch-check="%(objectname)" --batch-all-objects --no-filter >actual &&
    ++        sort actual >actual.sorted &&
    ++        git -C repo rev-list --objects --no-object-names --all >expect &&
    ++        sort expect >expect.sorted &&
    ++        test_cmp expect.sorted actual.sorted
    ++'
     +
      test_done
 3:  ec1d0c63de6 !  4:  878ae8e2a76 builtin/cat-file: support "blob:none" objects filter
    @@ Commit message
          Implement support for the "blob:none" filter in git-cat-file(1), which
          causes us to omit all blobs.

    +     Note that this new filter requires us to read the object type via
    +     `oid_object_info_extended()` in `batch_object_write()`. But as we try to
    +     optimize away reading objects from the database the `data->info.typep`
    +     pointer may not be set. We thus have to adapt the logic to conditionally
    +     set the pointer in cases where the filter is given.
    +
          Signed-off-by: Patrick Steinhardt <ps@xxxxxx>

      ## Documentation/git-cat-file.adoc ##
     @@ Documentation/git-cat-file.adoc: OPTIONS
    -         Omit objects from the list of printed objects. This can only be used in
    -         combination with one of the batched modes. The '<filter-spec>' may be
    -         one of the following:
    +         been explicitly requested via any of the batch modes that read objects
    +         via standard input (`--batch`, `--batch-check`) will be reported as
    +         "filtered". Excluded objects in `--batch-all-objects` mode will not be
    +-        printed at all. No filters are supported yet.
    ++        printed at all. The '<filter-spec>' may be one of the following:
     ++
     +The form '--filter=blob:none' omits all blobs.
    @@ builtin/cat-file.c: static void batch_object_write(const char *obj_name,
                      case LOFC_DISABLED:
                              break;
     +                case LOFC_BLOB_NONE:
    -+                        if (data->type == OBJ_BLOB)
    ++                        if (data->type == OBJ_BLOB) {
    ++                                if (!opt->all_objects)
    ++                                        report_object_status(opt, obj_name,
    ++                                                             &data->oid, "excluded");
     +                                return;
    ++                        }
     +                        break;
                      default:
                              BUG("unsupported objects filter");
    @@ t/t1006-cat-file.sh: test_expect_success 'objects filter with unknown option' '
      do
              test_expect_success "objects filter with unsupported option $option" '
                      case "$option" in
    -@@ t/t1006-cat-file.sh: do
    -         '
    - done
    +@@ t/t1006-cat-file.sh: test_expect_success 'objects filter: disabled' '
    +         test_cmp expect.sorted actual.sorted
    + '

     +test_objects_filter () {
     +        filter="$1"
    @@ t/t1006-cat-file.sh: do
     +                sort expect >expect.sorted &&
     +                test_cmp expect.sorted actual.sorted
     +        '
    ++
    ++        test_expect_success "objects filter prints excluded objects: $filter" '
    ++                # Find all objects that would be excluded by the current filter.
    ++                git -C repo rev-list --objects --no-object-names --all >all &&
    ++                git -C repo rev-list --objects --no-object-names --all --filter="$filter" --filter-provided-objects >filtered &&
    ++                sort all >all.sorted &&
    ++                sort filtered >filtered.sorted &&
    ++                comm -23 all.sorted filtered.sorted >expected.excluded &&
    ++                test_line_count -gt 0 expected.excluded &&
    ++
    ++                git -C repo cat-file --batch-check="%(objectname)" --filter="$filter" <expected.excluded >actual &&
    ++                awk "/excluded/{ print \$1 }" actual | sort >actual.excluded &&
    ++                test_cmp expected.excluded actual.excluded
    ++        '
     +}
     +
     +test_objects_filter "blob:none"
 4:  a3ed054994d !  5:  a88d5d4b60a builtin/cat-file: support "blob:limit=" objects filter
    @@ Commit message

      ## Documentation/git-cat-file.adoc ##
     @@ Documentation/git-cat-file.adoc: OPTIONS
    -         one of the following:
    +         printed at all. The '<filter-spec>' may be one of the following:
      +
      The form '--filter=blob:none' omits all blobs.
     ++
    @@ builtin/cat-file.c: static void batch_object_write(const char *obj_name,
                      if (pack)
                              ret = packed_object_info(the_repository, pack, offset,
     @@ builtin/cat-file.c: static void batch_object_write(const char *obj_name,
    -                         if (data->type == OBJ_BLOB)
                                      return;
    +                         }
                              break;
     +                case LOFC_BLOB_LIMIT:
     +                        if (data->type == OBJ_BLOB &&
    -+                            data->size >= opt->objects_filter.blob_limit_value)
    ++                            data->size >= opt->objects_filter.blob_limit_value) {
    ++                                if (!opt->all_objects)
    ++                                        report_object_status(opt, obj_name,
    ++                                                             &data->oid, "excluded");
     +                                return;
    ++                        }
     +                        break;
                      default:
                              BUG("unsupported objects filter");
    @@ t/t1006-cat-file.sh: test_objects_filter () {
     +test_objects_filter "blob:limit=1"
     +test_objects_filter "blob:limit=500"
     +test_objects_filter "blob:limit=1000"
    -+test_objects_filter "blob:limit=1g"
    ++test_objects_filter "blob:limit=1k"

      test_done
 5:  8e39cd218c2 !  6:  13be54300c9 builtin/cat-file: support "object:type=" objects filter
    @@ builtin/cat-file.c: static void batch_object_write(const char *obj_name,
                      if (opt->objects_filter.choice == LOFC_BLOB_LIMIT)
                              data->info.sizep = &data->size;
     @@ builtin/cat-file.c: static void batch_object_write(const char *obj_name,
    -                             data->size >= opt->objects_filter.blob_limit_value)
                                      return;
    +                         }
                              break;
     +                case LOFC_OBJECT_TYPE:
    -+                        if (data->type != opt->objects_filter.object_type)
    ++                        if (data->type != opt->objects_filter.object_type) {
    ++                                if (!opt->all_objects)
    ++                                        report_object_status(opt, obj_name,
    ++                                                             &data->oid, "excluded");
     +                                return;
    ++                        }
     +                        break;
                      default:
                              BUG("unsupported objects filter");
    @@ t/t1006-cat-file.sh: test_expect_success 'objects filter with unknown option' '
     @@ t/t1006-cat-file.sh: test_objects_filter "blob:limit=1"
      test_objects_filter "blob:limit=500"
      test_objects_filter "blob:limit=1000"
    - test_objects_filter "blob:limit=1g"
    + test_objects_filter "blob:limit=1k"
     +test_objects_filter "object:type=blob"
     +test_objects_filter "object:type=commit"
     +test_objects_filter "object:type=tag"
 6:  a0655de3ace =  7:  d525a5bc2ef pack-bitmap: allow passing payloads to `show_reachable_fn()`
 7:  e1e44303dac =  8:  e3cc1ae3a87 pack-bitmap: add function to iterate over filtered bitmapped objects
 8:  23bc040bb15 =  9:  c0fc0e4ce0c pack-bitmap: introduce function to check whether a pack is bitmapped
 9:  4eba2a70619 = 10:  28ef93dceec builtin/cat-file: deduplicate logic to iterate over all objects
10:  d40f1924ef5 = 11:  842a6002c50 builtin/cat-file: use bitmaps to efficiently filter by object type

---
base-commit: 003c5f45b8447877015b2a23ceab2297638fe1f1
change-id: 20250220-pks-cat-file-object-type-filter-9140c0ed5ee1