Patrick Steinhardt <ps@xxxxxx> writes: > On Fri, May 23, 2025 at 11:33:50AM +0200, Toon Claes wrote: >> Our 'git last-modified' performs a revision walk, and computes a diff at >> each point in the walk to figure out whether a given revision changed >> any of the paths it considers interesting. >> >> When changed-path Bloom filters are available, we can avoid computing >> many such diffs. Before computing a diff, we first check if any of the >> remaining paths of interest were possibly changed at a given commit by >> consulting its Bloom filter. If any of them are, we are resigned to >> compute the diff. >> >> If none of those queries returned "maybe", we know that the given commit >> doesn't contain any changed paths which are interesting to us. So, we >> can avoid computing it in this case. >> >> This results in a substantial performance speed-up in common cases of >> 'git last-modified'. In the kernel, here is the before and after (all >> times computed with best-of-five): >> >> With commit-graphs (but no Bloom filters): >> >> real 0m5.133s >> user 0m4.942s >> sys 0m0.180s >> >> ...and with Bloom filters: >> >> real 0m0.936s >> user 0m0.842s >> sys 0m0.092s >> >> These times are with my development-version of Git, so it's compiled >> without optimizations. Compiling instead with `-O3`, the results look >> even better: >> >> real 0m0.754s >> user 0m0.661s >> sys 0m0.092s > > I'm sure that the old state without bloom filters will also improve a > bit? These are the benchmarks from the original commits I took over. They are no longer really relevant, I'll remove them. >> Signed-off-by: Toon Claes <toon@xxxxxxxxx> >> --- >> last-modified.c | 44 ++++++++++++++++++++++++++++++++++++++++++++ >> 1 file changed, 44 insertions(+) >> >> diff --git a/last-modified.c b/last-modified.c >> index 9283f8fcae..f628434929 100644 >> --- a/last-modified.c >> +++ b/last-modified.c >> @@ -92,12 +99,21 @@ void last_modified_init(struct last_modified *lm, >> if (setup_revisions(argc, argv, &lm->rev, NULL) > 1) >> die(_("unknown last-modified argument: %s"), argv[1]); >> >> + (void)generation_numbers_enabled(lm->rev.repo); > > Why the `(void)` cast? And why even call this in the first place? This > definitely needs a comment and smells like funky design in our commit > graph subsystem where we rely on side effects of one function to leak > into a different function. This function calls `prepare_commit_graph()` which I think is the important side-effect. Let me add a comment. Or would you rather to see a separate function? >> + lm->rev.bloom_filter_settings = get_bloom_filter_settings(lm->rev.repo); >> + >> if (add_from_revs(lm) < 0) >> die(_("unable to setup last-modified")); >> } >> >> void last_modified_release(struct last_modified *lm) >> { >> + struct hashmap_iter iter; >> + struct last_modified_entry *ent; >> + >> + hashmap_for_each_entry(&lm->paths, &iter, ent, hashent) { >> + clear_bloom_key(&ent->key); >> + } > > The curly braces shouldn't be needed. Okay. >> @@ -180,6 +197,30 @@ static void last_modified_diff(struct diff_queue_struct *q, >> } >> } >> >> +static int maybe_changed_path(struct last_modified *lm, struct commit *origin) >> +{ >> + struct bloom_filter *filter; >> + struct last_modified_entry *ent; >> + struct hashmap_iter iter; >> + >> + if (!lm->rev.bloom_filter_settings) >> + return 1; >> + >> + if (commit_graph_generation(origin) == GENERATION_NUMBER_INFINITY) >> + return 1; > > Hm, okay, so here we require generation numbers to exist. Why is that > though? Shouldn't we only care about bloom filters? I don't quite get > that part yet. That's a good question. Because we're above ignoring the return value of `generation_numbers_enabled()` we shouldn't rely on generation numbers. I verified things and did some testing and it seems to me we can safely remove this condition. >> + filter = get_bloom_filter(lm->rev.repo, origin); >> + if (!filter) >> + return 1; >> + >> + hashmap_for_each_entry(&lm->paths, &iter, ent, hashent) { >> + if (bloom_filter_contains(filter, &ent->key, >> + lm->rev.bloom_filter_settings)) >> + return 1; >> + } >> + return 0; >> +} >> + > > Okay, and here we check whether any of our desired paths may be > contained in the bloom filter. > >> int last_modified_run(struct last_modified *lm, last_modified_callback cb, void *cbdata) >> { >> struct last_modified_callback_data data; >> @@ -199,6 +240,9 @@ int last_modified_run(struct last_modified *lm, last_modified_callback cb, void >> if (!data.commit) >> break; >> >> + if (!maybe_changed_path(lm, data.commit)) >> + continue; > > If there either are no bloom filters or in case none of them contain our > commit we can safely skip over the commit indeed. Otherwise we'll have > to check whether the commit really is interesting. > > Makes sense. > > Patrick > -- Cheers, Toon