On Wed, Jul 16, 2025 at 03:35:15PM +0200, Toon Claes wrote: > Our 'git last-modified' performs a revision walk, and computes a diff at > each point in the walk to figure out whether a given revision changed > any of the paths it considers interesting. > > When changed-path Bloom filters are available, we can avoid computing > many such diffs. Before computing a diff, we first check if any of the > remaining paths of interest were possibly changed at a given commit by > consulting its Bloom filter. If any of them are, we are resigned to > compute the diff. > > If none of those queries returned "maybe", we know that the given commit > doesn't contain any changed paths which are interesting to us. So, we > can avoid computing it in this case. > > Comparing the perf test results on git.git: > > Test HEAD~ HEAD > ------------------------------------------------------------------------------------ > 8020.1: top-level last-modified 4.49(4.34+0.11) 2.22(2.05+0.09) -50.6% > 8020.2: top-level recursive last-modified 5.64(5.45+0.11) 5.62(5.30+0.11) -0.4% > 8020.3: subdir last-modified 0.11(0.06+0.04) 0.07(0.03+0.04) -36.4% As an aside on 8020.3 (that I probably should have mentioned in the last commit), I think that our "| head -n1" heuristic for picking a sub-tree is skewing these results down. In git.git, the lexicographically earliest sub-tree is ".github", which is awfully tiny. I wonder if we should be grabbing the *last* sub-tree, or maybe the largest one by count of entries? > @@ -40,6 +43,12 @@ struct last_modified { > > static void last_modified_release(struct last_modified *lm) > { > + struct hashmap_iter iter; > + struct last_modified_entry *ent; > + > + hashmap_for_each_entry(&lm->paths, &iter, ent, hashent) > + clear_bloom_key(&ent->key); > + I did a double-take to make sure that ent->key would always be initialized here, but it is thanks to the FLEX_ALLOC_STR() call below. > hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent); > release_revisions(&lm->rev); > } > @@ -67,6 +76,9 @@ static void add_path_from_diff(struct diff_queue_struct *q, > > FLEX_ALLOC_STR(ent, path, path); > oidcpy(&ent->oid, &p->two->oid); > + if (lm->rev.bloom_filter_settings) > + fill_bloom_key(path, strlen(path), &ent->key, > + lm->rev.bloom_filter_settings); > hashmap_entry_init(&ent->hashent, strhash(ent->path)); > hashmap_add(&lm->paths, &ent->hashent); > } > @@ -126,6 +138,7 @@ static void mark_path(const char *path, const struct object_id *oid, > data->callback(path, data->commit, data->callback_data); > > hashmap_remove(data->paths, &ent->hashent, path); > + clear_bloom_key(&ent->key); OK, we're calling clear_bloom_key() here, too, but it uses FREE_AND_NULL(), so calling it again in last_modified_release() may be a noop, but won't ever be a double-free. > @@ -238,6 +276,13 @@ static int last_modified_init(struct last_modified *lm, struct repository *r, > return argc; > } > > + /* > + * We're not interested in generation numbers here, > + * but calling this function to prepare the commit-graph. > + */ > + (void)generation_numbers_enabled(lm->rev.repo); > + lm->rev.bloom_filter_settings = get_bloom_filter_settings(lm->rev.repo); Hmmph. I think when I originally wrote this I was using the side-effect of calling generation_numbers_enabled() as a hack. But I think that it may be worth making "prepare_commit_graph()" a non-static function and calling that instead. Thanks, Taylor