Re: [PATCH RFC v2 3/5] last-modified: use Bloom filters when available

Patrick Steinhardt <ps@xxxxxx> · Tue, 27 May 2025 12:40:02 +0200

On Fri, May 23, 2025 at 11:33:50AM +0200, Toon Claes wrote:
> Our 'git last-modified' performs a revision walk, and computes a diff at
> each point in the walk to figure out whether a given revision changed
> any of the paths it considers interesting.
> 
> When changed-path Bloom filters are available, we can avoid computing
> many such diffs. Before computing a diff, we first check if any of the
> remaining paths of interest were possibly changed at a given commit by
> consulting its Bloom filter. If any of them are, we are resigned to
> compute the diff.
> 
> If none of those queries returned "maybe", we know that the given commit
> doesn't contain any changed paths which are interesting to us. So, we
> can avoid computing it in this case.
> 
> This results in a substantial performance speed-up in common cases of
> 'git last-modified'. In the kernel, here is the before and after (all
> times computed with best-of-five):
> 
> With commit-graphs (but no Bloom filters):
> 
>     real	0m5.133s
>     user	0m4.942s
>     sys	0m0.180s
> 
> ...and with Bloom filters:
> 
>     real	0m0.936s
>     user	0m0.842s
>     sys	0m0.092s
> 
> These times are with my development-version of Git, so it's compiled
> without optimizations. Compiling instead with `-O3`, the results look
> even better:
> 
>     real	0m0.754s
>     user	0m0.661s
>     sys	0m0.092s

I'm sure that the old state without bloom filters will also improve a
bit?

> Signed-off-by: Toon Claes <toon@xxxxxxxxx>
> ---
>  last-modified.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 44 insertions(+)
> 
> diff --git a/last-modified.c b/last-modified.c
> index 9283f8fcae..f628434929 100644
> --- a/last-modified.c
> +++ b/last-modified.c
> @@ -92,12 +99,21 @@ void last_modified_init(struct last_modified *lm,
>  	if (setup_revisions(argc, argv, &lm->rev, NULL) > 1)
>  		die(_("unknown last-modified argument: %s"), argv[1]);
>  
> +	(void)generation_numbers_enabled(lm->rev.repo);

Why the `(void)` cast? And why even call this in the first place? This
definitely needs a comment and smells like funky design in our commit
graph subsystem where we rely on side effects of one function to leak
into a different function.

> +	lm->rev.bloom_filter_settings = get_bloom_filter_settings(lm->rev.repo);
> +
>  	if (add_from_revs(lm) < 0)
>  		die(_("unable to setup last-modified"));
>  }
>  
>  void last_modified_release(struct last_modified *lm)
>  {
> +	struct hashmap_iter iter;
> +	struct last_modified_entry *ent;
> +
> +	hashmap_for_each_entry(&lm->paths, &iter, ent, hashent) {
> +		clear_bloom_key(&ent->key);
> +	}

The curly braces shouldn't be needed.

> @@ -180,6 +197,30 @@ static void last_modified_diff(struct diff_queue_struct *q,
>  	}
>  }
>  
> +static int maybe_changed_path(struct last_modified *lm, struct commit *origin)
> +{
> +	struct bloom_filter *filter;
> +	struct last_modified_entry *ent;
> +	struct hashmap_iter iter;
> +
> +	if (!lm->rev.bloom_filter_settings)
> +		return 1;
> +
> +	if (commit_graph_generation(origin) == GENERATION_NUMBER_INFINITY)
> +		return 1;

Hm, okay, so here we require generation numbers to exist. Why is that
though? Shouldn't we only care about bloom filters? I don't quite get
that part yet.

> +	filter = get_bloom_filter(lm->rev.repo, origin);
> +	if (!filter)
> +		return 1;
> +
> +	hashmap_for_each_entry(&lm->paths, &iter, ent, hashent) {
> +		if (bloom_filter_contains(filter, &ent->key,
> +					  lm->rev.bloom_filter_settings))
> +			return 1;
> +	}
> +	return 0;
> +}
> +

Okay, and here we check whether any of our desired paths may be
contained in the bloom filter.

>  int last_modified_run(struct last_modified *lm, last_modified_callback cb, void *cbdata)
>  {
>  	struct last_modified_callback_data data;
> @@ -199,6 +240,9 @@ int last_modified_run(struct last_modified *lm, last_modified_callback cb, void
>  		if (!data.commit)
>  			break;
>  
> +		if (!maybe_changed_path(lm, data.commit))
> +			continue;

If there either are no bloom filters or in case none of them contain our
commit we can safely skip over the commit indeed. Otherwise we'll have
to check whether the commit really is interesting.

Makes sense.

Patrick