Re: [PATCH RFC v2 3/5] last-modified: use Bloom filters when available

Toon Claes <toon@xxxxxxxxx> · Fri, 13 Jun 2025 13:05:43 +0200

Patrick Steinhardt <ps@xxxxxx> writes:

> On Fri, May 23, 2025 at 11:33:50AM +0200, Toon Claes wrote:
>> Our 'git last-modified' performs a revision walk, and computes a diff at
>> each point in the walk to figure out whether a given revision changed
>> any of the paths it considers interesting.
>> 
>> When changed-path Bloom filters are available, we can avoid computing
>> many such diffs. Before computing a diff, we first check if any of the
>> remaining paths of interest were possibly changed at a given commit by
>> consulting its Bloom filter. If any of them are, we are resigned to
>> compute the diff.
>> 
>> If none of those queries returned "maybe", we know that the given commit
>> doesn't contain any changed paths which are interesting to us. So, we
>> can avoid computing it in this case.
>> 
>> This results in a substantial performance speed-up in common cases of
>> 'git last-modified'. In the kernel, here is the before and after (all
>> times computed with best-of-five):
>> 
>> With commit-graphs (but no Bloom filters):
>> 
>>     real	0m5.133s
>>     user	0m4.942s
>>     sys	0m0.180s
>> 
>> ...and with Bloom filters:
>> 
>>     real	0m0.936s
>>     user	0m0.842s
>>     sys	0m0.092s
>> 
>> These times are with my development-version of Git, so it's compiled
>> without optimizations. Compiling instead with `-O3`, the results look
>> even better:
>> 
>>     real	0m0.754s
>>     user	0m0.661s
>>     sys	0m0.092s
>
> I'm sure that the old state without bloom filters will also improve a
> bit?

These are the benchmarks from the original commits I took over. They are
no longer really relevant, I'll remove them.

>> Signed-off-by: Toon Claes <toon@xxxxxxxxx>
>> ---
>>  last-modified.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 44 insertions(+)
>> 
>> diff --git a/last-modified.c b/last-modified.c
>> index 9283f8fcae..f628434929 100644
>> --- a/last-modified.c
>> +++ b/last-modified.c
>> @@ -92,12 +99,21 @@ void last_modified_init(struct last_modified *lm,
>>  	if (setup_revisions(argc, argv, &lm->rev, NULL) > 1)
>>  		die(_("unknown last-modified argument: %s"), argv[1]);
>>  
>> +	(void)generation_numbers_enabled(lm->rev.repo);
>
> Why the `(void)` cast? And why even call this in the first place? This
> definitely needs a comment and smells like funky design in our commit
> graph subsystem where we rely on side effects of one function to leak
> into a different function.

This function calls `prepare_commit_graph()` which I think is the
important side-effect. Let me add a comment. Or would you rather to see
a separate function?

>> +	lm->rev.bloom_filter_settings = get_bloom_filter_settings(lm->rev.repo);
>> +
>>  	if (add_from_revs(lm) < 0)
>>  		die(_("unable to setup last-modified"));
>>  }
>>  
>>  void last_modified_release(struct last_modified *lm)
>>  {
>> +	struct hashmap_iter iter;
>> +	struct last_modified_entry *ent;
>> +
>> +	hashmap_for_each_entry(&lm->paths, &iter, ent, hashent) {
>> +		clear_bloom_key(&ent->key);
>> +	}
>
> The curly braces shouldn't be needed.

Okay.

>> @@ -180,6 +197,30 @@ static void last_modified_diff(struct diff_queue_struct *q,
>>  	}
>>  }
>>  
>> +static int maybe_changed_path(struct last_modified *lm, struct commit *origin)
>> +{
>> +	struct bloom_filter *filter;
>> +	struct last_modified_entry *ent;
>> +	struct hashmap_iter iter;
>> +
>> +	if (!lm->rev.bloom_filter_settings)
>> +		return 1;
>> +
>> +	if (commit_graph_generation(origin) == GENERATION_NUMBER_INFINITY)
>> +		return 1;
>
> Hm, okay, so here we require generation numbers to exist. Why is that
> though? Shouldn't we only care about bloom filters? I don't quite get
> that part yet.

That's a good question. Because we're above ignoring the return value of
`generation_numbers_enabled()` we shouldn't rely on generation numbers.
I verified things and did some testing and it seems to me we can safely
remove this condition.

>> +	filter = get_bloom_filter(lm->rev.repo, origin);
>> +	if (!filter)
>> +		return 1;
>> +
>> +	hashmap_for_each_entry(&lm->paths, &iter, ent, hashent) {
>> +		if (bloom_filter_contains(filter, &ent->key,
>> +					  lm->rev.bloom_filter_settings))
>> +			return 1;
>> +	}
>> +	return 0;
>> +}
>> +
>
> Okay, and here we check whether any of our desired paths may be
> contained in the bloom filter.
>
>>  int last_modified_run(struct last_modified *lm, last_modified_callback cb, void *cbdata)
>>  {
>>  	struct last_modified_callback_data data;
>> @@ -199,6 +240,9 @@ int last_modified_run(struct last_modified *lm, last_modified_callback cb, void
>>  		if (!data.commit)
>>  			break;
>>  
>> +		if (!maybe_changed_path(lm, data.commit))
>> +			continue;
>
> If there either are no bloom filters or in case none of them contain our
> commit we can safely skip over the commit indeed. Otherwise we'll have
> to check whether the commit really is interesting.
>
> Makes sense.
>
> Patrick
>

-- 
Cheers,
Toon