Re: [PATCH v6 4/4] last-modified: use Bloom filters when available

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Patrick Steinhardt <ps@xxxxxx> writes:

> On Wed, Jul 30, 2025 at 07:55:10PM +0200, Toon Claes wrote:
>> Our 'git last-modified' performs a revision walk, and computes a diff at
>> each point in the walk to figure out whether a given revision changed
>> any of the paths it considers interesting.
>> 
>> When changed-path Bloom filters are available, we can avoid computing
>> many such diffs. Before computing a diff, we first check if any of the
>> remaining paths of interest were possibly changed at a given commit by
>> consulting its Bloom filter. If any of them are, we are resigned to
>> compute the diff.
>> 
>> If none of those queries returned "maybe", we know that the given commit
>> doesn't contain any changed paths which are interesting to us. So, we
>> can avoid computing it in this case.
>> 
>> Comparing the perf test results on git.git:
>> 
>>     Test                                        HEAD~             HEAD
>>     ------------------------------------------------------------------------------------
>>     8020.1: top-level last-modified             4.49(4.34+0.11)   2.22(2.05+0.09) -50.6%
>>     8020.2: top-level recursive last-modified   5.64(5.45+0.11)   5.62(5.30+0.11) -0.4%
>>     8020.3: subdir last-modified                0.11(0.06+0.04)   0.07(0.03+0.04) -36.4%
>
> Nice results.
>
>> diff --git a/builtin/last-modified.c b/builtin/last-modified.c
>> index e4c73464c7..19bf25f8a5 100644
>> --- a/builtin/last-modified.c
>> +++ b/builtin/last-modified.c
>> @@ -179,6 +192,27 @@ static void last_modified_diff(struct diff_queue_struct *q,
>>  	}
>>  }
>>  
>> +static int maybe_changed_path(struct last_modified *lm, struct commit *origin)
>> +{
>> +	struct bloom_filter *filter;
>> +	struct last_modified_entry *ent;
>> +	struct hashmap_iter iter;
>> +
>> +	if (!lm->rev.bloom_filter_settings)
>> +		return 1;
>> +
>> +	filter = get_bloom_filter(lm->rev.repo, origin);
>> +	if (!filter)
>> +		return 1;
>> +
>> +	hashmap_for_each_entry(&lm->paths, &iter, ent, hashent) {
>> +		if (bloom_filter_contains(filter, &ent->key,
>> +					  lm->rev.bloom_filter_settings))
>> +			return 1;
>> +	}
>> +	return 0;
>> +}
>
> This function is basically the same as `maybe_changed_paths()` in
> "blame.c", but that isn't a huge issue from my point of view. What makes
> me wonder though is why we have an additional check over there for
> whether or not the commit has a valid generation number.

I've been asking me the same question. And I couldn't find a good reason
(neither from the commit history, or from my reasoning). This check was
in the version shared by Taylor, but because we were ignoring the return
value from generation_numbers_enabled() in that version, it didn't make
sense to me to do this check. That's why I removed it.

>> @@ -227,6 +264,9 @@ static int last_modified_init(struct last_modified *lm, struct repository *r,
>>  		return argc;
>>  	}
>>  
>> +	prepare_commit_graph(lm->rev.repo);
>> +	lm->rev.bloom_filter_settings = get_bloom_filter_settings(lm->rev.repo);
>> +
>
> So this here is why we export `prepare_commit_graph()`. How about we
> instead expose `bloom_filters_enabled()` that mirrors what we do in
> `generation_numbers_enabled()` and `corrected_commit_dates_enabled()`?
> That would both be on a higher level and do exactly what we want to
> achieve.

I've got another proposal, what if we let get_bloom_filter_settings()
call prepare_commit_graph()? Functions like
repo_find_commit_pos_in_graph() and lookup_commit_in_graph() do this
too. 

-- 
Cheers,
Toon




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux