Re: [PATCH v5 3/6] last-modified: use Bloom filters when available

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Taylor Blau <me@xxxxxxxxxxxx> writes:

> On Wed, Jul 16, 2025 at 03:35:15PM +0200, Toon Claes wrote:
>> Our 'git last-modified' performs a revision walk, and computes a diff at
>> each point in the walk to figure out whether a given revision changed
>> any of the paths it considers interesting.
>>
>> When changed-path Bloom filters are available, we can avoid computing
>> many such diffs. Before computing a diff, we first check if any of the
>> remaining paths of interest were possibly changed at a given commit by
>> consulting its Bloom filter. If any of them are, we are resigned to
>> compute the diff.
>>
>> If none of those queries returned "maybe", we know that the given commit
>> doesn't contain any changed paths which are interesting to us. So, we
>> can avoid computing it in this case.
>>
>> Comparing the perf test results on git.git:
>>
>>     Test                                        HEAD~             HEAD
>>     ------------------------------------------------------------------------------------
>>     8020.1: top-level last-modified             4.49(4.34+0.11)   2.22(2.05+0.09) -50.6%
>>     8020.2: top-level recursive last-modified   5.64(5.45+0.11)   5.62(5.30+0.11) -0.4%
>>     8020.3: subdir last-modified                0.11(0.06+0.04)   0.07(0.03+0.04) -36.4%
>
> As an aside on 8020.3 (that I probably should have mentioned in the last
> commit), I think that our "| head -n1" heuristic for picking a sub-tree
> is skewing these results down. In git.git, the lexicographically
> earliest sub-tree is ".github", which is awfully tiny. I wonder if we
> should be grabbing the *last* sub-tree, or maybe the largest one by
> count of entries?

Ah yes, that's automatable. Good suggestion.

>
>> @@ -40,6 +43,12 @@ struct last_modified {
>>
>>  static void last_modified_release(struct last_modified *lm)
>>  {
>> +	struct hashmap_iter iter;
>> +	struct last_modified_entry *ent;
>> +
>> +	hashmap_for_each_entry(&lm->paths, &iter, ent, hashent)
>> +		clear_bloom_key(&ent->key);
>> +
>
> I did a double-take to make sure that ent->key would always be
> initialized here, but it is thanks to the FLEX_ALLOC_STR() call below.
>
>>  	hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
>>  	release_revisions(&lm->rev);
>>  }
>> @@ -67,6 +76,9 @@ static void add_path_from_diff(struct diff_queue_struct *q,
>>
>>  		FLEX_ALLOC_STR(ent, path, path);
>>  		oidcpy(&ent->oid, &p->two->oid);
>> +		if (lm->rev.bloom_filter_settings)
>> +			fill_bloom_key(path, strlen(path), &ent->key,
>> +				       lm->rev.bloom_filter_settings);
>>  		hashmap_entry_init(&ent->hashent, strhash(ent->path));
>>  		hashmap_add(&lm->paths, &ent->hashent);
>>  	}
>> @@ -126,6 +138,7 @@ static void mark_path(const char *path, const struct object_id *oid,
>>  		data->callback(path, data->commit, data->callback_data);
>>
>>  	hashmap_remove(data->paths, &ent->hashent, path);
>> +	clear_bloom_key(&ent->key);
>
> OK, we're calling clear_bloom_key() here, too, but it uses
> FREE_AND_NULL(), so calling it again in last_modified_release() may be a
> noop, but won't ever be a double-free.
>
>> @@ -238,6 +276,13 @@ static int last_modified_init(struct last_modified *lm, struct repository *r,
>>  		return argc;
>>  	}
>>
>> +	/*
>> +	 * We're not interested in generation numbers here,
>> +	 * but calling this function to prepare the commit-graph.
>> +	 */
>> +	(void)generation_numbers_enabled(lm->rev.repo);
>> +	lm->rev.bloom_filter_settings = get_bloom_filter_settings(lm->rev.repo);
>
> Hmmph. I think when I originally wrote this I was using the side-effect
> of calling generation_numbers_enabled() as a hack. But I think that it
> may be worth making "prepare_commit_graph()" a non-static function and
> calling that instead.

Well thanks for confirming this assumption. I wasn't 100%, and I agree
having a non-static "prepare_commit_graph()" would be better.

Somewhat related to that. In your branch you have this line in
"maybe_changed_path()": 

	if (commit_graph_generation(origin) == GENERATION_NUMBER_INFINITY)
		return 1;

I excluded this line because if we ignore the result of
"generation_numbers_enabled()", why would it matter to have a generation
number?

The thing is "maybe_changed_path()" in blame.c also has this line too,
but unfortunately git-blaming that line didn't learn me anything why
it's there. Do you have any idea?

-- 
Cheers,
Toon




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux