Re: [PATCH v5 3/6] last-modified: use Bloom filters when available

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Jul 16, 2025 at 03:35:15PM +0200, Toon Claes wrote:
> Our 'git last-modified' performs a revision walk, and computes a diff at
> each point in the walk to figure out whether a given revision changed
> any of the paths it considers interesting.
>
> When changed-path Bloom filters are available, we can avoid computing
> many such diffs. Before computing a diff, we first check if any of the
> remaining paths of interest were possibly changed at a given commit by
> consulting its Bloom filter. If any of them are, we are resigned to
> compute the diff.
>
> If none of those queries returned "maybe", we know that the given commit
> doesn't contain any changed paths which are interesting to us. So, we
> can avoid computing it in this case.
>
> Comparing the perf test results on git.git:
>
>     Test                                        HEAD~             HEAD
>     ------------------------------------------------------------------------------------
>     8020.1: top-level last-modified             4.49(4.34+0.11)   2.22(2.05+0.09) -50.6%
>     8020.2: top-level recursive last-modified   5.64(5.45+0.11)   5.62(5.30+0.11) -0.4%
>     8020.3: subdir last-modified                0.11(0.06+0.04)   0.07(0.03+0.04) -36.4%

As an aside on 8020.3 (that I probably should have mentioned in the last
commit), I think that our "| head -n1" heuristic for picking a sub-tree
is skewing these results down. In git.git, the lexicographically
earliest sub-tree is ".github", which is awfully tiny. I wonder if we
should be grabbing the *last* sub-tree, or maybe the largest one by
count of entries?

> @@ -40,6 +43,12 @@ struct last_modified {
>
>  static void last_modified_release(struct last_modified *lm)
>  {
> +	struct hashmap_iter iter;
> +	struct last_modified_entry *ent;
> +
> +	hashmap_for_each_entry(&lm->paths, &iter, ent, hashent)
> +		clear_bloom_key(&ent->key);
> +

I did a double-take to make sure that ent->key would always be
initialized here, but it is thanks to the FLEX_ALLOC_STR() call below.

>  	hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
>  	release_revisions(&lm->rev);
>  }
> @@ -67,6 +76,9 @@ static void add_path_from_diff(struct diff_queue_struct *q,
>
>  		FLEX_ALLOC_STR(ent, path, path);
>  		oidcpy(&ent->oid, &p->two->oid);
> +		if (lm->rev.bloom_filter_settings)
> +			fill_bloom_key(path, strlen(path), &ent->key,
> +				       lm->rev.bloom_filter_settings);
>  		hashmap_entry_init(&ent->hashent, strhash(ent->path));
>  		hashmap_add(&lm->paths, &ent->hashent);
>  	}
> @@ -126,6 +138,7 @@ static void mark_path(const char *path, const struct object_id *oid,
>  		data->callback(path, data->commit, data->callback_data);
>
>  	hashmap_remove(data->paths, &ent->hashent, path);
> +	clear_bloom_key(&ent->key);

OK, we're calling clear_bloom_key() here, too, but it uses
FREE_AND_NULL(), so calling it again in last_modified_release() may be a
noop, but won't ever be a double-free.

> @@ -238,6 +276,13 @@ static int last_modified_init(struct last_modified *lm, struct repository *r,
>  		return argc;
>  	}
>
> +	/*
> +	 * We're not interested in generation numbers here,
> +	 * but calling this function to prepare the commit-graph.
> +	 */
> +	(void)generation_numbers_enabled(lm->rev.repo);
> +	lm->rev.bloom_filter_settings = get_bloom_filter_settings(lm->rev.repo);

Hmmph. I think when I originally wrote this I was using the side-effect
of calling generation_numbers_enabled() as a hack. But I think that it
may be worth making "prepare_commit_graph()" a non-static function and
calling that instead.

Thanks,
Taylor




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux