Re: [PATCH v4 3/4] bloom: replace struct bloom_key * with struct bloom_keyvec

Derrick Stolee <stolee@xxxxxxxxx> · Mon, 7 Jul 2025 07:35:42 -0400

On 7/4/2025 7:14 AM, Lidong Yan wrote:
> The revision traversal limited by pathspec has optimization when
> the pathspec has only one element. To support optimization for
> multiple pathspec items, we need to modify the data structures
> in struct rev_info.

You are correct that the revision-walking abandons bloom filters
when there are multiple pathspecs, and fixing this is a valuable
effort.

The need for this change is subtle and could use some extra context
to be sure reviewers understand:

This change is writing over some code that was created in
c525ce95b46 (commit-graph: check all leading directories in changed
path Bloom filters, 2020-07-01) to allow storing multiple bloom
keys during the revision walk. The multiple keys are focusing on
multiple path components of the literal pathspec. 
The point is that after the initialization, the bloom key array is
used directly as a filter for reporting TREESAME commits: a commit
is automatically reported as TREESAME to its first parent if any
bloom key results in a "No, definitely not changed" result with
that commit's bloom filter.

The reason we need a new data structure is that we need to adjust
the conditionals.

BEFORE: "NOT TREESAME if there EXISTS a bloom key that reports NO"

AFTER: "NOT TREESAME if FOR EVERY pathspec there EXISTS a bloom key
        that reports NO."

This "FOR EVERY" condition makes it impossible to use a flat array
of bloom keys for multiple pathspecs, justifying this change.

What is further confusing here is that we already have logic that
deals with arrays of bloom keys, so I expected that the vector was
the single structure storing a list of those arrays. Instead, the
vector is replacing the array itself. This is made clear by using
the vector immediately in the existing implementation.
> +struct bloom_keyvec *bloom_keyvec_new(size_t count)
> +{
> +	struct bloom_keyvec *vec;
> +	size_t sz = sizeof(struct bloom_keyvec);
> +	sz += count * sizeof(struct bloom_key);
> +	vec = (struct bloom_keyvec *)xcalloc(1, sz);
You could use CALLOC_ARRAY() to simplify this and drop
the 'sz' variable.

> +	vec->count = count;
> +	return vec;
> +}
> +
> +void bloom_keyvec_free(struct bloom_keyvec *vec)
> +{
> +	if (!vec)
> +		return;
> +	for (size_t nr = 0; nr < vec->count; nr++)
> +		bloom_key_clear(&vec->key[nr]);
> +	free(vec);
> +}
> +
>  static int pathmap_cmp(const void *hashmap_cmp_fn_data UNUSED,
>  		       const struct hashmap_entry *eptr,
>  		       const struct hashmap_entry *entry_or_key,
> @@ -541,6 +560,18 @@ int bloom_filter_contains(const struct bloom_filter *filter,
>  	return 1;
>  }
>  
> +int bloom_filter_contains_vec(const struct bloom_filter *filter,
> +			      const struct bloom_keyvec *vec,
> +			      const struct bloom_filter_settings *settings)
> +{
> +	int ret = 1;
> +
> +	for (size_t nr = 0; ret > 0 && nr < vec->count; nr++)
> +		ret = bloom_filter_contains(filter, &vec->key[nr], settings);
> +
> +	return ret;
> +}

This implementation is where the subtle detail comes in. Might be worth
a comment to say "if any key in this list is not contained in the filter,
then the filter doesn't match this vector."