Re: [PATCH v4 3/4] bloom: replace struct bloom_key * with struct bloom_keyvec

Lidong Yan <502024330056@xxxxxxxxxxxxxxxx> · Mon, 7 Jul 2025 22:14:22 +0800

Derrick Stolee <stolee@xxxxxxxxx> wrote:
> 
> BEFORE: "NOT TREESAME if there EXISTS a bloom key that reports NO"
> 
> AFTER: "NOT TREESAME if FOR EVERY pathspec there EXISTS a bloom key
>        that reports NO."
> 
> This "FOR EVERY" condition makes it impossible to use a flat array
> of bloom keys for multiple pathspecs, justifying this change.
> 
> What is further confusing here is that we already have logic that
> deals with arrays of bloom keys, so I expected that the vector was
> the single structure storing a list of those arrays. Instead, the
> vector is replacing the array itself. This is made clear by using
> the vector immediately in the existing implementation.

I think the problem here is that it clearly enough in my comments.
When writing the code, I thought about converting the original
one-dimensional array struct bloom_key *keys into a two-dimensional
array struct bloom_key **. Then I came up with the idea that this
two-dimensional array could be designed as struct bloom_keyvec *keyvecs,
which might be clearer. Each struct bloom_keyvec would represent all the
bloom_key elements for a single pathspec item.

>> +struct bloom_keyvec *bloom_keyvec_new(size_t count)
>> +{
>> + struct bloom_keyvec *vec;
>> + size_t sz = sizeof(struct bloom_keyvec);
>> + sz += count * sizeof(struct bloom_key);
>> + vec = (struct bloom_keyvec *)xcalloc(1, sz);
> You could use CALLOC_ARRAY() to simplify this and drop
> the 'sz' variable.

You are right. I think you suggest to write struct bloom_keyvec like:

———————-
|        count        |
———————
|         *keys       |  ——>     key0 | key1 | key2 | … |
———————

And I am doing here makes struct bloom_keyvec looks like

———————
|       count        |
———————
|        key[0]      |
———————
|        key[1]      |
———————
|        …            |

Although bloom_keyvec_new() appears more complex, the advantage
is that bloom_keyvec_destroy() no longer needs to free keys manually.
And if I understand correctly, junio had suggested to use the second way
[here](https://lore.kernel.org/git/xmqqtt43u36t.fsf@gitster.g/).

> 
>> + vec->count = count;
>> + return vec;
>> +}
>> +
>> +void bloom_keyvec_free(struct bloom_keyvec *vec)
>> +{
>> + if (!vec)
>> + return;
>> + for (size_t nr = 0; nr < vec->count; nr++)
>> + bloom_key_clear(&vec->key[nr]);
>> + free(vec);
>> +}
>> +
>> static int pathmap_cmp(const void *hashmap_cmp_fn_data UNUSED,
>>       const struct hashmap_entry *eptr,
>>       const struct hashmap_entry *entry_or_key,
>> @@ -541,6 +560,18 @@ int bloom_filter_contains(const struct bloom_filter *filter,
>> return 1;
>> }
>> 
>> +int bloom_filter_contains_vec(const struct bloom_filter *filter,
>> +      const struct bloom_keyvec *vec,
>> +      const struct bloom_filter_settings *settings)
>> +{
>> + int ret = 1;
>> +
>> + for (size_t nr = 0; ret > 0 && nr < vec->count; nr++)
>> + ret = bloom_filter_contains(filter, &vec->key[nr], settings);
>> +
>> + return ret;
>> +}
> 
> This implementation is where the subtle detail comes in. Might be worth
> a comment to say "if any key in this list is not contained in the filter,
> then the filter doesn't match this vector."

I will add comment in the next version

Thank you for your review,
Lidong