Re: [PATCH 0/2] bloom: use bloom filter given multiple pathspec

Junio C Hamano <gitster@xxxxxxxxx> · Wed, 25 Jun 2025 10:32:10 -0700

Lidong Yan <yldhome2d2@xxxxxxxxx> writes:

> git won't use bloom filter for multiple pathspec, which makes the command

Let's get the terminology straight.  A pathspec consists of one or
more pathspec elements (or pathspec items).

Also, "git won't" is overly general.  The series title shares the
same issue ("given multiple pathspec" does not even hint that this
is about revision traversal---you are not making filter used with
pathspec with more than one element in other code paths).

Perhaps like:

    The revision traversal limited by pathspec has optimization when
    the pathspec has only one element, it does not use any pathspec
    magic (other than literal), and there is no wildcard.

    While it is much harder to lift the latter two limitations,
    supporting a pathspec with multiple elements is relatively easy.
    Just make sure we hash each of them separately and ask the bloom
    filter about them, and if we see none of them can possibly be
    affected by the commit, we can skip without tree comparison.

or something along that line?

>   git log -- file1 file2
> significantly slower than
>   git log -- file1 && git log -- file2
>
> This issue is raised by Kai Koponen at
>   https://lore.kernel.org/git/CADYQcGqaMC=4jgbmnF9Q11oC11jfrqyvH8EuiRRHytpMXd4wYA@xxxxxxxxxxxxxx/
>
> To fix this, revs->bloom_keys[] needs to become an array of bloom_keys[],
> one for each literal pathspec element. For convenience, first commit
> creates a new struct bloom_keyvec to hold all bloom keys for a single
> pathspec. The second commit add for loop to check if any pathspec's keyvec
> is contained in a commit's bloom filter, along with code that initialize
> destory and test multiple pathspec bloom keyvecs.

It is nice to outline an approach to the solution one day, and see
an almost exact implementation of it appear on the list a few days
later.  I wish all the development would go like this ;-)

>  bloom.c              |  47 +++++++++++++++++
>  bloom.h              |  14 +++++
>  revision.c           | 121 ++++++++++++++++++++++++-------------------
>  revision.h           |   5 +-
>  t/t4216-log-bloom.sh |  10 ++--

Can we have a set of real tests to make sure that the updated filter
code still identifies commits that touch the files without false
negatives?  False positives are OK as we will follow them with real
tree comparison to determine what exactly got changed, but false
negatives are absolute no-no.

Testing to see that the filter code path is activated is much less
interesting than the filter code path still functions correctly with
these changes presented here.  I have a feeling that with the
changes to the test in this series, you wouldn't even find a bug
where you simply added subpaths for all pathspec elements into a
single array and use the original "bloom has to say 'possibly yes'
to all array elements" logic (which would incorrectly require that
both file1 and file2 must be modified).

Thanks.