Re: [PATCH v2] bloom: enable bloom filter with wildcard pathspec in revision traversal

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Lidong Yan <yldhome2d2@xxxxxxxxx> writes:

>  static int forbid_bloom_filters(struct pathspec *spec)
>  {
> -	if (spec->has_wildcard)
> -		return 1;
> -	if (spec->magic & ~PATHSPEC_LITERAL)
> +	unsigned int allowed_magic =
> +		PATHSPEC_FROMTOP |
> +		PATHSPEC_MAXDEPTH |
> +		PATHSPEC_LITERAL |
> +		PATHSPEC_GLOB |
> +		PATHSPEC_ATTR;
> +
> +	if (spec->magic & ~allowed_magic)
>  		return 1;
>  	for (size_t nr = 0; nr < spec->nr; nr++)
> -		if (spec->items[nr].magic & ~PATHSPEC_LITERAL)
> +		if (spec->items[nr].magic & ~allowed_magic)
>  			return 1;

Quite straight-forward and easy to see that this is a simple
enhancement of the existing code.  Good.

> @@ -693,9 +698,22 @@ static int convert_pathspec_to_bloom_keyvec(struct bloom_keyvec **out,
>  	size_t len;
>  	int res = 0;
>  
> +	len = pi->nowildcard_len;
> +	if (len != pi->len) {
> +		/*
> +		 * for path like "/dir/file*", nowildcard part would be
> +		 * "/dir/file", but only "/dir" should be used for the

Leading "/" makes it look as if the pathspec element can begin with
a slash, but it can not, can it?

> +		 * bloom filter
> +		 */
> +		while (len > 0 && pi->match[len - 1] != '/')
> +			len--;

In a tree that has both "builtin/" directory and "builtin.h" file,
what pathspec_element do we get here when we run

	$ git log "builtin*"

We need to be able to catch commits that touch anything in
"builtin/" and also anything at the top-level that begins with
"builtin", like "builtin.h" and other things that might have existed
ever in the history.  I think len goes down to 0 and that is the
correct behaviour.

The code does deal with the case where len is reduced down to zero,
but a bit poorly.  It would allocate a zero-length string, then
realize it frees it, and returns -1.  As it must know how long the
resulting path is before it allocated, and by definition len is
pi->len if it did not have to allocate, it should be able to take
the "goto cleanup" code path before it even attempts to allocate,
and it should not have to do strlen().  But the current code is not
incorrect.

> +	}
>  	/* remove single trailing slash from path, if needed */
> -	if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
> -		path_alloc = xmemdupz(pi->match, pi->len - 1);
> +	if (len > 0 && pi->match[len - 1] == '/')
> +		len--;
> +
> +	if (len != pi->len) {
> +		path_alloc = xmemdupz(pi->match, len);
>  		path = path_alloc;
>  	} else
>  		path = pi->match;

> +test_expect_success 'git log with paths all contain non-wildcard part uses Bloom filter' '
> +	test_bloom_filters_used "-- A/\* file4" &&

We'd ask the Bloom filter: skip commits that you know that can never
be touching either "A/(anything)" or "file4".

> +	test_bloom_filters_used "-- A/file\*" &&

We'd ask the Bloom filter: skip commits that you know that can never
be touching either "A/".

> +	test_bloom_filters_used "-- * A/\*"

What do we ask?  The second one says that a commit that might touch
"A/(something)" is worth investigating, but what about the first
one?  

Ahhh, that one is not quoted, so the shell expands to existing files
(which presumably do not have any wildcard characters).  OK.  If the
lone * were quoted, we shouldn't be using the Bloom filter.

> +'
> +
> +test_expect_success 'git log with path only contains wildcard part does not use Bloom filter' '
>  	test_bloom_filters_not_used "-- file\*" &&

OK, this is exactly the case I wondered about in the above wrt "builtin*"
and I agree with the expectation of this test.

> -	test_bloom_filters_not_used "-- A/\* file4" &&
> -	test_bloom_filters_not_used "-- file4 A/\*" &&
> -	test_bloom_filters_not_used "-- * A/\*"
> +	test_bloom_filters_not_used "-- file\* A/\*" &&

This one I understand.  The first one reduces len down to 0 in the
wildcard stripping loop, and makes us say "a commit that might touch
any path is worth investigating", which amounts to the same thing as
not using the Bloom filter at all.

> +	test_bloom_filters_not_used "-- file\* *" &&

Ditto.  Even if the unquoted * may expand to many concrete paths,
"file\*" that can match any path that begins with "file" that ever
have existed in the history would not help skipping any commit.

> +	test_bloom_filters_not_used "-- \*"

Ditto.

> +'
> +
> +test_expect_success 'git log with path contains various magic signatures' '
> +	cd A &&
> +	test_bloom_filters_used "-- \:\(top\)B" &&
> +	cd .. &&
> +
> +	test_bloom_filters_used "-- \:\(glob\)A/\*\*/C" &&
> +	test_bloom_filters_not_used "-- \:\(icase\)FILE4" &&
> +	test_bloom_filters_not_used "-- \:\(exclude\)A/B/C" &&
> +
> +	test_when_finished "rm -f .gitattributes" &&
> +	cat >.gitattributes <<-EOF &&
> +	A/file1 text
> +	A/B/file2 -text
> +	EOF
> +	test_bloom_filters_used "-- \:\(attr\:text\)A"
>  '

OK.  Good to see negative test cases here, which we sometimes forget
to test when showing off our shiny new toy.

Taking what I suggested above, here is a possible improvement.

 revision.c | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git i/revision.c w/revision.c
index 2a5b98390e..2a92bdda84 100644
--- i/revision.c
+++ w/revision.c
@@ -696,14 +696,14 @@ static int convert_pathspec_to_bloom_keyvec(struct bloom_keyvec **out,
 	char *path_alloc = NULL;
 	const char *path;
 	size_t len;
-	int res = 0;
+	int res = -1; /* be pessimistic */
 
 	len = pi->nowildcard_len;
 	if (len != pi->len) {
 		/*
-		 * for path like "/dir/file*", nowildcard part would be
-		 * "/dir/file", but only "/dir" should be used for the
-		 * bloom filter
+		 * for path like "dir/file*", nowildcard part would be
+		 * "dir/file", but only "dir" should be used for the
+		 * bloom filter.
 		 */
 		while (len > 0 && pi->match[len - 1] != '/')
 			len--;
@@ -712,19 +712,17 @@ static int convert_pathspec_to_bloom_keyvec(struct bloom_keyvec **out,
 	if (len > 0 && pi->match[len - 1] == '/')
 		len--;
 
+	if (!len)
+		goto cleanup;
+
 	if (len != pi->len) {
 		path_alloc = xmemdupz(pi->match, len);
 		path = path_alloc;
 	} else
 		path = pi->match;
 
-	len = strlen(path);
-	if (!len) {
-		res = -1;
-		goto cleanup;
-	}
-
 	*out = bloom_keyvec_new(path, len, settings);
+	res = 0;
 
 cleanup:
 	free(path_alloc);




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux