Re: [PATCH RFC v2 4/5] last-modified: implement faster algorithm

Patrick Steinhardt <ps@xxxxxx> · Tue, 27 May 2025 12:39:48 +0200

On Fri, May 23, 2025 at 11:33:51AM +0200, Toon Claes wrote:
> The current implementation of 'git last-modified' works by doing a
> revision walk, and inspecting the diff at each level of that walk to
> annotate the to-be-found entries to a path. In other words, if the diff
> at some level touches a path which has not yet been associated with a
> commit, then that commit becomes associated with the path.

It's a bit funny that we first introduce an algorithm, only to change it
in a subsequent step again. I don't mind it much though: this variant
here is quite a bit more complicated, and it's nice to first gain an
understanding of how the simple algorithm works before going to the
harder one.

It's also nice that the tests prove that this is indeed leads to the
same result.

[snip]
> More specifically, consider a priority queue of commits sorted by
> generation number. First, enqueue the set of boundary commits with all
> paths in the original spec marked as interesting.
> 
> Then, while the queue is not empty, do the following:
> 
>   1. Pop an element, say, 'c', off of the queue, making sure that 'c'
>      isn't reachable by anything in the '--not' set.
> 
>   2. For each parent 'p' (with index 'parent_i') of 'c', do the
>      following:
> 
>      a. Compute the diff between 'c' and 'p'.
>      b. Pass any active paths that are TREESAME from 'c' to 'p'.
>      c. If 'p' has any active paths, push it onto the queue.

What if an active path is changed on both sides of a merge commit? Do we
pass it to the first parent?

[snip]
> Now, some performance numbers. On github/git, our numbers look like the
> following (all wall-clock times best-of-five, and with '--max-depth=0'
> on the root):

This option does not exist in this version of git-last-modified(1).

>                  github		ttaylorr/blame-tree-fast
>    with filters: 0.754s		0.271s (2.78x faster, 6.18x overall)
> without filters: 1.676s		1.056s (1.58x faster)
> 
> and on torvalds/linux:
> 
>                  github		ttaylorr/blame-tree-fast
>    with filters: 0.608		0.062 (9.81x faster, ~52x overall)
> without filters: 3.251		0.676 (4.81x faster)
> 
> In short, the existing implementation is comparably fast *with* filters
> as the new implementation is *without* filters. So, most repositories
> should get a dramatic speed-up by just deploying this (even without
> computing Bloom filters), and all repositories should get faster still
> when computing Bloom filters.

It would be nice to introduce "filters" as "Bloom filters". I was
initially wondering what filters you talk about until you then mention
it in the last sentence.

> diff --git a/last-modified.c b/last-modified.c
> index f628434929..0a0818cdf1 100644
> --- a/last-modified.c
> +++ b/last-modified.c
> @@ -3,18 +3,20 @@
>  #include "commit.h"
>  #include "diffcore.h"
>  #include "diff.h"
> -#include "object.h"
>  #include "revision.h"
>  #include "repository.h"
>  #include "log-tree.h"
>  #include "dir.h"
>  #include "commit-graph.h"
>  #include "bloom.h"
> +#include "prio-queue.h"
> +#include "commit-slab.h"

It would be nice if we could keep these lexicographically sorted right
from the first commit.

> @@ -116,6 +128,20 @@ void last_modified_release(struct last_modified *lm)
>  	}
>  	hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
>  	release_revisions(&lm->rev);
> +	free(lm->all_paths);
> +}
> +
> +struct commit_active_paths {
> +	char *active;
> +	int nr;

Should this be a `size_t` as it is counting something?

> +};

Hm, a bit weird, as I don't kno what `nr` is supposed to stand for.
Intuitively I would have expected that `active` is an array of strings,
and that `nr` tracks how many there are. But that's not the case.

Let's read on.

> +define_commit_slab(active_paths, struct commit_active_paths);
> +static struct active_paths active_paths;
> +
> +static void free_one_active_path(struct commit_active_paths *active)

s/free_one_active_path/commit_active_paths_release/

> @@ -197,7 +230,32 @@ static void last_modified_diff(struct diff_queue_struct *q,
>  	}
>  }
>  
> -static int maybe_changed_path(struct last_modified *lm, struct commit *origin)
> +static char *scratch;

Having a global variable like this in a library is not great. Can we
instead pass around a context or something like this internally?

> +
> +static void pass_to_parent(struct commit_active_paths *c,
> +			   struct commit_active_paths *p,
> +			   int i)
> +{
> +	c->active[i] = 0;
> +	c->nr--;
> +	p->active[i] = 1;
> +	p->nr++;

Okay, so `active` is a bitfield. It's a bit weird that we use a full
byte for each bit though. It might not matter much in practice, but it
feels quite wasteful to me. Doubly so because we allocate this bitfield
for every commit we visit.

Can we maybe instead use `struct bitmap` for this?

> +}
> +
> +#define PARENT1 (1u<<16) /* used instead of SEEN */
> +#define PARENT2 (1u<<17) /* used instead of BOTTOM, BOUNDARY */

These are the same definitions as in "commit-reach.c". Might be worth it
to deduplicate those.

> @@ -221,8 +281,88 @@ static int maybe_changed_path(struct last_modified *lm, struct commit *origin)
>  	return 0;
>  }
>  
> +static int process_parent(struct last_modified *lm, struct prio_queue *queue,
> +			  struct commit *c,
> +			  struct commit_active_paths *active_c,
> +			  struct commit *parent, int parent_i)
> +{
> +	int i, ret = 0; // TODO type & for loop var

This looks like a left-over comment that should be addressed.

Patrick