Re: [PATCH] commit: avoid parent list buildup in clear_commit_marks_many()

Patrick Steinhardt <ps@xxxxxx> · Wed, 12 Feb 2025 08:05:34 +0100

On Sun, Feb 09, 2025 at 11:41:15AM +0100, René Scharfe wrote:
> clear_commit_marks_1() clears the marks of the first parent and its
> first parent and so on, and saves the higher numbered parents in a list
> for later.  There is no benefit in keeping that list growing with each
> handled commit.  Clear it after each run to reduce peak memory usage.

Okay. So the issue here is that `clear_commit_marks_1()` only processes
the first-parent chain of commits, and any other commits will be added
to the `struct commit_list` backlog. Consequently, merge-heavy histories
are very likely to build up quite a backlog of non-first-parent commits.

> Signed-off-by: René Scharfe <l.s.r@xxxxxx>
> ---
>  commit.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/commit.c b/commit.c
> index 540660359d..6efdb03997 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -780,14 +780,14 @@ static void clear_commit_marks_1(struct commit_list **plist,
> 
>  void clear_commit_marks_many(size_t nr, struct commit **commit, unsigned int mark)
>  {
> -	struct commit_list *list = NULL;
> -
>  	for (size_t i = 0; i < nr; i++) {
> +		struct commit_list *list = NULL;
> +
>  		clear_commit_marks_1(&list, *commit, mark);
> +		while (list)
> +			clear_commit_marks_1(&list, pop_commit(&list), mark);
>  		commit++;
>  	}
> -	while (list)
> -		clear_commit_marks_1(&list, pop_commit(&list), mark);
>  }

And this is fixed by immediately processing all commits that we
currently have in the list. As `clear_commit_marks_1()` only processes
immediate children of the handed-in commit we know that it will have
processed the first parent, and `list` will contain remaining commits,
if any.

We also end up adding grandchildren to the list, so this change
essentially converts the algorithm from depth-first to breadth-first. I
bet we can construct cases where this will perform _worse_ than the
current algorithm, e.g. when you have branch thickets where every commit
is a merge: But I assume that for the most common cases this might be an
improvement indeed.

The question to me is: does this actually matter in the real world? It
would be nice to maybe get some numbers that demonstrate the improvement
in a repository.

Patrick