Re: [PATCH 1/3] commit: convert pop_most_recent_commit() to prio_queue

René Scharfe <l.s.r@xxxxxx> · Wed, 16 Jul 2025 11:39:49 +0200

On 7/16/25 7:05 AM, Jeff King wrote:
> On Tue, Jul 15, 2025 at 04:51:07PM +0200, René Scharfe wrote:
> 
>> pop_most_recent_commit() calls commit_list_insert_by_date(), which and
>> is itself called in a loop, which can lead to quadratic complexity.
>> Replace the commit_list with a prio_queue to ensure logarithmic worst
>> case complexity and convert all three users.
> 
> I guess I'm cc'd because of my frequent complains about the quadratic
> nature of our commit lists? :)

And because you introduced prio_queue.

> Mostly I asked because I had to look at pop_most_recent_commit() to see
> what operation would be made faster here. Looks like it's mostly ":/",
> but maybe also fetch's mark_recent_complete_commits()? I guess we might
> hit that if you have a huge number of refs?

The :/ handling was the easiest to test for me.  fetch_pack() and
walker_fetch() require some server side to set up, which seems not worth
it just to demonstrate quadratic behavior.  Having thousands of refs
would make the list long enough to notice, as would having lots of
merges.

My general idea is to get rid of commit_list_insert_by_date() eventually
to avoid quadratic complexity.

> I actually have a series turning rev_info.commits into a prio_queue
> which I need to polish up (mostly just writing commit messages; I've
> been running with it for almost 2 years without trouble). Ironically it
> does not touch this spot, as these commit lists are formed on their own.

That is not a coincidence.  I had a look at that series and tried to
reach its goals while keeping rev_info.commits a commit_list.  Why?
Mostly being vaguely uncomfortable with prio_queue' memory overhead,
lack of type safety and dual use as a stack.  I still used it, but only
as local variable, not in the central struct rev_info.

Anyway, I failed; revision.c::get_revision_1() took an 8% performance
hit in my versions and none in yours, and I couldn't figure out why.
Perhaps I should revisit it with the new prio_queue_replace() in hand,
hmm..  For now I try to salvage the commit_list_insert_by_date()
replacement work outside of revision.c from that effort. 

> The patch itself looks reasonable. I think here:
> 
>> @@ -1461,7 +1462,7 @@ static int get_oid_oneline(struct repository *r,
>>  			   const char *prefix, struct object_id *oid,
>>  			   const struct commit_list *list)
>>  {
>> -	struct commit_list *copy = NULL, **copy_tail = &copy;
>> +	struct prio_queue copy = { compare_commits_by_commit_date };
>>  	const struct commit_list *l;
>>  	int found = 0;
>>  	int negative = 0;
>> @@ -1483,9 +1484,9 @@ static int get_oid_oneline(struct repository *r,
>>  
>>  	for (l = list; l; l = l->next) {
>>  		l->item->object.flags |= ONELINE_SEEN;
>> -		copy_tail = &commit_list_insert(l->item, copy_tail)->next;
>> +		prio_queue_put(&copy, l->item);
>>  	}
>> -	while (copy) {
>> +	while (copy.nr) {
>>  		const char *p, *buf;
>>  		struct commit *commit;
>>  		int matches;
> 
> our callers are always generating and passing in a list. So we could
> avoid the overhead of allocating the list in the first place by just
> taking a prio_queue. But maybe it gets weird with clearing the
> ONELINE_SEEN marks? We make a copy even in the current code so that we
> can call clear_commit_marks() on the complete set.
> 
> I guess we could add them to an array or something, but that probably
> ends up being roughly the same amount of work.
Right, if get_oid_oneline() gets handed a prio_queue, it would need to
copy all commits to a list or array for marks clearing at the end.  I
don't see a way around having both.  And I doubt there would be much of
a difference between list and array here, but could be convinced by
numbers. ;)

>> +build_history () {
>> +	local max_level="$1" &&
>> +	local level="${2:-1}" &&
>> +	local mark="${3:-1}" &&
>> +	if test $level -eq $max_level
>> +	then
>> +		echo "reset refs/heads/master" &&
>> +		echo "from $ZERO_OID" &&
>> +		echo "commit refs/heads/master" &&
>> +		echo "mark :$mark" &&
>> +		echo "committer C <c@xxxxxxxxxxx> 1234567890 +0000" &&
>> +		echo "data <<EOF" &&
>> +		echo "$mark" &&
>> +		echo "EOF"
>> +	else
>> +		local level1=$((level+1)) &&
>> +		local mark1=$((2*mark)) &&
>> +		local mark2=$((2*mark+1)) &&
>> +		build_history $max_level $level1 $mark1 &&
>> +		build_history $max_level $level1 $mark2 &&
>> +		echo "commit refs/heads/master" &&
>> +		echo "mark :$mark" &&
>> +		echo "committer C <c@xxxxxxxxxxx> 1234567890 +0000" &&
>> +		echo "data <<EOF" &&
>> +		echo "$mark" &&
>> +		echo "EOF" &&
>> +		echo "from :$mark1" &&
>> +		echo "merge :$mark2"
>> +	fi
>> +}
> 
> This took some brain cycles to decipher. It looks like we'll make
> 2^$level commits in a filled tree? It might be worth a brief comment
> describing the goal (and maybe even giving an example graph drawing for
> N=3 or something, though it gets out of hand quickly).

The goal is to have lots of merges, to make the list that
pop_most_recent_commit() works on long (each pop adds back two parents).

The script builds a binary tree of merges, with root commits (without
parents) as leaf nodes.  (The nomenclature is a bit weird because we
call the child nodes parent commits.)  So it creates 2^($level-1) root
commits and 2^($level-1)-1 two-way merges stacked on top.

    _1_
   /   \
  2     3
 / \   / \
4   5 6   7

The numbers are the marks; 1 to 3 are merges (have two parents), 4 to 7
are root commits.

>> +test_perf "rev-parse ':/$(cat needle)'" "
>> +	git rev-parse ':/$(cat needle)' >actual
>> +"
> 
> Hmm, usually we frown on putting snippets inside double-quotes because
> it's so easy to accidentally interpolate outside of the test_eval. But
> maybe this is short enough to be OK. I guess you did it here especially
> so that the title is a nice ":/65535" and not the opaque "$(cat
> needle)".
The first one allowed me to check whether the setup step created the
expected number of commits and seems kind of interesting for people
running the test.  The second one could be switched to be evaluated at
test_eval time, but that would make it harder to see that title and
test do the same.  We could strip out the quotes:

   test_perf "rev-parse :/$(cat needle)" '
           git rev-parse :/$(cat needle) >actual
   '

René