Re: [PATCH 2/3] path-walk: fix setup of pending objects

Derrick Stolee <stolee@xxxxxxxxx> · Thu, 21 Aug 2025 16:33:21 -0400

On 8/21/2025 4:01 AM, Patrick Steinhardt wrote:
> On Wed, Aug 20, 2025 at 06:39:55PM +0000, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <stolee@xxxxxxxxx>
>>
>> The previous change established a buggy instance of 'git repack -adf
>> --path-walk' when there exist paths that are tracked in the index and
>> that is the only instance of those paths in the history of the
>> repository. This change fixes that bug.
>>
>> The core problem here is that the "maybe_interesting" member of 'struct
>> type_and_oid_list' is not initialized to '1'. This member was added in
>> 6333e7ae0b (path-walk: mark trees and blobs as UNINTERESTING,
>> 2024-12-20) in a way to help when creating packfiles for a small commit
>> range using the sparse path algorithm (enabled by pack.useSparse=true).
>>
>> The idea here is that the list is marked as "maybe_interesting" if an
>> object is added that does not have the UNINITERSTING flag on it. Later,
> 
> s/UNINITERSTING/UNINTERESTING/

Thanks!

>> this is checked again in case all objects in the list were marked
>> UNINTERESTING after that point in time. In this case, the algorithm
>> skips the list as there is no reason to visit it.
>>
>> This leads to the problem where the "maybe_interesting" member was not
>> appropriately initialized when the list is created from pending objects.
>> This is the fix for now.
>>
>> To help avoid this from happening in the future, a follow-up change will
>> make initializing lists use a shared method instead of allowing for an
>> update to this initialization process to miss some existing copies.
> 
> Yeah, I wanted to say that this feels quite fragile to me and very easy
> to miss. Does this mechanism buy us a lot of performance in the first
> place? Because if not we might as well just remove it entirely.

The details for the space savings and moderate time cost are listed in
e5394794a5 (pack-objects: thread the path-based compression, 2025-05-16).
These improvements are on top of those from --name-hash-version=2 by
using a different way to focus delta calculations.

A larger, internal example for this can be seen in this table (based on
testing today):

Mode      |  Size    | Time
----------+----------+------
version 1 |  16.0 GB | 83min
version 2 |   9.9 GB | 77min
path walk |   6.4 GB | 74min

> But if the answer is "yes" then adding APIs around it feels like a good
> alternative.
Making the code less brittle to changes is good, but also I'm interested
in ways to improve our test infrastructure or adding defensive features.
I mentioned a couple of ideas in an earlier message.

Thanks,
-Stolee