On 8/21/2025 4:01 AM, Patrick Steinhardt wrote: > On Wed, Aug 20, 2025 at 06:39:55PM +0000, Derrick Stolee via GitGitGadget wrote: >> From: Derrick Stolee <stolee@xxxxxxxxx> >> >> The previous change established a buggy instance of 'git repack -adf >> --path-walk' when there exist paths that are tracked in the index and >> that is the only instance of those paths in the history of the >> repository. This change fixes that bug. >> >> The core problem here is that the "maybe_interesting" member of 'struct >> type_and_oid_list' is not initialized to '1'. This member was added in >> 6333e7ae0b (path-walk: mark trees and blobs as UNINTERESTING, >> 2024-12-20) in a way to help when creating packfiles for a small commit >> range using the sparse path algorithm (enabled by pack.useSparse=true). >> >> The idea here is that the list is marked as "maybe_interesting" if an >> object is added that does not have the UNINITERSTING flag on it. Later, > > s/UNINITERSTING/UNINTERESTING/ Thanks! >> this is checked again in case all objects in the list were marked >> UNINTERESTING after that point in time. In this case, the algorithm >> skips the list as there is no reason to visit it. >> >> This leads to the problem where the "maybe_interesting" member was not >> appropriately initialized when the list is created from pending objects. >> This is the fix for now. >> >> To help avoid this from happening in the future, a follow-up change will >> make initializing lists use a shared method instead of allowing for an >> update to this initialization process to miss some existing copies. > > Yeah, I wanted to say that this feels quite fragile to me and very easy > to miss. Does this mechanism buy us a lot of performance in the first > place? Because if not we might as well just remove it entirely. The details for the space savings and moderate time cost are listed in e5394794a5 (pack-objects: thread the path-based compression, 2025-05-16). These improvements are on top of those from --name-hash-version=2 by using a different way to focus delta calculations. A larger, internal example for this can be seen in this table (based on testing today): Mode | Size | Time ----------+----------+------ version 1 | 16.0 GB | 83min version 2 | 9.9 GB | 77min path walk | 6.4 GB | 74min > But if the answer is "yes" then adding APIs around it feels like a good > alternative. Making the code less brittle to changes is good, but also I'm interested in ways to improve our test infrastructure or adding defensive features. I mentioned a couple of ideas in an earlier message. Thanks, -Stolee