Re: [PATCH 2/3] path-walk: fix setup of pending objects

Derrick Stolee <stolee@xxxxxxxxx> · Thu, 21 Aug 2025 08:55:48 -0400

On 8/21/2025 4:01 AM, Patrick Steinhardt wrote:
> On Wed, Aug 20, 2025 at 03:42:11PM -0400, Derrick Stolee wrote:
>> On 8/20/2025 3:02 PM, Junio C Hamano wrote:
>>> "Derrick Stolee via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:

>> The key integration point is the "pending" list operating a bit
>> different from walking directly from tags or commits. I was trying
>> to reproduce the issue from all of those other sources before unlocking
>> the "singleton" nature of the problem, and failed to do so.
>>
>> The resolve-undo cache (REUC) is something that I had not tested
>> previously. Adding "git rm --cached x/y" to the test in the previous
>> case leads to the 'git fsck' call giving a "dangling blob" warning,
>> so that could be an interesting way to strengthen the test. Thanks,
> 
> I also wonder a bit about the future -- if we ever add a new source for
> pending objects, would the author have to amend "path-walk.c" to take
> this new pending source into account?

I don't think the risk comes from new ways to add pending objects,
but that the path-walk algorithm added an "optimization" without
appropriately modifying how it handled the pending objects. New
sources of pending objects shouldn't cause any issues, unless the
revision API itself changed substantially.

For example, in my draft for v2 I add changes to my test to account
for the REUC in the repack. The new logic for the path-walk feature
worked just fine. It does point out that 'git fsck' doesn't follow
links in the REUC and so reports the blob as dangling.

If we have a new source for pending objects in the context of
pack-objects or repack, one would hope that tests would be added to
demonstrate the behavior and then that could be double-checked with
GIT_TEST_PACK_PATH_WALK=1. One problem in this situation is that
the tests were not substantial enough for these sources. We are
correcting for that now. 
> I guess the answer is "yes", which does make me feel a bit uneasy as
> it is very easy to now corrupt the repository.

Perhaps this has always been very easy to make a bug that leads to
losing objects, but we're purposefully wary to touch the logic around
the repacking processes.

One feature that could provide increased confidence is some step that
double-checks the object list before and after a step like this to
make sure that no objects are dropped (or only the set we expect is
dropped). Something like comparing the object lists in the pack-indexes
before and after a repack. This might be too problematic to enable in
all cases, but could be enabled in the test suite and certain critical
places. Azure DevOps has something similar in its backend to prevent
any change to the Git object database that can't be undone, but these
steps happen asynchronously to "production data" so it may not be
appropriate for all server architectures.

Thanks,
-Stolee