Re: [PATCH 2/3] sparse-checkout: add 'clean' command

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 7/9/2025 1:35 PM, Elijah Newren wrote:
> On Wed, Jul 9, 2025 at 9:13 AM Derrick Stolee <stolee@xxxxxxxxx> wrote:
>>
>> On 7/8/2025 5:43 PM, Elijah Newren wrote:

>>>> +                                                    This subcommand requires cone-mode
>>>> +       sparse-checkout to be sure that we know which directories are
>>>> +       both tracked and all contained paths are not in the sparse-checkout.
>>>> +       This command can be used to be sure the sparse index works
>>>> +       efficiently.
>>>
>>> So...what does it do when in cone mode and the sparse index is not enabled?
>>
>> It doesn't effect the behavior, since we don't care about the on-disk
>> format and instead use an in-memory sparse index to determine which
>> directories to delete.
>>
>> There could be a benefit for users wanting to clean up extra files in
>> their worktree even if they are not using a sparse index. It is less
>> likely that they will discover that they are in that state if they
>> are not pestered by the index expansion advice message.
> 
> Right, for cone mode without the sparse index turned on, this new
> subcommand seems to be a silent no-op (other than burning some
> computation time),

No, it works without the sparse index off. The sparse index config
is about whether or not Git _writes_ the index in the sparse format.

This command works even if the sparse index is not enabled for the
written format, since we can manipulate the in-memory index for the
purpose of discovering which tracked directories should be sparse
and thus not in the worktree.

> despite the fact that the wording in the manual> might lead users to believe it will do some nice tidying for them.
> When the command spends some time working but doesn't do anything and
> doesn't report anything, the user might then think the command is just
> buggy.  I think the command should probably either (a) do some tidying
> for them, or (b) give them a warning that the command has no effect
> when sparse index is not turned on.  Thoughts?
So your concerns here should not be a problem, since the command
_does_ do the expected action even if index.sparse=false.

>>> I'm worried whether this is safe; if someone does a merge or rebase,
>>> there could be tracked-and-modified/conflicted files outside the
>>> sparse specification in the working tree.
>>
>> The conflicted files will not collapse to sparse directory entries.
>>
>> Does that ease your concern on that front?
> 
> Yes, that does ease my concerns...but it doesn't erase them.
> 
> If someone resolves the conflicted merge or rebase and commits (long
> before running this `git sparse-checkout clean` command), what happens
> to those paths?  Do these materialized paths persist in the worktree
> after the commit?  I know they did at some point in our
> implementation, and the current wording of `git sparse-checkout
> reapply` in the manual certainly suggests such paths may stick around
> until the user takes manual action:
> 
>            Reapply the sparsity pattern rules to paths in the working tree.
>            Commands like merge or rebase can materialize paths to do their
>            work (e.g. in order to show you a conflict), and other
>            sparse-checkout commands might fail to sparsify an individual file
>            (e.g. because it has unstaged changes or conflicts). In such cases,
>            it can make sense to run git sparse-checkout reapply later after
>            cleaning up affected paths (e.g. resolving conflicts, undoing or
>            committing changes, etc.).

I have a TODO to add test cases around this behavior, so we can
have concrete expectations. I'll incorporate them into the existing
test cases around merge conflicts.

This may be another example of Git leaving files around that should
be deleted in order to efficiently work with a sparse index.

> Now, if these paths stick around despite being outside the sparsity
> specification, users may decide to modify them.  And if they have
> modified them and run your command, won't you succeed in collapsing to
> a sparse directory tree?  And then wouldn't this cause unstaged
> changes to be discarded?  (I know this is a rare case, because users
> would probably only merge or rebase changes they made while in their
> sparse-checkout, and thus conflicts would generally be limited to the
> sparse-checkout, but there are at least three ways I can think of that
> conflicts could be triggered outside their sparse checkout, so I think
> it's a realistic scenario that we should think through.)
> 
> Throwing away unstaged changes is something that is usually gated
> behind a forcing flag (e.g. `git reset --hard` or `git checkout
> --force`).  I know, we currently also gate removal of untracked files
> in `git clean` behind a `--force` flag as well and I'm not concerned
> with throwing away untracked files with your command without a forcing
> flag, but I'm just wondering if we should be more careful with
> unstaged changes than with untracked files.
> 
> Perhaps everyone is fine with also throwing away unstaged changes as
> part of this command.  I could be convinced of that.  But if so, that
> should at least be called out rather explicitly in the commit message,
> the documentation, and the tests, whereas currently your patch is
> silent on anything other than untracked and ignored files.
> 
> (Or, if my memory is out-of-date about materialized paths persisting
> after their conflicts are resolved and a commit happens, then it'd
> probably be worth calling that out in the commit message and perhaps
> also updating some wording on the reapply subcommand.)

These are good considerations that I'll explore with tests (or point
to existing tests as I investigate). One thing to keep in mind is
that the SKIP_WORKTREE bit does some amount of ignoring the worktree
by not modifying what is there, and that may include some issues
around reporting the changes in 'status' or staging the changes in
'add'.

A lot of your concerns seem like they would be satisfied by providing
a verbose file-by-file output of what would be deleted and potentially
having --dry-run be the default mode.

>>> Even after resolving such a merge and committing, the paths may remain
>>> around until the user does a 'git sparse-checkout reapply' (I don't
>>> remember details here, but our documentation for reapply certainly
>>> says so), and since the file might stick around, the user may make
>>> further modifications to such a file.
>>>
>>> ...or will the convert_to_sparse() call above fail in all these cases?
>>>  If it does, should it give a better and more useful error message
>>> than "failed to convert index to a sparse index" and rather e.g. "path
>>> %s has modifications; please stage or revert first"?
>>
>> It won't fail. It just won't collapse as far.
> 
> Oh!  Based on this hint, I went and looked up the code for this; it's
> from convert_to_sparse_rec(), right?  I see something interesting
> there; does the present-despite-skipped checks (from 82386b44963f
> (Merge branch 'en/present-despite-skipped', 2022-03-09)) cause this
> collapsing to also fail for unstaged entries?  I.e. this part of
> convert_to_sparse_rec():
> 
>                 if (ce_stage(ce) ||
>                     S_ISGITLINK(ce->ce_mode) ||
>                     !(ce->ce_flags & CE_SKIP_WORKTREE))
>                         can_convert = 0;
> 
> The `ce_stage(ce)` part of it is what prevent it from collapsing when
> there are conflicts, and I think the `!(ce->ce_flags &
> CE_SKIP_WORKTREE))` would prevent it from collapsing any tracked files
> whatsoever, whether modified or not, due to the
> present-despite-skipped checks.  Does that sound right?

This matches my expectations.

> In other words, perhaps your clean command as implemented really does
> only handle untracked and ignored files, and if the user also has
> tracked-but-unmodified or tracked-with-unstaged-changes or
> tracked-with-staged-changes then this command won't actually restore
> performance for them until they _also_ run `git sparse-checkout
> reapply` ?

Worth adding testing to be sure, though I believe 'git sparse-checkout
clean' would help if they stage any changes (resolving conflicts, if
any) and commit those results. The files won't get cleaned by 'git
commit' but the clean operation should work after that.

>> You do make a good point that there could be extra help messages to say
>> that there are uncollapsed directories (detectable by seeing a blob path
>> with the skip-worktree bit on, maybe). I will think on this.
> 
> In order to detect this by the skip-worktree bit, you'd probably need
> to run it as part of the present-despite-skipped checks mentioned
> above (or run it before those checks clear the skip-worktree bit for
> present files).

I'll check in on this as I explore more deeply in the code.

Thanks for the careful review of these fine details!
-Stolee





[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux