Re: [PATCH 0/3] sparse-checkout: add 'clean' command

On Tue, Jul 8, 2025 at 4:19 AM Derrick Stolee via GitGitGadget
<gitgitgadget@xxxxxxxxx> wrote:
>
> When using cone-mode sparse-checkout, users specify which tracked
> directories they want (recursively) and any directory not part of the parent
> paths for those directories are considered "out of scope". When changing
> sparse-checkouts, there are a variety of reasons why these "out of scope"
> directories could remain, including:
>
>  * The user has .gitignore or .git/info/exclude files that tell Git to not
>    remove files of a certain type.
>  * Some filesystem blocker prevented the removal of a tracked file. This is
>    usually more of an issue on Windows where a read handle will block file
>    deletion.
>
> Typically, this would not mean too much for the user experience. A few extra
> filesystem checks might be required to satisfy git status commands, but the
> scope of the performance hit is relative to how many cruft files are left
> over in this situation.
>
> However, when using the sparse index, these tracked sparse directories cause
> significant performance issues. When noticing that the index contains a
> sparse directory but that directory exists on disk, Git needs to expand that
> sparse directory to determine which files are tracked or untracked. The
> current mechanism expands the entire index to a full one, an expensive
> operation that scales with the total number of paths at HEAD and not just
> the number of cruft files left over.
>
> Advice was added in 9479a31d603 (advice: warn when sparse index expands,
> 2024-07-08) to help users determine that they were in this state. However,
> the advice doesn't actually recommend helpful ways to get out of this state.
> Recommending "git clean" on its own is incomplete, as typically users
> actually need 'git clean -dfx' to clear out the ignored or excluded files.
> Even then, they may need 'git sparse-checkout reapply' afterwards to clear
> the sparse directories.
>
> The advice was successful in helping to alert users to the problem, which is
> how I got wind of many of these cases for how users get into this state.
> It's now time to give them a tool that helps them out of this state.
>
> This series adds a new 'git sparse-checkout clean' command that currently
> only works for cone-mode sparse-checkouts. The only thing it does is
> collapse the index to a sparse index (as much as possible) and make sure
> that any sparse directories are removed. These directories are listed to
> stdout.

But what does it clean up?
  - untracked files?
  - ignored files?
  - tracked-but-unmodified files?
  - tracked-and-modified files?
  - tracked-and-conflicted files? (which is probably a subset of
tracked-and-modified, but thought I'd call it out)

Note: "tracked" probably has a slightly ambiguous connotation here
since we sometimes mean "is it in the index", and there's a difference
between "would it be in the sparse index" and "would it be in the
fully expanded index".  Here, by "tracked" I mean the latter -- "is it
in the fully expanded index".
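To make that distinction concrete, here is a toy model. It is not
Git's actual data structure: FULL_INDEX, SPARSE_INDEX and both helper
names are made up for illustration. The only thing it assumes about a
sparse index is that a collapsed directory is recorded as a single
entry whose path ends in '/'.

```python
# Toy model only: real index entries also carry modes, object IDs and flags.
FULL_INDEX = {"README", "src/main.c", "docs/guide.txt", "docs/api/ref.txt"}

# Same content, with docs/ collapsed into one sparse directory entry.
SPARSE_INDEX = {"README", "src/main.c", "docs/"}


def in_sparse_index(path, index=SPARSE_INDEX):
    """True only if the path itself appears as an entry (literal lookup)."""
    return path in index


def tracked(path, index=FULL_INDEX):
    """'Tracked' as used in this mail: present in the fully expanded index."""
    return path in index


# docs/guide.txt is tracked, yet the sparse index has no entry for it.
print(tracked("docs/guide.txt"), in_sparse_index("docs/guide.txt"))
```

In this model a file under docs/ counts as tracked even though only
the "docs/" entry is visible in the sparse index.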

> A --dry-run option is available to list the directories that would be
> removed without actually deleting the directories.
>
> This option would be preferred to something like 'git clean -dfx' since it
> does not clear the excluded files that are still within the sparse-checkout.

This seems to suggest you are only interested in untracked and ignored
files.  I'm sure that's by far the most common case, but I'm curious
about the others.  Are you expecting users to sometimes need to run
both 'git sparse-checkout clean' and 'git sparse-checkout reapply'?

> Instead, it performs the exact filesystem operations required to refresh the
> sparse index performance back to what is expected.

But what operations are those and what is expected?

As you mentioned above, for untracked or ignored files, the
expectation is that those would be removed.

I think if there are tracked-but-unmodified files, I'd expect those to
be removed as well.

If only the above filetypes exist, then we'd expect the directory to
be nuked and sparse index performance to be improved back to "normal".

However, if there are tracked-and-modified files, I'd expect an error
and for the sparse index performance to continue to suffer until those
paths are resolved.  (Or, pie-in-sky spitballing: maybe we could
attempt something smarter, like treating the sibling directories of
the tracked-and-modified path as sparse directories, so that
performance only suffers a little.)
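
The expectations in the last few paragraphs can be written down as a
small decision sketch. This models the behavior being asked about, not
what the patch implements; Kind, REMOVABLE and expected_outcome are
hypothetical names, and treating conflicted files like modified ones
is an assumption based on the "probably a subset" remark above.

```python
from enum import Enum


class Kind(Enum):
    UNTRACKED = "untracked"
    IGNORED = "ignored"
    TRACKED_UNMODIFIED = "tracked-but-unmodified"
    TRACKED_MODIFIED = "tracked-and-modified"
    # Assumption: handled like TRACKED_MODIFIED.
    TRACKED_CONFLICTED = "tracked-and-conflicted"


# Kinds of files expected to be deleted without complaint.
REMOVABLE = {Kind.UNTRACKED, Kind.IGNORED, Kind.TRACKED_UNMODIFIED}


def expected_outcome(kinds):
    """Expected result for one out-of-cone directory.

    Returns ("remove", []) when every kind present is removable,
    otherwise ("error", blockers) with the kinds that block removal.
    """
    blockers = sorted((k for k in kinds if k not in REMOVABLE),
                      key=lambda k: k.value)
    if blockers:
        return ("error", blockers)
    return ("remove", [])


print(expected_outcome({Kind.UNTRACKED, Kind.IGNORED}))
print(expected_outcome({Kind.IGNORED, Kind.TRACKED_MODIFIED}))
```

The first call models a directory holding only cruft, which would be
removed; the second models a directory that would be left in place
with an error until the modified path is resolved.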

> I spent a few weeks debating with myself about whether or not this was the
> right interface, so please suggest alternatives if you have better ideas.
> Among my rejected ideas include:
>
>  * 'git sparse-checkout reapply -f -x' or similar augmentations of
>    'reapply'.

The connection to sparse-checkout reapply at least would make it
clearer what you are doing with tracked files, since its explanation
explicitly mentions those.  However, reapply doesn't say anything
about untracked or ignored files, which we'd need to start explaining
and perhaps isn't as clean a fit, especially since your new use case is
predominantly about untracked and ignored files.  I don't have a
strong opinion here, but I think I also like your choice of a separate
'clean' subcommand better.

>  * 'git clean --sparse' to focus the clean operation on things outside of
>    the sparse-checkout.

Yeah, this choice would have likely prevented you from cleaning up
tracked files, and required users to run both 'clean --sparse' and
'sparse-checkout reapply'.  And this command feels more tightly
connected to sparse-checkouts to me, so I wouldn't have liked this
choice either.

> The implementation is rather simple with the current CLI. Future
> augmentations could include a --quiet option to silence the output and a
> --verbose option to list the files that exist within each directory and
> would/will be removed.

I'm also curious what happens when (1) you are in cone mode and there
is no sparse index, or (2) when you are not in cone mode.  I suspect
those and the questions above will be answered as I read the
individual patches, so I'll keep going...

> Thanks, -Stolee
>
> Derrick Stolee (3):
>   sparse-checkout: remove use of the_repository
>   sparse-checkout: add 'clean' command
>   sparse-index: point users to new 'clean' action
>
>  Documentation/git-sparse-checkout.adoc |  13 +-
>  builtin/sparse-checkout.c              | 192 +++++++++++++++++--------
>  sparse-index.c                         |   3 +-
>  t/t1091-sparse-checkout-builtin.sh     |  48 +++++++
>  4 files changed, 197 insertions(+), 59 deletions(-)
>
>
> base-commit: 8b6f19ccfc3aefbd0f22f6b7d56ad6a3fc5e4f37
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1941%2Fderrickstolee%2Fgit-sparse-checkout-clean-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1941/derrickstolee/git-sparse-checkout-clean-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/1941
> --
> gitgitgadget




