On Tue, Jul 8, 2025 at 4:19 AM Derrick Stolee via GitGitGadget <gitgitgadget@xxxxxxxxx> wrote:
>
> When using cone-mode sparse-checkout, users specify which tracked
> directories they want (recursively) and any directory not part of the parent
> paths for those directories are considered "out of scope". When changing
> sparse-checkouts, there are a variety of reasons why these "out of scope"
> directories could remain, including:
>
>  * The user has .gitignore or .git/info/exclude files that tell Git to not
>    remove files of a certain type.
>  * Some filesystem blocker prevented the removal of a tracked file. This is
>    usually more of an issue on Windows where a read handle will block file
>    deletion.
>
> Typically, this would not mean too much for the user experience. A few extra
> filesystem checks might be required to satisfy git status commands, but the
> scope of the performance hit is relative to how many cruft files are left
> over in this situation.
>
> However, when using the sparse index, these tracked sparse directories cause
> significant performance issues. When noticing that the index contains a
> sparse directory but that directory exists on disk, Git needs to expand that
> sparse directory to determine which files are tracked or untracked. The
> current mechanism expands the entire index to a full one, an expensive
> operation that scales with the total number of paths at HEAD and not just
> the number of cruft files left over.
>
> Advice was added in 9479a31d603 (advice: warn when sparse index expands,
> 2024-07-08) to help users determine that they were in this state. However,
> the advice doesn't actually recommend helpful ways to get out of this state.
> Recommending "git clean" on its own is incomplete, as typically users
> actually need 'git clean -dfx' to clear out the ignored or excluded files.
> Even then, they may need 'git sparse-checkout reapply' afterwards to clear
> the sparse directories.
>
> The advice was successful in helping to alert users to the problem, which is
> how I got wind of many of these cases for how users get into this state.
> It's now time to give them a tool that helps them out of this state.
>
> This series adds a new 'git sparse-checkout clean' command that currently
> only works for cone-mode sparse-checkouts. The only thing it does is
> collapse the index to a sparse index (as much as possible) and make sure
> that any sparse directories are removed. These directories are listed to
> stdout.

But what does it clean up?
  - untracked files?
  - ignored files?
  - tracked-but-unmodified files?
  - tracked-and-modified files?
  - tracked-and-conflicted files? (which is probably a subset of
    tracked-and-modified, but thought I'd call it out)

Note: "tracked" probably has a slightly ambiguous connotation here
since we sometimes mean "is it in the index", and there's a difference
between "would it be in the sparse index" and "would it be in the
fully expanded index". Here, by "tracked" I mean the latter -- "is it
in the fully expanded index".

> A --dry-run option is available to list the directories that would be
> removed without actually deleting the directories.
>
> This option would be preferred to something like 'git clean -dfx' since it
> does not clear the excluded files that are still within the sparse-checkout.

This seems to suggest you are only interested in untracked and ignored
files. I'm sure that's by far the most common case, but I'm curious
about the others. Are you expecting users to sometimes need to run
both 'git sparse-checkout clean' and 'git sparse-checkout reapply'?

> Instead, it performs the exact filesystem operations required to refresh the
> sparse index performance back to what is expected.

But what operations are those and what is expected? As you mentioned
above, for untracked or ignored files, the expectation is that those
would be removed.
I think if there are tracked-but-unmodified files, I'd expect those to
be removed as well. If only the above filetypes exist, then we'd
expect the directory to be nuked and sparse index performance to be
improved back to "normal".

However, if there are tracked-and-modified files, I'd expect an error
and for the sparse index performance to continue to suffer until those
paths are resolved. (Or, pie-in-sky spitballing: maybe we could
attempt to do something smarter like make sibling directories to the
tracked-and-modified path be treated as sparse directories, so that
performance only suffers a little).

> I spent a few weeks debating with myself about whether or not this was the
> right interface, so please suggest alternatives if you have better ideas.
> Among my rejected ideas include:
>
>  * 'git sparse-checkout reapply -f -x' or similar augmentations of
>    'reapply'.

The connection to sparse-checkout reapply at least would make it
clearer what you are doing with tracked files, since its explanation
explicitly mentions those. However, reapply doesn't say anything
about untracked or ignored files, which we'd need to start explaining
and perhaps isn't as clean a fit, especially since your new usecase is
predominantly about untracked and ignored files. I don't have a
strong opinion here, but I think I also like your choice of a separate
'clean' subcommand better.

>  * 'git clean --sparse' to focus the clean operation on things outside of
>    the sparse-checkout.

Yeah, this choice would have likely prevented you from cleaning up
tracked files, and required users to run both 'clean --sparse' and
'sparse-checkout reapply'. And this command feels more tightly
connected to sparse-checkouts to me, so I wouldn't have liked this
choice either.

> The implementation is rather simple with the current CLI. Future
> augmentations could include a --quiet option to silence the output and a
> --verbose option to list the files that exist within each directory and
> would/will be removed.
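
Coming back to the per-file-type expectations above, they can be written down as a small decision table. This is only an illustrative Python sketch of what a user might expect, not how the patch implements anything; `PathState`, `REMOVABLE`, and `expected_outcome` are hypothetical names invented for this example, and "tracked" means "in the fully expanded index" as noted earlier.

```python
from enum import Enum, auto


class PathState(Enum):
    """Hypothetical classification of a file under an out-of-cone directory."""
    UNTRACKED = auto()
    IGNORED = auto()
    TRACKED_UNMODIFIED = auto()
    TRACKED_MODIFIED = auto()
    TRACKED_CONFLICTED = auto()


# Assumption from the discussion above: these states should not block
# removal of the directory they live in.
REMOVABLE = {
    PathState.UNTRACKED,
    PathState.IGNORED,
    PathState.TRACKED_UNMODIFIED,
}


def expected_outcome(states):
    """Return ("remove", []) if every file is removable, otherwise
    ("error", blocking) where blocking lists the distinct states that
    would keep the directory (and the slow index expansion) around."""
    blocking = sorted(
        {s for s in states if s not in REMOVABLE},
        key=lambda s: s.name,
    )
    if blocking:
        return ("error", blocking)
    return ("remove", [])
```

Under this reading, a directory holding only untracked and ignored files would be removed, while a single tracked-and-modified file would turn the whole directory into an error case until the user resolves it.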

I'm also curious what happens when (1) you are in cone mode and there
is no sparse index, or (2) when you are not in cone mode. I suspect
those and the questions above will be answered as I read the
individual patches, so I'll keep going...

> Thanks, -Stolee
>
> Derrick Stolee (3):
>   sparse-checkout: remove use of the_repository
>   sparse-checkout: add 'clean' command
>   sparse-index: point users to new 'clean' action
>
>  Documentation/git-sparse-checkout.adoc |  13 +-
>  builtin/sparse-checkout.c              | 192 +++++++++++++++++--------
>  sparse-index.c                         |   3 +-
>  t/t1091-sparse-checkout-builtin.sh     |  48 +++++++
>  4 files changed, 197 insertions(+), 59 deletions(-)
>
>
> base-commit: 8b6f19ccfc3aefbd0f22f6b7d56ad6a3fc5e4f37
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1941%2Fderrickstolee%2Fgit-sparse-checkout-clean-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1941/derrickstolee/git-sparse-checkout-clean-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/1941
> --
> gitgitgadget