Re: [PATCH 2/3] sparse-checkout: add 'clean' command

Junio C Hamano <gitster@xxxxxxxxx> · Tue, 08 Jul 2025 14:20:27 -0700

"Derrick Stolee via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:

> From: Derrick Stolee <stolee@xxxxxxxxx>
>
> When users change their sparse-checkout definitions to add new
> directories and remove old ones, there may be a few reasons why
> directories no longer in scope remain (ignored or excluded files still
> exist, Windows handles are still open, etc.). When these files still
> exist, the sparse index feature notices that a tracked, but sparse,
> directory still exists on disk and thus the index expands. This causes a
> performance hit _and_ the advice printed isn't very helpful. Using 'git
> clean' isn't enough (generally '-dfx' may be needed) but also this may
> not be sufficient.
>
> Add a new subcommand to 'git sparse-checkout' that removes these
> tracked-but-sparse directories, including any excluded or ignored files

Do excluded files and ignored files form two separate sets, or are
they one and the same?  Are files that users forgot to add (e.g. a new
source file that would not match any patterns listed in .gitignore)
and object files left over from the previous compilation (most
likely matching *.o in .gitignore) treated the same way for the purpose
of determining if the directory that is no longer in the cone can be
removed?

> underneath. This is the most extreme method for doing this, but it works
> when the sparse-checkout is in cone mode and is expected to rescope
> based on directories, not files.
>
> Be sure to add a --dry-run option so users can predict what will be
> deleted. In general, output the directories that are being removed so
> users can know what was removed.

Hmph.  It would be safer to show not just the directories but which
excluded files are about to be lost, wouldn't it, especially when
the user is trying to play safe and see what potential damage they
are looking at?

Also even though ignored files are "ignored and expendable", nobody
marks their temporary file as "ignored but precious" (yet), so "it
is listed in .gitignore so we can safely remove it" may not be a
safe assumption for us to be making (yet).  Shouldn't we at least be
listing these ignored files in --dry-run output, next to those files
that the user may have forgotten to add?
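
To illustrate the kind of per-file listing meant here, below is a
minimal sketch, not code from the patch.  list_files_under() is a
made-up helper that walks a directory with plain POSIX calls and
prints one line per file without deleting anything; the "Would
remove" wording is borrowed from what "git clean -n" prints.  It
does not distinguish ignored files from merely untracked ones, which
a real implementation would probably want to show.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: print every non-directory path found under
 * "dir" to "out" and return how many were printed, or -1 on error.
 * Nothing is deleted.
 */
static int list_files_under(const char *dir, FILE *out)
{
	DIR *d = opendir(dir);
	struct dirent *de;
	int count = 0;

	if (!d)
		return -1;

	while ((de = readdir(d)) != NULL) {
		char path[4096];
		struct stat st;
		int n;

		if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
			continue;

		n = snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
		if (n < 0 || (size_t)n >= sizeof(path) ||
		    lstat(path, &st) < 0) {
			closedir(d);
			return -1;
		}

		if (S_ISDIR(st.st_mode)) {
			/* Recurse so that nested files are listed, too. */
			int sub = list_files_under(path, out);
			if (sub < 0) {
				closedir(d);
				return -1;
			}
			count += sub;
		} else {
			fprintf(out, "Would remove %s\n", path);
			count++;
		}
	}
	closedir(d);
	return count;
}
```

Calling list_files_under("some/sparse/dir", stdout) for each
directory that is about to go would give the user one line per file
at risk, instead of just the directory name.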

> Note that untracked directories remain. Further, directories that
> contain staged changes are not deleted. This is a detail that is partly
> hidden by the implementation which relies on collapsing the index to a
> sparse index in-memory and only deleting directories that are listed as
> sparse in the index. If a staged change exists, then that entry is not
> stored as a sparse tree entry and thus remains on-disk until committed
> or reset.

Removing untracked directories is a job for "clean -d", so it makes
sense for this new command not to touch them.  Not losing changes
that have already been added is just as bad as losing new files that
the user forgot to add, so it does make sense not to remove them.
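
The rule in the quoted description can be sketched as below.  This
is only an illustration under stated assumptions: "struct entry" is
a made-up stand-in for an index entry, not Git's cache_entry, and it
assumes that a sparse directory entry is recorded with the tree mode
040000 and a path that ends with a slash.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODE_TREE 0040000u	/* assumed mode of a sparse directory entry */

/* Hypothetical, simplified stand-in for an entry in the index. */
struct entry {
	const char *path;
	unsigned int mode;
};

/* A sparse directory: tree mode and a path with a trailing slash. */
static bool is_sparse_dir(const struct entry *e)
{
	size_t len = strlen(e->path);

	return e->mode == MODE_TREE && len > 0 && e->path[len - 1] == '/';
}

/*
 * Store in "out" the paths of entries that are sparse directories and
 * return how many were stored.  "out" must have room for "nr" pointers.
 */
static size_t collect_removable(const struct entry *entries, size_t nr,
				const char **out)
{
	size_t found = 0;

	for (size_t i = 0; i < nr; i++)
		if (is_sparse_dir(&entries[i]))
			out[found++] = entries[i].path;
	return found;
}
```

With entries "docs/" (040000), "old/keep.c" (100644, a staged change
that keeps "old/" from collapsing) and "src/main.c" (100644), only
"docs/" is collected, so "old/" stays on disk.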

I wonder if we need the "-x" and/or "-X" options that "clean" has
(and perhaps a "-d" that is a no-op, as the whole point of this
subcommand is about removing directories from the working tree) to
control its operation in a slightly finer-grained way.
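
For reference, the distinction these options draw in "git clean" can
be written down as a small decision function.  This is a sketch
only: the enum names and would_remove() are invented for
illustration and are not taken from Git's source or from the patch.

```c
#include <stdbool.h>

/* Hypothetical classification of a path that is not tracked. */
enum path_kind {
	PATH_UNTRACKED,	/* not matched by any ignore pattern */
	PATH_IGNORED	/* matched by an ignore pattern */
};

/* Hypothetical modes mirroring "git clean" with no option, -x and -X. */
enum clean_mode {
	CLEAN_DEFAULT,		/* untracked, non-ignored files only */
	CLEAN_ALSO_IGNORED,	/* -x: ignored files as well */
	CLEAN_ONLY_IGNORED	/* -X: ignored files only */
};

static bool would_remove(enum path_kind kind, enum clean_mode mode)
{
	switch (mode) {
	case CLEAN_DEFAULT:
		return kind == PATH_UNTRACKED;
	case CLEAN_ALSO_IGNORED:
		return true;
	case CLEAN_ONLY_IGNORED:
		return kind == PATH_IGNORED;
	}
	return false;
}
```

A source file the user forgot to add survives only in the "-X"
mode, while a leftover *.o goes away in both the "-x" and the "-X"
modes.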

> +	for (size_t i = 0; i < repo->index->cache_nr; i++) {
> +		DIR* dir;

The asterisk sticks to the variable, not the type, i.e.

		DIR *dir;

Thanks.