Re: sparse-checkout and symlinks?

Elijah Newren <newren@xxxxxxxxx> · Sat, 10 May 2025 10:29:45 -0700

On Sat, May 10, 2025 at 9:05 AM Gabriel Scherer <Gabriel.Scherer@xxxxxxx> wrote:
>
> Dear git list,
>
> sparse-checkout interacts badly with symlinks within a git repository:
> if b/file is a symlink to a/file, and the user asks for a
> sparse-checkout with only b/, they get a dead link (b/file points to
> nothing).

That's what I'd expect.

> I initially assumed that replacing a file with a symlink to another file
> with the same content would not be observable by other users of the
> repository. This assumption is incorrect in the presence of sparse checkouts.
>
> I would find it natural to have sparse-checkout "follow symlinks". When
> checking b/file as the user requests, git would notice that it is a
> symlink and do one of the following:
>
> 1. if the link target a/file is not in the specified sparse checkout
> set, copy its content instead of creating a dead symlink
>     (Downside: this could lead to duplication if several in-checkout
> files point to a/file.)

And the file would immediately show as modified in status, which seems
like a rather negative surprise.  If someone does a `git add -u` or
similar, they'd convert the symlink to a regular file, which could be
another form of gotcha.

> 2. or add a/file to the sparse checkout set
>     (Note: simply checking it out silently is not enough as 'reapply'
> would then drop it)

This solution does not work with the default cone mode, which selects
whole directories rather than individual files such as a/file.  It
may also surprise users who expected the sparse checkout rules to be
something entirely under their control.

Both solutions would also interact rather poorly with sparse indexes;
either looks to me like a bit of a foot-gun for them.

> Does this sound reasonable to you? Would you have recommendations on
> what the interface for such a feature should look like?
> - which of the alternatives above would you recommend?

Honestly, neither.  The problem isn't limited to symlinks; some examples:
  * you could have a script in one part of the checkout that tries to
invoke a script in the other part
  * you could have a source code file inside the sparse checkout that
has a directive to include/import/require source code that lies
outside the sparse checkout
...and there are many other ways files could depend on others.

Symlinks are only special in that they require no programming or other
knowledge to determine that there is a dependency between files.

I'd rather continue to follow the expectation that users of sparse
checkouts need to determine the relevant set of dependencies and
determine which sparsity rules make sense in their repo.  I suspect
that each repo might be somewhat special here, and thus each might
have their own tool for creating sparse-checkouts using repo-specific
knowledge (e.g. "I want moduleA plus whatever it depends upon") which
their repo-specific tool then translates into the appropriate set of
paths or patterns to use.  symlinks would be just one of many kinds of
dependencies that such a tool would consider.  I understand that some
repos might be big enough that users want to use sparse-checkouts, but
not big enough that one of the developers wants to write such a tool.
Still, I'd rather not attempt dependency analysis in git[*], and
instead require the users to do the dependency analysis.
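
To make the "moduleA plus whatever it depends upon" idea concrete, here
is a minimal sketch of what such a repo-specific helper might look like.
Everything in it is hypothetical: the DEPS and DIR_OF tables, the module
and directory names, and the function name only stand in for whatever
dependency information a given repo actually has.

```python
# Hypothetical, repo-specific data: which modules each module needs,
# and which directory each module lives in.
DEPS = {
    "moduleA": ["libcore", "libnet"],
    "libnet": ["libcore"],
    "libcore": [],
}
DIR_OF = {
    "moduleA": "src/moduleA",
    "libnet": "lib/net",
    "libcore": "lib/core",
}


def sparse_dirs(module, deps=DEPS, dir_of=DIR_OF):
    """Return the directories of `module` and all its transitive dependencies."""
    seen = set()
    stack = [module]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already handled; this also guards against cycles
        seen.add(current)
        stack.extend(deps.get(current, ()))
    return sorted(dir_of[name] for name in seen)


if __name__ == "__main__":
    # The printed directories are meant to be handed to
    # `git sparse-checkout set`.
    print(" ".join(sparse_dirs("moduleA")))
```

The point of the sketch is only that the dependency knowledge lives in
the repo's own tooling, and git just receives the resulting list of
directories.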

> - should this be enabled only by a new configuration or command-line
> option (to which subcommand?), how would you name it?
>
> Thanks in advance
>
>
> ## More details on the use-case
>
> I'm trying to reduce the working directory size of a gigabyte-large git
> repository (https://github.com/typst/packages) which contains a substantial
> amount of duplicated files, by replacing duplicates with symlinks. The
> repository uses a continuous integration script to run automated tests
> on each proposed change, which uses sparse-checkout on only the
> directories listed as containing modified files. (The directories
> correspond to independent "packages" so it makes sense to check them
> separately.) This breaks when the modified directories contain symlinks
> to other, non-modified directories.

I know it's not quite what you want to hear, but I believe a better
solution here is to have your script check for the dependencies it
needs (via symlinks, in this case) and include those dependencies in
the sparse-checkout it creates.
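
As a rough illustration of that check, the sketch below reads the
symlinks straight from the commit (so nothing needs to be checked out
first) and reports the directories they point into.  The function names
are made up for this example, and it assumes it is run from the top of
the repository, that link targets are relative, and that a target is
not itself another symlink; a real script would need to decide how to
handle those cases.

```python
import posixpath
import subprocess


def symlink_entries(rev, dirs):
    """Yield (path, target) for every symlink that `rev` records under `dirs`."""
    out = subprocess.run(
        ["git", "ls-tree", "-r", "-z", rev, "--", *dirs],
        check=True, capture_output=True,
    ).stdout
    for record in out.split(b"\0"):
        if not record:
            continue
        # Each record is "<mode> <type> <object>\t<path>".
        meta, path = record.split(b"\t", 1)
        mode, _type, blob = meta.split()
        if mode == b"120000":  # git's mode for a symlink
            # The blob of a symlink holds the link target.
            target = subprocess.run(
                ["git", "cat-file", "blob", blob.decode()],
                check=True, capture_output=True,
            ).stdout
            yield path.decode(), target.decode()


def extra_dirs(entries, dirs):
    """Directories outside `dirs` that the symlinks in `entries` point into."""
    wanted = [d.strip("/") for d in dirs]
    extra = set()
    for path, target in entries:
        if posixpath.isabs(target):
            continue  # absolute target: not something git can check out
        resolved = posixpath.normpath(
            posixpath.join(posixpath.dirname(path), target))
        if resolved == ".." or resolved.startswith("../"):
            continue  # target escapes the repository root
        parent = posixpath.dirname(resolved)
        if parent == "":
            continue  # top-level file; cone mode always includes those
        if not any(parent == d or parent.startswith(d + "/") for d in wanted):
            extra.add(parent)
    return sorted(extra)
```

With your example, extra_dirs(symlink_entries("HEAD", ["b"]), ["b"])
would report the a directory, and the CI script could then run
`git sparse-checkout set b a` instead of asking for b alone.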

Hope that helps,
Elijah

[*] I'd add a slight carve-out to this statement if there were a
git-specific way to declare dependencies that we can then parse.  Such
a thing has been proposed before; see
https://lore.kernel.org/git/pull.627.git.1588857462.gitgitgadget@xxxxxxxxx/
.  However, multiple gotchas were identified that derailed that
proposal, so those would need some solutions.  Even if we were to do
that, though, you'd still have to specify the dependency explicitly in
some additional file rather than just depending upon the symlink.
Further, that particular proposal would only have worked with cone
mode, which goes against your specific request here.