Re: buggy smudge/clean of empty files

Jeff King <peff@xxxxxxxx> · Thu, 22 May 2025 18:15:23 -0400

On Thu, May 22, 2025 at 03:01:38PM -0400, Joey Hess wrote:

> git seems to be buggy in its handling of empty files when smudge/clean filters
> are used.
> 
> I've attached a script setup_smudge_clean.sh, which configures a git
> repository to use a very simple smudge and clean filter pair for all
> files. The clean filter prepends a line "hi" to the file content, and the
> smudge filter removes the line. There is nothing very special about this
> smudge/clean, it's just a simple one for the sake of an example.
> 
> Here's the bug:

Thanks for a reproducible example. Running it through the debugger, I'd
guess the problem is in ce_match_stat_basic(), specifically this bit:

          /* Racily smudged entry? */
          if (!ce->ce_stat_data.sd_size) {
                  if (!is_empty_blob_oid(&ce->oid, the_repository->hash_algo))
                          changed |= DATA_CHANGED;
          }

That comes from f49c2c22fe (racy-git: an empty blob has a fixed object
name, 2008-06-10), which says:

      We use size=0 as the magic token to say the entry is known to be racily
      clean, but a sequence that does:
  
       - update the path with a non-empty blob and write the index;
       - update an unrelated path and write the index -- this smudges
         the above entry;
       - truncate the path to size zero.
  
      would make both the size field for the path in the index and the size on
      the filesystem zero.  We should not mistake it as a clean index entry.

but I suspect the is_empty_blob_oid() check is out of date for a world
with clean/smudge filters. The blob content inside the repository is
going to be "hi\n" in this case, so we will mark it as DATA_CHANGED. But
what we really want to know is: when smudged for the worktree, is the
content expected to be empty?

Something like the patch below, but it feels very dirty.

I wondered if we might be able to just catch these cases later in
diffcore (like we do for other stat-unmatch cases), but I do think this
conditional has false positives and false negatives. Your case is
confusing an empty file in the worktree which fails to match its smudged
content. But the opposite one is where the file should have content in
the worktree (due to smudging), but is cleaned to empty inside the
repository.

So I dunno. I'm hoping somebody more familiar with the index and/or
clean/smudge conversions can show a better way.

-Peff

diff --git a/read-cache.c b/read-cache.c
index 73f83a7e7a..0f19440514 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -48,6 +48,7 @@
 #include "csum-file.h"
 #include "promisor-remote.h"
 #include "hook.h"
+#include "convert.h"
 
 /* Mask for the name length in ce_flags in the on-disk index */
 
@@ -342,8 +343,37 @@ static int ce_match_stat_basic(const struct cache_entry *ce, struct stat *st)
 
 	/* Racily smudged entry? */
 	if (!ce->ce_stat_data.sd_size) {
-		if (!is_empty_blob_oid(&ce->oid, the_repository->hash_algo))
+		/*
+		 * Yuck, we'd really like to be able to ask if there is any
+		 * conversion configured so we can just check the oid in
+		 * the common non-smudge case. But there is no worktree
+		 * equivalent to would_convert_to_git().
+		 *
+		 * It would not be correct to check is_empty_blob_oid() first
+		 * here (and skip the more expensive check). I think that would
+		 * be wrong for cases where the clean in-repo blob is empty,
+		 * but the smudged version has data.
+		 */
+		char *data;
+		unsigned long len;
+		enum object_type type;
+		struct strbuf expected_wt = STRBUF_INIT;
+
+		/*
+		 * skip error handling for this example. What would we do? Set
+		 * DATA_CHANGED pessimistically?
+		 */
+		data = repo_read_object_file(the_repository,
+					     &ce->oid,
+					     &type, &len);
+		convert_to_working_tree(the_repository->index,
+					ce->name, data, len,
+					&expected_wt, NULL);
+
+		if (expected_wt.len)
 			changed |= DATA_CHANGED;
+		strbuf_release(&expected_wt);
+		free(data);
 	}
 
 	return changed;