Re: [PATCH v3] diff: ensure consistent diff behavior with ignore options

Junio C Hamano <gitster@xxxxxxxxx> · Wed, 06 Aug 2025 10:35:17 -0700

Lidong Yan <yldhome2d2@xxxxxxxxx> writes:

> In git-diff, options like `-w` and `-I<regex>` require comparing
> file contents to determine whether two files are the same, even when
> their SHA values differ. For options like `--raw`, `--name-status`,
> and `--name-only`, git-diff deliberately compares only the SHA values
> to determine whether two files are the same, for performance reasons.
> As a result, a file shown in `git diff --name-status` may not appear
> in `git diff --patch`.
>
> To quickly determine whether two files are identical, Add helper
> function diff_flush_patch_quiet() in diff.c. Add `.diff_optimize`
> field in `struct diff_options`. When `.diff_optimize` is set to
> `DIFF_OPT_DRY_RUN`, builtin_diff() will return immediately upon
> detecting any change. Call diff_flush_patch_quiet() to determine
> if we should flush `--raw`, `--name-only` or `--name-status` output.
>
> Signed-off-by: Jeff King <peff@xxxxxxxx>
> Signed-off-by: Lidong Yan <yldhome2d2@xxxxxxxxx>
> ---
>  diff.c                     | 67 +++++++++++++++++++++++++++++---------
>  diff.h                     |  5 +++
>  t/t4013-diff-various.sh    | 14 ++++++++
>  t/t4015-diff-whitespace.sh |  2 +-
>  xdiff-interface.h          |  6 ++--
>  5 files changed, 74 insertions(+), 20 deletions(-)

The code looks much easier to reason about than the previous rounds.

A few comments about the design.

 - Are there other possible values that might fit in this "optimize"
   member, and what kind of behaviour would they trigger, that we
   can envision?  I do not think of any and that is why the "enum
   diff_optimize" member in the diff_options structure smells more
   like a "bool dry_run".

   By the way, giving a member "diff_" prefix when the enclosing
   struct is clearly about "diff" by having a name "diff_options" is
   often a waste of readers' time.

 - It is unclear why the dry-run need to imply 0-line context.

 - diff_flush_patch_quietly() would be a better name for
   diff_flush_patch_quiet().

On to the details.

> diff --git a/diff.c b/diff.c
> index dca87e164f..5254ef9373 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -2444,6 +2444,15 @@ static int fn_out_consume(void *priv, char *line, unsigned long len)
>  	return 0;
>  }
>  
> +static int quick_consume(void *priv, char *line, unsigned long len)
> +{
> +	struct emit_callback *ecbdata = priv;
> +	struct diff_options *o = ecbdata->opt;
> +
> +	o->found_changes = 1;
> +	return 1;
> +}

OK, as a non-zero return value from consume callbacks is supposed
to signal an error and causes xdiff_outf() an early return, this
serves as a short-cut.  One downside is that we cannot truly notice
and signal an error to our callers, as we will see in a later hunk.

> @@ -3709,6 +3718,7 @@ static void builtin_diff(const char *name_a,
>  		xdemitconf_t xecfg;
>  		struct emit_callback ecbdata;
>  		const struct userdiff_funcname *pe;
> +		int dry_run = o->diff_optimize == DIFF_OPT_DRY_RUN;

And this screams that o->dry_run that is a Boolean may be
sufficient, but I could be missing obvious future enhancement
opportunities.

> @@ -3741,8 +3751,8 @@ static void builtin_diff(const char *name_a,
>  		xpp.ignore_regex_nr = o->ignore_regex_nr;
>  		xpp.anchors = o->anchors;
>  		xpp.anchors_nr = o->anchors_nr;
> -		xecfg.ctxlen = o->context;
> -		xecfg.interhunkctxlen = o->interhunkcontext;
> +		xecfg.ctxlen = dry_run ? 0 : o->context;
> +		xecfg.interhunkctxlen = dry_run ? 0 : o->interhunkcontext;

Unclear why.  I think you had a similar change with a comment ...

> @@ -3750,7 +3760,8 @@ static void builtin_diff(const char *name_a,
>  			xdiff_set_find_func(&xecfg, pe->pattern, pe->cflags);
>  
>  		diffopts = getenv("GIT_DIFF_OPTS");
> -		if (!diffopts)
> +		/* ignore ctxlen if we are in dry run mode */

... here, but this comment is totally useless.

> +		if (!diffopts || dry_run)
>  			;
>  		else if (skip_prefix(diffopts, "--unified=", &v))
>  			xecfg.ctxlen = strtoul(v, NULL, 10);

Anybody who can read the code can tell that dry_run disables the
xecfg.ctxlen handling we have below.  What the code does not tell,
hence the author of a patch must help the readers by writing *why*
ignoring patch context is safe, correct, necessary, and desirable
when the main non-dry-run invocation of the diff machinery, which
this dry-run mode is trying to help, may use some context lines.

It does not have to be done in in-code comment.  Especially because
the consequence of the design decision to "ignore context" appears
in two different places, the proposed log message would be a better
place to explain why it is a safe, correct, necessary and desirable
thing to do.

> -		if (xdi_diff_outf(&mf1, &mf2, NULL, fn_out_consume,
> -				  &ecbdata, &xpp, &xecfg))
> +		if (dry_run)
> +			xdi_diff_outf(&mf1, &mf2, NULL, quick_consume,
> +				      &ecbdata, &xpp, &xecfg);
> +		else if (xdi_diff_outf(&mf1, &mf2, NULL, fn_out_consume,
> +				       &ecbdata, &xpp, &xecfg))
>  			die("unable to generate diff for %s", one->path);

And this is the consequence of xdi_diff_outf() not having an
extensible way to report different "abnormal" conditions to its
caller.

In the normal case, all exceptional/abnormal conditions are error.
But in the dry-run case, "abnormal return is an error and we should
die" would not fit our purpose.  "Assume that non-zero 'success'
return is because our quick_consume() is signalling that it found
what it needs to find, and that is not an error" is what is going on
here.

I'll stop here for now.  Will continue later.

Thanks.