Re: [PATCH] xdiff: disable cleanup_records heuristic with --minimal

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Niels

On 25/04/2025 16:59, Niels Glodny wrote:
The cleanup_records function marks some lines as changed
before running the actual diff algorithm. For most lines,
this is a good performance optimization, but it also marks
lines that are surrounded by many changed lines as changed
as well. This can cause redundant changes and longer-than-
necessary diffs.

Nicely explained. We normally wrap commit messages at 72 characters

Whether this results in better-looking diffs is subjective.
However, the --minimal flag explicitly requests the shortest
possible diff.

Looking at the diff for 2ff58dec493 (refs: introduce function to batch refname availability checks, 2025-03-12) I'd say it is definitely less readable with this change but some of the changes in 320f2061b63 (t0602: use subshell to ensure working directory unchanged, 2025-02-28) are more readable. Anyway as you say if we promise to find the minimal diff then that is what we should do.
The performance impact of this change is negligible, and it
results in shorter diffs in about 1.3% of diffs in Git's
history.

Have you got any numbers for the performance change?

I think the premise of this patch is sound, I've left a couple of comments below as I think we can simplify the code changes.

diff --git a/t/t4071-diff-minimal.sh b/t/t4071-diff-minimal.sh
new file mode 100755
index 0000000000..3ad759dab4
--- /dev/null
+++ b/t/t4071-diff-minimal.sh
@@ -0,0 +1,16 @@
+#!/bin/sh
+
+test_description='minimal diff algorithm'
+
+. ./test-lib.sh
+
+test_expect_success 'minimal diff should not mark changes between changed lines' '
+	printf "x\nx\nx\nx\n" >pre &&
+	printf "x\nx\nx\nA\nB\nC\nD\nx\nE\nF\nG\n" >post &&
+	test_must_fail git diff --no-index \
+		--minimal pre post >diff &&
+	! grep "+x" diff &&
+	! grep "-x" diff
+'

Thanks for taking the trouble to add a new test. We have a few test helper functions which we could use here.

	test_write_lines x x x x >pre &&
	test_write_lines x x x A B C D x E F G >post &&
	test_expect_code 1 git diff --no-index --minimal pre post >diff &&
	test_grep ! ^[-+]x diff

test_expect_code ensures that the non-zero exit code is from the files not matching rather than another error like an invalid option or missing file. test_grep will display the file if there are matches which helps when debugging test failures.

> [...]> -static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xdf2) {
+static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1,
+			       xdfile_t *xdf2, int need_min) {
  	long i, nm, nreff, mlim;
  	xrecord_t **recs;
  	xdlclass_t *rcrec;
@@ -379,7 +383,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
xdlclassifier_t carries a copy of the flag that were interested in so I think at the start of xdl_cleanup_records we can add

	int need_min = !!(cf->flags & XDF_NEED_MINIMAL);

and then we don't need to change the signature.

  	for (i = xdf1->dstart, recs = &xdf1->recs[xdf1->dstart]; i <= xdf1->dend; i++, recs++) {
  		rcrec = cf->rcrecs[(*recs)->ha];
  		nm = rcrec ? rcrec->len2 : 0;
-		dis1[i] = (nm == 0) ? 0: (nm >= mlim) ? 2: 1;
+		dis1[i] = (nm == 0) ? 0: (nm >= mlim && !need_min) ? 2: 1;

If we want a minimal diff then we force dis1[i] to be 1 so that this line will never be marked as changed before searching for the longest common sequence. Well spotted.

Best Wishes

Phillip

  	}
if ((mlim = xdl_bogosqrt(xdf2->nrec)) > XDL_MAX_EQLIMIT)
@@ -387,7 +391,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
  	for (i = xdf2->dstart, recs = &xdf2->recs[xdf2->dstart]; i <= xdf2->dend; i++, recs++) {
  		rcrec = cf->rcrecs[(*recs)->ha];
  		nm = rcrec ? rcrec->len1 : 0;
-		dis2[i] = (nm == 0) ? 0: (nm >= mlim) ? 2: 1;
+		dis2[i] = (nm == 0) ? 0: (nm >= mlim && !need_min) ? 2: 1;
  	}
for (nreff = 0, i = xdf1->dstart, recs = &xdf1->recs[xdf1->dstart];
@@ -449,10 +453,10 @@ static int xdl_trim_ends(xdfile_t *xdf1, xdfile_t *xdf2) {
  }
-static int xdl_optimize_ctxs(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xdf2) {
-
+static int xdl_optimize_ctxs(xdlclassifier_t *cf, xdfile_t *xdf1,
+	xdfile_t *xdf2, int need_min) {
  	if (xdl_trim_ends(xdf1, xdf2) < 0 ||
-	    xdl_cleanup_records(cf, xdf1, xdf2) < 0) {
+	    xdl_cleanup_records(cf, xdf1, xdf2, need_min) < 0) {
return -1;
  	}

base-commit: f65182a99e545d2f2bc22e6c1c2da192133b16a3





[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux