Re: [PATCH v4 1/1][GSOC] userdiff: extend Bash pattern to cover more shell function forms

Johannes Sixt <j6t@xxxxxxxx> · Tue, 6 May 2025 18:30:04 +0200

Am 30.03.25 um 15:39 schrieb Moumita:
> From: Moumita Dhar <dhar61595@xxxxxxxxx>
> 
> The previous function regex required explicit matching of function
> bodies using `{`, `(`, `((`, or `[[`, which caused several issues:
> 
> - It failed to capture valid functions where `{` was on the next line
>   due to line continuation (`\`).
> - It did not recognize functions with single  command body, such as
>   `x () echo hello`.

Good.

> Replacing the function body matching logic with `.*$`, ensures
> that everything on the function definition line is captured,
> aligning with other userdiff drivers and improving hunk headers in
> `git diff`.

Personally, I am a bit allergic against marketing speak. *Of course* do
we intend improve the code. No need to mention it.

> Additionally, the word regex is refined to better recognize shell
> syntax, including additional parameter expansion operators and
> command-line options, improving syntax-aware diffs.

Good. (But see above about "improving".)

> 
> Signed-off-by: Moumita Dhar <dhar61595@xxxxxxxxx>
> ---

> diff --git a/t/t4034-diff-words.sh b/t/t4034-diff-words.sh
> index f51d3557f1..0be647c2fb 100755
> --- a/t/t4034-diff-words.sh
> +++ b/t/t4034-diff-words.sh
> @@ -320,6 +320,7 @@ test_expect_success 'unset default driver' '
>  
>  test_language_driver ada
>  test_language_driver bibtex
> +test_language_driver bash
>  test_language_driver cpp
>  test_language_driver csharp
>  test_language_driver css

Including a test for word-diff is very much appreciated!

> diff --git a/t/t4034/bash/expect b/t/t4034/bash/expect
> new file mode 100644
> index 0000000000..a0f7cbd5a3
> --- /dev/null
> +++ b/t/t4034/bash/expect
> @@ -0,0 +1,30 @@
> +<BOLD>diff --git a/pre b/post<RESET>
> +<BOLD>index 09ac008..60ba6a2 100644<RESET>
> +<BOLD>--- a/pre<RESET>
> +<BOLD>+++ b/post<RESET>
> +<CYAN>@@ -1,25 +1,25 @@<RESET>
> +<RED>my_var<RESET><GREEN>new_var<RESET>=10
> +x=<RED>123<RESET><GREEN>456<RESET>
> +y=<RED>3.14<RESET><GREEN>2.71<RESET>
> +z=<RED>.5<RESET><GREEN>.75<RESET>

OK.

> +echo <RED>$USER<RESET><GREEN>$USERNAME<RESET>

This tests a typical variable expansion. Good.

> +${<RED>HOME<RESET><GREEN>HOMEDIR<RESET>}

A more elaborate variable expansion does not include the surrounding ${
}. Good.

> +if [ "<RED>$a<RESET><GREEN>$x<RESET>" == "<RED>$b<RESET><GREEN>$y<RESET>" ] || [ "<RED>$c<RESET><GREEN>$x<RESET>" != "<RED>$d<RESET><GREEN>$y<RESET>" ]; then echo "OK"; fi

This line also only tests variable expansions and is quite redundant. It
could test the operators that we see, but it doesn't. See below for
ideas how to do that.

And all from here...

> +((<RED>a<RESET><GREEN>x<RESET>+=<RED>b<RESET><GREEN>y<RESET>))
> +((<RED>a<RESET><GREEN>x<RESET>-=<RED>b<RESET><GREEN>y<RESET>))
> +$((<RED>a<RESET><GREEN>x<RESET><<<RED>b<RESET><GREEN>y<RESET>))
> +$((<RED>a<RESET><GREEN>x<RESET>>><RED>b<RESET><GREEN>y<RESET>))
> +${<RED>a<RESET><GREEN>x<RESET>:-<RED>b<RESET><GREEN>y<RESET>}
> +${<RED>a<RESET><GREEN>x<RESET>:=<RED>b<RESET><GREEN>y<RESET>}
> +${<RED>a<RESET><GREEN>x<RESET>##*/}
> +${<RED>a<RESET><GREEN>x<RESET>%.*}
> +${<RED>a<RESET><GREEN>x<RESET>%%.*}
> +${<RED>a<RESET><GREEN>x<RESET>^^}
> +${<RED>a<RESET><GREEN>x<RESET>,}
> +${<RED>a<RESET><GREEN>x<RESET>,,}

... to here also only test variable expansions, but not the operators.
As written, they are totally redundant and should be dropped or improved.

To test, for example, that '##' is kept together as an operator, you
have to have

   ${x#*/}

in the pre-image and

   ${x##*/}

in the post-image, so that we see

   ${x<RED>#<RESET><GREEN>##<RESET>*/}

in the result. If '##' were two tokens (due to a bug in the pattern), we
would see

   ${x#<GREEN>#<RESET>*/}

in the result.

You should go through all not-single-character operators and have a
pre-image and a post-image where one characters is added or removed. The
following is not a correct test:

pre:    ((x+=b))
post:   ((x-=b))
result: ((x<RED>+=<RESET><GREEN>-=<RESET>))

because we would see this result even if the pattern has a bug that
recognizes only one of += or -=, but would split the other one. A
correct test would be:

pre:    ((x+b))
post:   ((x+=b))
result: ((x<RED>+<RESET><GREEN>+=<RESET>))

> +${!<RED>a<RESET><GREEN>x<RESET>}

Here, you should have no '!' in the pre-image and the '!' in the post
image and not change the name. Then the test demonstrates that '!' is
its own token and not merged with '${' (if that is the intent of the
test; otherwise it is a redunant test).

> +${<RED>a<RESET><GREEN>x<RESET>[@]}

If you want to test that '@' is not merged with the brackets, then you
can have, for example '*' in the pre-image and '@' in the post image.

> +${<RED>a<RESET><GREEN>x<RESET>:?error message}

Either redundant or another two-character operator to test.

> +${<RED>a<RESET><GREEN>x<RESET>:2:3}

Redundant.

> +ls <RED>-a<RESET><GREEN>-x<RESET>
> +ls <RED>--a<RESET><GREEN>--x<RESET>

Both are good tests.

> diff --git a/userdiff.c b/userdiff.c
> index 340c4eb4f7..4c77c7e0f6 100644
> --- a/userdiff.c
> +++ b/userdiff.c
> @@ -64,15 +64,27 @@ PATTERNS("bash",
>  	     /* Bashism identifier with optional parentheses */
>  	     "(function[ \t]+[a-zA-Z_][a-zA-Z0-9_]*(([ \t]*\\([ \t]*\\))|([ \t]+))"
>  	 ")"
> -	 /* Optional whitespace */
> -	 "[ \t]*"
> -	 /* Compound command starting with `{`, `(`, `((` or `[[` */
> -	 "(\\{|\\(\\(?|\\[\\[)"
> +	 /* Everything after the function header is captured  */
> +	 ".*$"

I remember suggesting to capture everything after the function header.
However, If I am not mistaken, this does not do what I intended (and as
written it means that a pointless matching operation happens). The hunk
header shows everything that is in the outermost parentheses (group).
This catch-all, however, is even outside the outermost group and not
captured. It should be above the closing parenthesis that we see in the
context.

To test for this, you can have one test where "RIGHT" is not the
function name, but in the comment on the function header line.

>  	 /* End of captured text */
>  	 ")",
>  	 /* -- */
> -	 /* Characters not in the default $IFS value */
> -	 "[^ \t]+"),
> +	 /* Identifiers: variable and function names */
> +	  "[a-zA-Z_][a-zA-Z0-9_]*"
> +	 /* Numeric constants: integers and decimals */
> +	  "|[0-9]+(\\.[0-9]*)?|[-+]?\\.[0-9]+"
> +	 /* Shell variables: $VAR, ${VAR} */
> +	  "|\\$[a-zA-Z_][a-zA-Z0-9_]*|\\$\\{"
> +	  /* Logical and comparison operators */
> +	 "|\\|\\||&&|<<|>>|==|!=|<=|>="
> +	 /* Assignment and arithmetic operators */
> +	 "|[-+*/%&|^!=<>]=?"
> +	 /* Additional parameter expansion operators */
> +	 "|:?=|:-|:\\+|:\\?|:|#|##|%|%%|/[a-zA-Z0-9_-]+|\\^\\^?|,|,,?|!|@|:[0-9]+(:[0-9]+)?"

What is the purpose of this "/[a-zA-Z0-9_-]+"? It would mean, for
example that changes in alphanumeric path names would include the slash
in the changed part. I don't think this should be here.

> +	 /* Command-line options (to avoid splitting -option) */
> +	 "|--?[a-zA-Z0-9_-]+"
> +	 /* Brackets and grouping symbols */
> +	 "|\\(|\\)|\\{|\\}|\\[|\\]"),
>  PATTERNS("bibtex",
>  	 "(@[a-zA-Z]{1,}[ \t]*\\{{0,1}[ \t]*[^ \t\"@',\\#}{~%]*).*$",
>  	 /* -- */

-- Hannes