Re: [PATCH v4 1/1][GSOC] userdiff: extend Bash pattern to cover more shell function forms

MOUMITA DHAR <dhar61595@xxxxxxxxx> · Sat, 10 May 2025 17:07:30 +0530



On Tue, 6 May 2025 at 22:00, Johannes Sixt <j6t@xxxxxxxx> wrote:
>
> Am 30.03.25 um 15:39 schrieb Moumita:
> > From: Moumita Dhar <dhar61595@xxxxxxxxx>
> >
> > The previous function regex required explicit matching of function
> > bodies using `{`, `(`, `((`, or `[[`, which caused several issues:
> >
> > - It failed to capture valid functions where `{` was on the next line
> >   due to line continuation (`\`).
> > - It did not recognize functions with single  command body, such as
> >   `x () echo hello`.
>
> Good.
>
> > Replacing the function body matching logic with `.*$`, ensures
> > that everything on the function definition line is captured,
> > aligning with other userdiff drivers and improving hunk headers in
> > `git diff`.
>
> Personally, I am a bit allergic against marketing speak. *Of course* do
> we intend improve the code. No need to mention it.
>
> > Additionally, the word regex is refined to better recognize shell
> > syntax, including additional parameter expansion operators and
> > command-line options, improving syntax-aware diffs.
>
> Good. (But see above about "improving".)
>
> > Yes I will be careful about the commit messages.

> > Signed-off-by: Moumita Dhar <dhar61595@xxxxxxxxx>
> > ---
>
> > diff --git a/t/t4034-diff-words.sh b/t/t4034-diff-words.sh
> > index f51d3557f1..0be647c2fb 100755
> > --- a/t/t4034-diff-words.sh
> > +++ b/t/t4034-diff-words.sh
> > @@ -320,6 +320,7 @@ test_expect_success 'unset default driver' '
> >
> >  test_language_driver ada
> >  test_language_driver bibtex
> > +test_language_driver bash
> >  test_language_driver cpp
> >  test_language_driver csharp
> >  test_language_driver css
>
> Including a test for word-diff is very much appreciated!
>
> > diff --git a/t/t4034/bash/expect b/t/t4034/bash/expect
> > new file mode 100644
> > index 0000000000..a0f7cbd5a3
> > --- /dev/null
> > +++ b/t/t4034/bash/expect
> > @@ -0,0 +1,30 @@
> > +<BOLD>diff --git a/pre b/post<RESET>
> > +<BOLD>index 09ac008..60ba6a2 100644<RESET>
> > +<BOLD>--- a/pre<RESET>
> > +<BOLD>+++ b/post<RESET>
> > +<CYAN>@@ -1,25 +1,25 @@<RESET>
> > +<RED>my_var<RESET><GREEN>new_var<RESET>=10
> > +x=<RED>123<RESET><GREEN>456<RESET>
> > +y=<RED>3.14<RESET><GREEN>2.71<RESET>
> > +z=<RED>.5<RESET><GREEN>.75<RESET>
>
> OK.
>
> > +echo <RED>$USER<RESET><GREEN>$USERNAME<RESET>
>
> This tests a typical variable expansion. Good.
>
> > +${<RED>HOME<RESET><GREEN>HOMEDIR<RESET>}
>
> A more elaborate variable expansion does not include the surrounding ${
> }. Good.
>
> > +if [ "<RED>$a<RESET><GREEN>$x<RESET>" == "<RED>$b<RESET><GREEN>$y<RESET>" ] || [ "<RED>$c<RESET><GREEN>$x<RESET>" != "<RED>$d<RESET><GREEN>$y<RESET>" ]; then echo "OK"; fi
>
> This line also only tests variable expansions and is quite redundant. It
> could test the operators that we see, but it doesn't. See below for
> ideas how to do that.
>
> And all from here...
>
> > +((<RED>a<RESET><GREEN>x<RESET>+=<RED>b<RESET><GREEN>y<RESET>))
> > +((<RED>a<RESET><GREEN>x<RESET>-=<RED>b<RESET><GREEN>y<RESET>))
> > +$((<RED>a<RESET><GREEN>x<RESET><<<RED>b<RESET><GREEN>y<RESET>))
> > +$((<RED>a<RESET><GREEN>x<RESET>>><RED>b<RESET><GREEN>y<RESET>))
> > +${<RED>a<RESET><GREEN>x<RESET>:-<RED>b<RESET><GREEN>y<RESET>}
> > +${<RED>a<RESET><GREEN>x<RESET>:=<RED>b<RESET><GREEN>y<RESET>}
> > +${<RED>a<RESET><GREEN>x<RESET>##*/}
> > +${<RED>a<RESET><GREEN>x<RESET>%.*}
> > +${<RED>a<RESET><GREEN>x<RESET>%%.*}
> > +${<RED>a<RESET><GREEN>x<RESET>^^}
> > +${<RED>a<RESET><GREEN>x<RESET>,}
> > +${<RED>a<RESET><GREEN>x<RESET>,,}
>
> ... to here also only test variable expansions, but not the operators.
> As written, they are totally redundant and should be dropped or improved.
>
> To test, for example, that '##' is kept together as an operator, you
> have to have
>
>    ${x#*/}
>
> in the pre-image and
>
>    ${x##*/}
>
> in the post-image, so that we see
>
>    ${x<RED>#<RESET><GREEN>##<RESET>*/}
>
> in the result. If '##' were two tokens (due to a bug in the pattern), we
> would see
>
>    ${x#<GREEN>#<RESET>*/}
>
> in the result.
>
> You should go through all not-single-character operators and have a
> pre-image and a post-image where one characters is added or removed. The
> following is not a correct test:
>
> pre:    ((x+=b))
> post:   ((x-=b))
> result: ((x<RED>+=<RESET><GREEN>-=<RESET>))
>
> because we would see this result even if the pattern has a bug that
> recognizes only one of += or -=, but would split the other one. A
> correct test would be:
>
> pre:    ((x+b))
> post:   ((x+=b))
> result: ((x<RED>+<RESET><GREEN>+=<RESET>))
>
> > +${!<RED>a<RESET><GREEN>x<RESET>}
>
> Here, you should have no '!' in the pre-image and the '!' in the post
> image and not change the name. Then the test demonstrates that '!' is
> its own token and not merged with '${' (if that is the intent of the
> test; otherwise it is a redunant test).
>
> > +${<RED>a<RESET><GREEN>x<RESET>[@]}
>
> If you want to test that '@' is not merged with the brackets, then you
> can have, for example '*' in the pre-image and '@' in the post image.

>
> > +${<RED>a<RESET><GREEN>x<RESET>:?error message}
>
> Either redundant or another two-character operator to test.

Ok I got it.
>
> > +${<RED>a<RESET><GREEN>x<RESET>:2:3}
>
> Redundant.
>
I understand I will remove it.

> > +ls <RED>-a<RESET><GREEN>-x<RESET>
> > +ls <RED>--a<RESET><GREEN>--x<RESET>
>
> Both are good tests.

Thank you. I understand the suggestions about the word diff tests and
I will remove the redundant ones and write the correct tests .
>
> > diff --git a/userdiff.c b/userdiff.c
> > index 340c4eb4f7..4c77c7e0f6 100644
> > --- a/userdiff.c
> > +++ b/userdiff.c
> > @@ -64,15 +64,27 @@ PATTERNS("bash",
> >            /* Bashism identifier with optional parentheses */
> >            "(function[ \t]+[a-zA-Z_][a-zA-Z0-9_]*(([ \t]*\\([ \t]*\\))|([ \t]+))"
> >        ")"
> > -      /* Optional whitespace */
> > -      "[ \t]*"
> > -      /* Compound command starting with `{`, `(`, `((` or `[[` */
> > -      "(\\{|\\(\\(?|\\[\\[)"
> > +      /* Everything after the function header is captured  */
> > +      ".*$"
>
> I remember suggesting to capture everything after the function header.
> However, If I am not mistaken, this does not do what I intended (and as
> written it means that a pointless matching operation happens). The hunk
> header shows everything that is in the outermost parentheses (group).
> This catch-all, however, is even outside the outermost group and not
> captured. It should be above the closing parenthesis that we see in the
> context.

> I am sorry I want to understand a thing  , we want everything from the function name to the end of the line to be the hunk header right ? So in the pattern "^[ \t]*"
/* Start of captured text */
"("
"("
     /* POSIX identifier with mandatory parentheses */
     "[a-zA-Z_][a-zA-Z0-9_]*[ \t]*\\([ \t]*\\))"
"|"
     /* Bashism identifier with optional parentheses */
     "(function[ \t]+[a-zA-Z_][a-zA-Z0-9_]*(([ \t]*\\([ \t]*\\))|([ \t]+))"
")"
/* Optional whitespace */
"[ \t]*"
/* Compound command starting with `{`, `(`, `((` or `[[` */
"(\\{|\\(\\(?|\\[\\[)"
/* End of captured text */
")"
if I replace
"[ \t]*"
"(\\{|\\(\\(?|\\[\\[)"  part with .*$ then will it not capture the
entire line ? I mean the outermost group ends here-

 /* End of captured text */
")"
right ? What am I getting wrong ?

> To test for this, you can have one test where "RIGHT" is not the
> function name, but in the comment on the function header line.
>
> >        /* End of captured text */
> >        ")",
> >        /* -- */
> > -      /* Characters not in the default $IFS value */
> > -      "[^ \t]+"),
> > +      /* Identifiers: variable and function names */
> > +       "[a-zA-Z_][a-zA-Z0-9_]*"
> > +      /* Numeric constants: integers and decimals */
> > +       "|[0-9]+(\\.[0-9]*)?|[-+]?\\.[0-9]+"
> > +      /* Shell variables: $VAR, ${VAR} */
> > +       "|\\$[a-zA-Z_][a-zA-Z0-9_]*|\\$\\{"
> > +       /* Logical and comparison operators */
> > +      "|\\|\\||&&|<<|>>|==|!=|<=|>="
> > +      /* Assignment and arithmetic operators */
> > +      "|[-+*/%&|^!=<>]=?"
> > +      /* Additional parameter expansion operators */
> > +      "|:?=|:-|:\\+|:\\?|:|#|##|%|%%|/[a-zA-Z0-9_-]+|\\^\\^?|,|,,?|!|@|:[0-9]+(:[0-9]+)?"
>
> What is the purpose of this "/[a-zA-Z0-9_-]+"? It would mean, for
> example that changes in alphanumeric path names would include the slash
> in the changed part. I don't think this should be here.

> Ok I got it.

> > +      /* Command-line options (to avoid splitting -option) */
> > +      "|--?[a-zA-Z0-9_-]+"
> > +      /* Brackets and grouping symbols */
> > +      "|\\(|\\)|\\{|\\}|\\[|\\]"),
> >  PATTERNS("bibtex",
> >        "(@[a-zA-Z]{1,}[ \t]*\\{{0,1}[ \t]*[^ \t\"@',\\#}{~%]*).*$",
> >        /* -- */
>
> -- Hannes
>