Re: [PATCH v2 1/1] userdiff: extend Bash pattern to cover more shell function forms

Junio C Hamano <gitster@xxxxxxxxx> · Tue, 18 Feb 2025 11:30:23 -0800

Moumita <dhar61595@xxxxxxxxx> writes:

>  PATTERNS("bash",
> -	 /* Optional leading indentation */
> +     /* Optional leading indentation */

What is this change about?

>  	 "^[ \t]*"
> -	 /* Start of captured text */
> +	 /* Start of captured function name */
>  	 "("
>  	 "("
> -	     /* POSIX identifier with mandatory parentheses */
> -	     "[a-zA-Z_][a-zA-Z0-9_]*[ \t]*\\([ \t]*\\))"
> +		 /* POSIX identifier with mandatory parentheses (allow spaces inside) */
> +		 "[a-zA-Z_][a-zA-Z0-9_]*[ \t]*\\([ \t]*\\)"

Is this indentation change intended and required for this patch to work correctly?

>  	 "|"
> -	     /* Bashism identifier with optional parentheses */
> -	     "(function[ \t]+[a-zA-Z_][a-zA-Z0-9_]*(([ \t]*\\([ \t]*\\))|([ \t]+))"
> +		 /* Bash-style function definitions, allowing optional `function` keyword */
> +		 "(?:function[ \t]+(?=[a-zA-Z_]))?[a-zA-Z_][a-zA-Z0-9_]*(([ \t]*\\([ \t]*\\))|([ \t]+))?"

Ditto.

Regular expressions are a write-only language; please make sure that
you do not add any unnecessary changes that distract the eyes of
reviewers from spotting the _real_ changes that improve the current
codebase.

>  	 ")"
>  	 /* Optional whitespace */
>  	 "[ \t]*"
> -	 /* Compound command starting with `{`, `(`, `((` or `[[` */
> -	 "(\\{|\\(\\(?|\\[\\[)"
> -	 /* End of captured text */
> +	 /* Allow function body to start with `{`, `(` (subshell), `[[` */
> +	 "(\\{|\\(|\\[\\[)"
> +	 /* End of captured function name */
>  	 ")",

>  	 /* -- */
> -	 /* Characters not in the default $IFS value */
> -	 "[^ \t]+"),

We used to pretty much treat "a run of non-whitespace characters" as
a token.  Now we are a bit more picky.

Which may or may not be good, but it is hard to tell if it is an
improvement.
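
To make the difference concrete, here is a rough sketch that feeds one
made-up shell line through both word regexes with "grep -oE", which
uses POSIX extended regular expressions.  The sample line and the
variable names are hypothetical, the new regex is transcribed from the
hunk below with the C string escaping removed, and the one-character
fallback is an assumption meant to mirror what the PATTERNS() macro
appends to every word regex (its multi-byte UTF-8 part is left out).

```shell
#!/bin/sh
LC_ALL=C; export LC_ALL

# Hypothetical sample line, made up for this illustration.
line='count=$((count+1)) --verbose'

# Old word regex; [[:blank:]] stands in for the " \t" of the C string.
old_re='[^[:blank:]]+'

# Proposed word regex, transcribed from the patch (C escaping removed).
new_re='[a-zA-Z_][a-zA-Z0-9_]*'
new_re="$new_re"'|[-+]?[0-9]+(\.[0-9]*)?|[-+]?\.[0-9]+'
new_re="$new_re"'|\$[a-zA-Z_][a-zA-Z0-9_]*|\$\{[^}]+\}'
new_re="$new_re"'|\|\||&&|<<|>>|==|!=|<=|>='
new_re="$new_re"'|[-+*/%&|^!=<>]=?'
new_re="$new_re"'|--?[a-zA-Z0-9_-]+'
new_re="$new_re"'|\(|\)|\{|\}|\[|\]'

# Assumed single-character fallback (any one non-whitespace byte).
fallback='|[^[:space:]]'

# grep -o prints each match on its own line, i.e. one token per line.
old_tokens=$(printf '%s\n' "$line" | grep -oE -e "$old_re$fallback")
new_tokens=$(printf '%s\n' "$line" | grep -oE -e "$new_re$fallback")

echo "old tokens:"; printf '%s\n' "$old_tokens" | sed 's/^/  /'
echo "new tokens:"; printf '%s\n' "$new_tokens" | sed 's/^/  /'
```

With the old regex the line is two tokens; with the new one it is
split into ten (count, =, $, (, (, count, +1, ), ), --verbose), so a
word diff would highlight much smaller pieces of a changed line.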

> +	 /* Identifiers: variable and function names */
> +	 "[a-zA-Z_][a-zA-Z0-9_]*"
> +	 /* Numeric constants: integers and decimals */
> +	 "|[-+]?[0-9]+(\\.[0-9]*)?|[-+]?\\.[0-9]+"
> +	 /* Shell variables: `$VAR`, `${VAR}` */
> +	 "|\\$[a-zA-Z_][a-zA-Z0-9_]*|\\$\\{[^}]+\\}"
> +	 /* Logical and comparison operators */
> +	 "|\\|\\||&&|<<|>>|==|!=|<=|>="
> +	 /* Assignment and arithmetic operators */
> +	 "|[-+*/%&|^!=<>]=?"
> +	 /* Command-line options (to avoid splitting `-option`) */
> +	 "|--?[a-zA-Z0-9_-]+"
> +	 /* Brackets and grouping symbols */
> +	 "|\\(|\\)|\\{|\\}|\\[|\\]"),

The fact that this patch does not have any changes to the "t/"
hierarchy suggests to me that we do not have existing tests to see
how sample text files in the supported languages are tokenized
(otherwise the above changes would require adjusting such existing
tests), so I think it should be left outside of this topic, but I
wonder if adding such tests gives us a good way to demonstrate the
effect of these changes to userdiff patterns.

Thanks.