Re: [GSoC PATCH v9 2/5] repo: add the field references.format

Eric Sunshine <sunshine@xxxxxxxxxxxxxx> · Wed, 13 Aug 2025 17:46:15 -0400

On Wed, Aug 13, 2025 at 5:18 PM Lucas Seiki Oshiro
<lucasseikioshiro@xxxxxxxxx> wrote:
> > What's the reason for this?
>
> Basically, filter out duplicated keys. This is also helpful for not
> repeating the same "key not found" multiple times, as suggested by
> Eric [1].

The suggestion you cite has relevance only as long as deduplication is
the chosen implementation scheme, however, Phillip is arguing that the
deduplication and key reordering logic should be dropped, hence, the
cited reference isn't relevant in light of Phillip's suggestion.

> > If I query three keys from a script then it is much easier to parse
> > the output if I know the keys are going to appear in the same order
> > that they were on the command line.
>
> This assumption would be a little bit broken as one can ask an invalid
> key. In this case, this command will print the error to stderr, and
> proceed to the next value.

Yes and no. While it's true that a caller might ask for an invalid
key, the primary (and useful) purpose of this command is to facilitate
scripting. Once the script author has "debugged" the call to `git
repo`, then the output will be predictable. Hence, although you make a
fair point, it's not a strong argument against Phillip's
recommendation to drop the deduplication and key re-ordering logic.

> > If the command re-orders them my script now has to check the value of
> > each key which results in a bunch of unnecessary string comparisons
> > because it cannot determine the key from the position in the output.
>
> In cases where the client don't want to compare strings, it is still
> possible to ask one key at time, just like other Git commands (e.g.
> git var, git config). Since this command won't return too many values,
> it would be ok even if the user requests all the possible keys.

Generally speaking, process creation is slow. Process creation on
Microsoft Windows is especially slow, excruciatingly so. Authors of
tooling around Git often pay close attention to such matters because
they don't want the functionality provided by their tooling to be
slow, so we ought to be weary of a counterargument (such as the one
above) which suggests simply running the command multiple times, once
for each item.

> > While we were producing json output there was a need to de-duplicate
> > the keys when that output format was selected. However, we no-longer
> > produce json and in any case de-duplication could have been achieved
> > without sorting the input keys by using a hash table, or, as there is
> > a small fixed number of keys, an array that records the keys we've
> > already seen.
>
> I still think that it would over-engineer this command.

I don't think that Phillip was suggesting dropping only the reordering
while keeping the deduplication; he was merely giving an example of an
alternative implementation which would accomplish the deduplication
goal, so he wasn't asking to over-engineer. Instead, (according to my
reading), he is suggesting dropping both deduplication and reordering.

> If I follow
> this path of returning the values in the same order they were in the
> command line, I think it would be better to just allow duplicated keys
> and multiple "key not found" errors for the same unknown key instead
> of increasing the complexity of this command.
>
> What do you think?

I think that's exactly what Phillip was suggesting: present output in
order requested, no deduplication

I had suggested the same back in [*], but I also said that I could
formulate arguments in favor of either behavior, so I didn't have a
strong opinion. However, Phillip has presented a good reason to prefer
"output in order requested, no deduplication", and I do find his
argument compelling.

[*]: https://lore.kernel.org/git/CAPig+cTuiUy=+2Jf1Lrp1gaM03_zPf8EFMVSKmShqU05t-3aWQ@xxxxxxxxxxxxxx/