Re: [GSoC] Project Proposal: Machine-Readable Repository Information Query Tool

Patrick Steinhardt <ps@xxxxxx> · Thu, 3 Apr 2025 12:14:02 +0200

On Wed, Apr 02, 2025 at 03:22:11PM -0300, Lucas Seiki Oshiro wrote:
> ### Activity in the Git community in 2025
> 
> Since I decided to submit a proposal for GSoC, I have sent some patches
> to the Git codebase and git.github.io:
> 
> - My microproject, replacing some `test -f` by `test_path_is_file`:
>   https://lore.kernel.org/git/20250208165731.78804-1-lucasseikioshiro@xxxxxxxxx/;
> 
> - Adding a paragraph to the merge-strategies documentation describing how
>   Git merges submodules (based on the blog post that I mentioned
>   before):
>   https://lore.kernel.org/git/20250227014406.20527-1-lucasseikioshiro@xxxxxxxxx/;
>   
> - A patchset adding a new `--subject-extra-prefix` flag for `git
>   format-patch`, allowing the user to quickly prepend tags like [GSoC],
>   [Newbie] or [Outreachy] to the beginning of the subject. This patchset
>   was rejected in favor of just using `--subject-prefix='GSoC PATCH'` or
>   similar. It can be seen here:
>   https://lore.kernel.org/git/20250303220029.10716-1-lucasseikioshiro@xxxxxxxxx/;
> 
> - Given the feedback on the previous rejected patchset, I opened a Pull
>   Request on git.github.io replacing the occurrences of `[GSoC][PATCH]`
>   by `[GSoC PATCH]`;
>   
> - Adding a new userdiff driver for INI files, initially targeted at
>   gitconfig files. Currently it is still under revision:
>   https://lore.kernel.org/git/20250331031309.94682-1-lucasseikioshiro@xxxxxxxxx/.
> 
> Beyond contributions, I also helped people on the mailing list who
> needed assistance with Git documentation.

Could you please also amend the status (merged to master, merged to
next, under discussion) for each of these items?

> ## Project Proposal
> 
> Based on the information provided in
> https://git.github.io/SoC-2025-Ideas/, the goal of this project is to
> create a new Git command for querying information from a repository and
> returning it in a semi-structured data format, as JSON output.
> 
> In the scope of this project, the JSON output will only include data
> that can currently be retrieved through existing Git commands, for
> example:
> 
> - `git branch`: information about branches, such as the commit that each
>   branch currently references and their upstreams;
> 
> - `git tag`: information about the tags, such as the author or commit
>   date and the messages they hold (in the case of annotated tags);
> 
> - `git remote`: the URL of each remote;
> 
> - `git log`: statistics about the commit history, such as the
>   distribution of commits over time and by author, and the distribution
>   of lines changed by each author;
> 
> - `git submodule`: information about the submodules, mainly the commits
>   that they are referencing and their remote URLs;
> 
> - `git rev-parse`: the current branch name, the current commit, the path
>   of the repository top level directory, whether the repository is bare
>   and whether it is under bisection.
> 
> Given that the information that we want to compile is currently
> accessible only through different commands with different sets of flags,
> a user who wants to read it needs advanced knowledge of Git. With the
> repository details consolidated in a single command, the user will be
> able to quickly retrieve what they want without navigating a complex
> combination of commands and flags.

I already noticed this in another proposal, but the idea seems to be a
bit underspecified. The goal isn't to make _all_ information about the
repository accessible. It's rather that we want to give a better home to
information about the underlying repository itself. To clarify further,
I'm talking about information like:

  - Which object hash does the repository use?
  - What is the ref database format?
  - Where is the Git directory?
  - Where is the common directory?
  - What is the top-level directory?

This kind of information is exposed via git-rev-parse(1) already, see
the section "Options for Files". But git-rev-parse(1) is not really a
good match at all given that its main intent is to parse revisions. Over
time though it developed into a kind of grab-bag of different unrelated
functionality that we didn't really have a nice home for elsewhere.
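
To illustrate, here is a rough sketch (not a proposed design) of what a
script has to do today to collect the values listed above. It is written
in Python for brevity. The JSON key names and the helper functions are
made up for this example; only the git-rev-parse(1) options are real.
Note that `--show-ref-format` requires a reasonably recent Git version
and that `--show-toplevel` fails in a bare repository, which the sketch
reports as null.

```python
import json
import subprocess

# Hypothetical JSON keys mapped to existing git-rev-parse(1) options.
QUERIES = {
    "object_format": "--show-object-format",
    "ref_format": "--show-ref-format",
    "git_dir": "--git-dir",
    "common_dir": "--git-common-dir",
    "toplevel": "--show-toplevel",
}


def run_rev_parse(option, cwd="."):
    """Run `git rev-parse <option>`; return its output or None on failure."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", option],
            cwd=cwd,
            capture_output=True,
            text=True,
            check=True,
        )
    except (subprocess.CalledProcessError, OSError):
        # Not a repository, option unsupported, bare repo, or git missing.
        return None
    return result.stdout.strip()


def collect_repo_info(run=run_rev_parse):
    """Issue one query per key and gather the answers in a dictionary."""
    return {key: run(option) for key, option in QUERIES.items()}


if __name__ == "__main__":
    print(json.dumps(collect_repo_info(), indent=2))
```

Every value needs its own option, and the caller has to know that these
options live in a command that is nominally about parsing revisions.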

> ### Development plan
> 
> Since this is a new command that is not directly related to any specific
> existing command, it will probably be placed in a new file inside the
> `builtin` directory.
> 
> The functionality of this command can be divided into two categories:
> 
> 1. **Data gathering**: retrieving data from different sources, calling
>    existing functions and reading data structures declared in other
>    files;
> 
> 2. **Data serialization**: formatting the gathered data in a JSON
>    format. This represents two challenges: generating the JSON itself
>    and designing the schema for how the desired data will be presented.
>    
> Since the exported data is already provided by other Git commands, it
> probably won't be difficult to implement the data-gathering side of the
> functionality. The main task would be inspecting the existing codebase
> and finding the functions and data structures that will feed our output.
> 
> Designing the schema, however, requires special planning, as the
> flexibility of semi-structured data like JSON may lead to bad early
> decisions. A solution may emerge by analysing other software that
> exports JSON as metadata.
> 
> ### Schedule
> 
> 1. **Now -- May 5th**: Requirements gathering
>    - Inspect codebases that use Git as a data source;
>    - Contact academic researchers on FLOSS;
>    - Contact industry infrastructure professionals;
> 
> 2. **May 6th -- June 1st**: Community bonding
>    - Get in touch with the mentors;
>    - Present to the community a first proposal of the JSON schema;
>    - Receive feedback from the community about the schema;
>    - Present a first proposal on the command line interface;
>    - Receive feedback from the community about the command line
>      interface;
> 
> 3. **June 2nd -- July 14th**: First coding round
>    - Write data structures that correspond to the presented JSON schema;
>    - Fill the data structures with data obtained from routines of the
>      existing codebase;
> 
> 4. **July 15th -- August 25th**: Second coding round
>    - Implement the command line interface option handlers;
>    - Write the JSON serializer.

I generally recommend that students take on smaller batches of work that
can be submitted individually. The way it is structured now means that
you will end up with a single deliverable at the end of your project.
But structuring the project like that introduces a high risk that you
won't be able to land anything until the end of your project in case
there is a bigger discussion around parts of these patches.

Instead, it would make sense to identify smaller batches of work that
are self-contained enough to be submitted upstream. This ensures that
you get early feedback and that you can iterate on your design as early
as possible in the project.

Patrick