On Wed, Apr 02, 2025 at 03:22:11PM -0300, Lucas Seiki Oshiro wrote:
> ### Activity in the Git community in 2025
>
> Since when I decided to submit a proposal for GSoC, I sent some patches
> to the Git codebase and git.github.io:
>
> - My microproject, replacing some `test -f` by `test_path_is_file`:
>   https://lore.kernel.org/git/20250208165731.78804-1-lucasseikioshiro@xxxxxxxxx/;
>
> - Adding a paragraph to the merge-strategies documentation describing how
>   Git merges submodules (based on the blog post that I mentioned
>   before):
>   https://lore.kernel.org/git/20250227014406.20527-1-lucasseikioshiro@xxxxxxxxx/;
>
> - A patchset adding a new `--subject-extra-prefix` flag for `git
>   format-patch`, allowing the user to quickly prepend tags like [GSoC],
>   [Newbie] or [Outreachy] to the beginning of the subject. This patchset
>   was rejected in favor of just using `--subject-prefix='GSoC PATCH'` or
>   similar. It can be seen here:
>   https://lore.kernel.org/git/20250303220029.10716-1-lucasseikioshiro@xxxxxxxxx/;
>
> - Given the feedback on the previous rejected patchset, I opened a Pull
>   Request on git.github.io replacing the occurrences of `[GSoC][PATCH]`
>   by `[GSoC PATCH]`;
>
> - Adding a new userdiff driver for INI files, initially target for
>   gitconfig files. Currently it is still under revision:
>   https://lore.kernel.org/git/20250331031309.94682-1-lucasseikioshiro@xxxxxxxxx/.
>
> Beyond contributions, I also helped people on the mailing list that
> needed assistance on Git documentation.

Could you please also amend the status (merged to master, merged to
next, under discussion) for each of these items?

> ## Project Proposal
>
> Based on the information provided in
> https://git.github.io/SoC-2025-Ideas/, the goal of this project is to
> create a new Git command for querying information from a repository and
> returning it as a semi-structured data format as a JSON output.
>
> In the scope of this project, the JSON output will only include data
> that can currently be retrieved through existing Git commands, for
> example:
>
> - `git branch`: information about branches, such as the commit that each
>   branch currently references and their upstreams;
>
> - `git tag`: information about the tags, such as the author or commit
>   date and the messages they hold (in the case of annotated tags);
>
> - `git remote`: the URL of each remote;
>
> - `git log`: statistics about the commit history, such of the
>   distribution of commits over time and by author, the distribution of
>   lines changed by each author;
>
> - `git submodule`: information about the submodules, mainly the commits
>   that they are referencing and their remote URLs;
>
> - `git rev-parse`: the current branch name, the current commit, the path
>   of the repository top level directory, if the repository is a bare
>   repository or if the repository is under bisection.
>
> Given that the information that we want to compile are currently
> accessible only through different commands with different sets of flags,
> the user that wants to read them needs to have an advanced knowledge on
> Git. Once having the repository details consolidated in a single
> command, the user will be able to quickly retrieve what it desires
> without navigating a complex combination of commands and flags.

I already noticed this in another proposal, but the idea seems a bit
underspecified. The idea isn't to make _all_ information about the
repository accessible. It's rather that we want to give a better home to
information about the underlying repository itself.

To clarify further, I'm talking about information like:

  - Which object hash does the repository use?

  - What is the ref database format?

  - Where is the Git directory?

  - Where is the common directory?

  - What is the top-level directory?

This kind of information is exposed via git-rev-parse(1) already, see the
section "Options for Files".
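To make that concrete, here is a rough sketch of how a script has to collect
this information today. It is only an illustration, not a proposal for the
actual implementation: the JSON field names and the `repo_info` helper are
made up, and only the git-rev-parse(1) flags are existing options
(`--show-ref-format` is a fairly recent addition, so older Git versions will
report it as missing).

```python
import json
import subprocess
from typing import Callable, Dict, List, Optional

# Hypothetical field names mapped to existing git-rev-parse(1) flags.
QUERIES: Dict[str, List[str]] = {
    "object_format": ["--show-object-format"],
    "ref_format": ["--show-ref-format"],
    "git_dir": ["--git-dir"],
    "common_dir": ["--git-common-dir"],
    "toplevel": ["--show-toplevel"],
}


def run_rev_parse(args: List[str]) -> Optional[str]:
    """Run `git rev-parse <args>`; return trimmed stdout, or None on failure.

    Failures are expected in some setups, e.g. `--show-toplevel` in a bare
    repository or an unknown flag on an older Git.
    """
    try:
        result = subprocess.run(
            ["git", "rev-parse", *args],
            capture_output=True,
            text=True,
            check=True,
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        return None
    return result.stdout.strip()


def repo_info(
    runner: Callable[[List[str]], Optional[str]] = run_rev_parse,
) -> Dict[str, Optional[str]]:
    """Collect one value per field; `runner` is injectable for testing."""
    return {field: runner(flags) for field, flags in QUERIES.items()}


if __name__ == "__main__":
    # Unavailable values are serialized as JSON null.
    print(json.dumps(repo_info(), indent=2))
```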
But git-rev-parse(1) is not really a good match at all given that its
main intent is to parse revisions. Over time, though, it developed into a
grab-bag of unrelated functionality that we didn't really have a nice
home for elsewhere.

> ### Development plan
>
> Since this is a new command that is not directly related to any specific
> existent command, it will probably be placed in a new file inside the
> `builtin` directory.
>
> The functionality of this command can be divided into two categories:
>
> 1. **Data gathering**: retrieving data from different sources, calling
>    existent functions and reading data structures declared in other
>    files;
>
> 2. **Data serialization**: formatting the gathered data in a JSON
>    format. This represents two challenges: generating the JSON itself
>    and designing the schema for how the desired data will be presented.
>
> Since the exported data is already provided by other Git commands, it
> probably won't be difficult to implement this side of the
> functionality. The main task would be inspecting the existing codebase
> and find the functions and data structures that will feed our output.
>
> Designing the schema, however, requires special planning, as the
> flexibility of semi-structured data like JSON may lead to early
> bad decisions. A solution may emerge by analysing other software that
> export JSON as metadata.
>
> ### Schedule
>
> 1. **Now -- May 5th**: Requirements gathering
>    - Inspect codebases that uses Git as data sources;
>    - Contacting academic researchers on FLOSS;
>    - Contacting industry infrastructure professionals;
>
> 2. **May 6th -- June 1st**: Community bonding
>    - Getting in touch with the mentors;
>    - Present to the community a first proposal of the JSON schema;
>    - Receive feedback from the community about the schema;
>    - Present a first proposal on the command line interface;
>    - Receive feedback from the community about the command line
>      interface;
>
> 3. **June 2nd -- July 14th**: First coding round
>    - Write data structures that correspond to the presented JSON schema;
>    - Fill the data structures with data obtained from routines of the
>      existing codebase;
>
> 4. **July 15th -- August 25th**: Second coding round
>    - Implementing the command line interface option handlers;
>    - Write the JSON serializer.

I generally recommend that students take on smaller batches of work that
can be submitted individually. The way the schedule is structured now
means that you will end up with a single deliverable at the end of your
project. Structuring the project like that introduces a high risk that
you won't be able to land anything until the end of your project in case
there is a bigger discussion around parts of these patches.

Instead, it would make sense to identify smaller batches of work that are
self-contained enough to be submitted upstream. This ensures that you get
early feedback and that you can iterate on your design as early as
possible in the project.

Patrick