[GSoC] Project Proposal: Machine-Readable Repository Information Query Tool


 



Hi!

As you may have noticed from my interactions here, I'm going to submit a
proposal for GSoC 2025!

I'm interested in the project idea currently entitled "Project Proposal:
Machine-Readable Repository Information Query Tool". My main motivation
for choosing this idea is that I think it will be useful for
infrastructure teams and FLOSS researchers.

I'm sending the first version of my proposal here. I'd be grateful for
any feedback on it! In this proposal I introduce myself again and
present the possible use cases of this feature, a first idea of how it
would work, and an activity schedule.

Thanks!

---


# Machine-Readable Repository Information Query Tool

## Contact info

- Name: Lucas Seiki Oshiro
- Timezone: GMT-3
- IRC:
- GitHub: https://github.com/lucasoshiro
- LinkedIn: https://www.linkedin.com/in/lucasseikioshiro/

## About me

My name is Lucas Oshiro. I'm a developer with a bachelor's degree in CS
from São Paulo, Brazil. I'm currently pursuing a master's degree in CS
at the University of São Paulo. My interest in Git dates back years, and
I even submitted a patch to its codebase in the past, though I couldn't
complete it due to scheduling conflicts with my capstone project.

Having experience in academia, industry and FLOSS, I highly value
code quality, code readability, well-maintained Git histories, unit tests
and documentation.

### Previous experience with Git

Before this year, I hadn't been directly involved with the Git community.
However, I kept my interest in Git alive by:

- Translating the "Git Internals" chapter of Pro Git to Brazilian
  Portuguese: https://github.com/progit/progit2-pt-br/pull/81;

- Writing some blog posts about Git, for example:
  - one explaining how Git can be used as a debugging tool:
    https://lucasoshiro.github.io/posts-en/2023-02-13-git-debug/;

  - another explaining how Git merges submodules:
    https://lucasoshiro.github.io/posts-en/2022-03-12-merge-submodule/;

- Writing a compatible subset of Git in Haskell from scratch:
  https://github.com/lucasoshiro/oshit;

- Helping organize a Git Introductory Workshop at my University:
  https://flusp.ime.usp.br/events/git-introductory-workshop/;

- Presenting some lectures about Git at a company where I worked some
  years ago, covering Git internals (objects, references, packfiles)
  and Git tools for debugging and archaeology (blame, bisect,
  pickaxe, ls-files, etc).

### Previous experience with C and open-source

I also have experience with C and some C++. During my CS
course, C was one of the primary languages that I used. I also
worked with C/C++, for example, in:

- Writing an AMQP message broker from scratch: 
  https://github.com/lucasoshiro/amqp_broker;

- Contributing simple patches to the IIO subsystem of the Linux
  kernel: https://lucasoshiro.github.io/floss-en/2020-06-06-kernel_linux/;

- Contributing to the Marlin firmware for 3D printers:
  https://lucasoshiro.github.io/floss-en/2020-06-30-Marlin/;

- Writing a module for the ns-3 network simulator, dealing with both C
  and C++ codebases (currently under development, I plan to write a
  paper and make the code available soon).

During my CS course I was also a member of FLUSP
(https://flusp.ime.usp.br), a group at my university focused on FLOSS
contributions, and of Hardware Livre USP
(https://hardwarelivreusp.org), another group that was focused on
working with open-source hardware.

As a master's student, I'm one of the Open Science Ambassadors of my
University (https://cienciaaberta.usp.br/sobre-o-projeto/, in
Portuguese), promoting the Open Science principles, which include
open-source software, in the unit where I study.

I also contributed to some other free/open-source software, which I list
here: https://lucasoshiro.github.io/floss-en/

### Activity in the Git community in 2025

Since I decided to submit a proposal for GSoC, I have sent some patches
to the Git codebase and git.github.io:

- My microproject, replacing some `test -f` with `test_path_is_file`:
  https://lore.kernel.org/git/20250208165731.78804-1-lucasseikioshiro@xxxxxxxxx/;

- Adding a paragraph to the merge-strategies documentation describing how
  Git merges submodules (based on the blog post that I mentioned
  before):
  https://lore.kernel.org/git/20250227014406.20527-1-lucasseikioshiro@xxxxxxxxx/;
  
- A patchset adding a new `--subject-extra-prefix` flag for `git
  format-patch`, allowing the user to quickly prepend tags like [GSoC],
  [Newbie] or [Outreachy] to the beginning of the subject. This patchset
  was rejected in favor of just using `--subject-prefix='GSoC PATCH'` or
  similar. It can be seen here:
  https://lore.kernel.org/git/20250303220029.10716-1-lucasseikioshiro@xxxxxxxxx/;

- Given the feedback on that rejected patchset, I opened a Pull
  Request on git.github.io replacing the occurrences of `[GSoC][PATCH]`
  with `[GSoC PATCH]`;
  
- Adding a new userdiff driver for INI files, initially targeted at
  gitconfig files. It is currently still under review:
  https://lore.kernel.org/git/20250331031309.94682-1-lucasseikioshiro@xxxxxxxxx/.

Beyond contributions, I also helped people on the mailing list who
needed assistance with the Git documentation.

## Project Proposal

Based on the information provided in
https://git.github.io/SoC-2025-Ideas/, the goal of this project is to
create a new Git command for querying information from a repository and
returning it in a semi-structured data format such as JSON.

In the scope of this project, the JSON output will only include data
that can currently be retrieved through existing Git commands, for
example:

- `git branch`: information about branches, such as the commit that each
  branch currently references and their upstreams;

- `git tag`: information about the tags, such as the author or commit
  date and the messages they hold (in the case of annotated tags);

- `git remote`: the URL of each remote;

- `git log`: statistics about the commit history, such as the
  distribution of commits over time and by author, and the distribution
  of lines changed by each author;

- `git submodule`: information about the submodules, mainly the commits
  that they are referencing and their remote URLs;

- `git rev-parse`: the current branch name, the current commit, the path
  of the repository's top-level directory, whether the repository is
  bare, and whether the repository is under bisection.

Given that the information that we want to compile is currently
accessible only through different commands with different sets of flags,
a user who wants to read it needs advanced knowledge of Git. Once the
repository details are consolidated in a single command, users will be
able to quickly retrieve what they need without navigating a complex
combination of commands and flags.
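
To illustrate the point, here is a minimal Python sketch of what a script
has to do today to collect only branch, remote and bare-repository data.
It assumes `git` is available on the PATH. The helper names (`run_git`,
`parse_branches`, `parse_remotes`, `collect`) and the output keys are
hypothetical and chosen only for this illustration; they are not part of
Git or of the proposed command.

```python
import subprocess

# Format string for `git for-each-ref`; fields are separated by a tab.
BRANCH_FORMAT = "%(refname:short)\t%(objectname)\t%(upstream:short)"


def run_git(*args, cwd=".", check=True):
    """Run a git command and return its standard output as text."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, check=check, capture_output=True, text=True
    )
    return result.stdout


def parse_branches(output):
    """Parse `git for-each-ref --format=BRANCH_FORMAT refs/heads` output."""
    branches = []
    for line in output.splitlines():
        name, commit_id, upstream = line.split("\t")
        # An empty upstream field means the branch tracks nothing.
        branches.append(
            {"name": name, "commit_id": commit_id, "remote": upstream or None}
        )
    return branches


def parse_remotes(output):
    """Parse `git config --get-regexp '^remote\\..*\\.url$'` output."""
    remotes = []
    for line in output.splitlines():
        key, url = line.split(" ", 1)
        # The key looks like "remote.<name>.url"; keep only <name>.
        name = key[len("remote."):-len(".url")]
        remotes.append({"name": name, "url": url})
    return remotes


def collect(cwd="."):
    """Combine three unrelated git invocations into one dictionary."""
    return {
        "branches": parse_branches(
            run_git("for-each-ref", f"--format={BRANCH_FORMAT}", "refs/heads", cwd=cwd)
        ),
        # `git config --get-regexp` exits with 1 when nothing matches,
        # so a non-zero status is tolerated here.
        "remotes": parse_remotes(
            run_git("config", "--get-regexp", r"^remote\..*\.url$", cwd=cwd, check=False)
        ),
        "bare": run_git("rev-parse", "--is-bare-repository", cwd=cwd).strip() == "true",
    }
```

Calling `collect()` inside a repository returns a dictionary that can be
passed to `json.dumps`. Each piece of data needs its own command, its own
flags and its own parser, which is the burden a single command would
remove.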

### Use cases

Some use cases that would benefit from this feature are:

- CLI tools that display formatted information about a Git repository,
  for example, OneFetch (https://github.com/o2sh/onefetch);

- Text editors, IDEs and plugins that have front-ends for Git, such as
  Magit (https://magit.vc) or GitLens (https://www.gitkraken.com/gitlens);

- FLOSS repository tracking software, for example,
  kworkflow (https://github.com/kworkflow),
  ctracker (https://github.com/quic/contribution-tracker);

- Academic researchers on FLOSS projects that need statistics on the
  repositories that they are querying;

- Continuous integration workflows that perform checks on the
  repository before allowing a branch to be merged into another or
  before a deploy;

- Code quality tools that will be able to inspect the health of the
  commit history.

### Planned features

Since the features haven't been defined yet, they will need to be
planned after surveying people and projects that could potentially use
this command:

- Searching code hosting platforms (e.g. GitHub, GitLab) for open-source
  software that retrieves data from Git, and checking what it does with
  that data;
  
- Contacting people in academia who use Git repositories as data
  sources for their research, to find out what valuable information
  this command can provide them;
  
- Contacting people from industry, especially in infrastructure teams,
  to understand the challenges they face when retrieving data from Git.
  
Given that I have worked in an infrastructure team and that I have
colleagues and professors at the university who currently research
FLOSS software and communities, I have contacts who can provide input
on what should be considered when developing this new command.

For now, it's not possible to decide exactly how this command would work,
but a first draft is this (supposing that `metadata` is the name of the
command and `--submodule` is a flag that enables the submodule metadata):

~~~
$ git metadata --submodule

{
  "symbolic_refs": {
    "HEAD": "main"
  },
  "branches": [
    {
      "name": "main",
      "commit_id": "ac72c22f3c8a9280c81171ccc6cedff3171344cf",
      "remote": "origin/main"
    },
    {
      "name": "feature",
      "commit_id": "1e373e02767337bd6b996da6598eed822a805878",
      "remote": "fork/feature"
    }
  ],
  "tags": [
    {
      "name": "v1.0",
      "message": "First version",
      "author_timestamp": "1743554265",
      "committer_timestamp": "1743554265"
    }
  ],
  "remotes": [
    {
      "name": "origin",
      "url": "https://example.com/foo"
    },
    {
      "name": "fork",
      "url": "user@xxxxxxxxxxx/foo"
    }
  ],
  "submodules": [
    {
      "path": "my_dir/my_submodule_dir",
      "url": "https://example.com/bar",
      "commit_id": "94436069f106c0014897b1c93e8fc3e49c8fc156"
    }
  ]
}
~~~
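
As a rough idea of how a tool could consume this draft output, the Python
sketch below parses a trimmed copy of it and answers two simple questions.
The schema is only the first draft shown above and may change; the function
names `head_commit` and `branches_tracking` are hypothetical helpers written
for this example.

```python
import json

# Trimmed copy of the draft output above; the schema is only a first idea.
DRAFT_OUTPUT = """
{
  "symbolic_refs": {"HEAD": "main"},
  "branches": [
    {"name": "main",
     "commit_id": "ac72c22f3c8a9280c81171ccc6cedff3171344cf",
     "remote": "origin/main"},
    {"name": "feature",
     "commit_id": "1e373e02767337bd6b996da6598eed822a805878",
     "remote": "fork/feature"}
  ]
}
"""


def head_commit(metadata):
    """Return the commit id HEAD points to, or None if it is not a known branch."""
    head = metadata["symbolic_refs"].get("HEAD")
    for branch in metadata["branches"]:
        if branch["name"] == head:
            return branch["commit_id"]
    return None


def branches_tracking(metadata, remote_name):
    """List the local branches whose upstream lives on the given remote."""
    prefix = remote_name + "/"
    return [
        branch["name"]
        for branch in metadata["branches"]
        if (branch.get("remote") or "").startswith(prefix)
    ]


metadata = json.loads(DRAFT_OUTPUT)
print(head_commit(metadata))
print(branches_tracking(metadata, "fork"))
```

With a single structured document, the consumer needs only a JSON parser
and a few dictionary lookups, and no knowledge of Git flags or output
formats.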

### Development plan

Since this is a new command that is not directly related to any specific
existing command, it will probably be placed in a new file inside the
`builtin` directory.

The functionality of this command can be divided into two categories:

1. **Data gathering**: retrieving data from different sources, calling
   existing functions and reading data structures declared in other
   files;

2. **Data serialization**: formatting the gathered data in a JSON
   format. This represents two challenges: generating the JSON itself
   and designing the schema for how the desired data will be presented.
   
Since the exported data is already provided by other Git commands, the
data gathering side probably won't be difficult to implement. The main
task would be inspecting the existing codebase and finding the functions
and data structures that will feed our output.

Designing the schema, however, requires special planning, as the
flexibility of semi-structured data like JSON may lead to bad decisions
early on that are hard to revert later. A solution may emerge from
analysing other software that exports metadata as JSON.

### Schedule

1. **Now -- May 5th**: Requirements gathering
   - Inspecting codebases that use Git as a data source;
   - Contacting academic researchers on FLOSS;
   - Contacting industry infrastructure professionals;

2. **May 6th -- June 1st**: Community bonding
   - Getting in touch with the mentors;
   - Present to the community a first proposal of the JSON schema;
   - Receive feedback from the community about the schema;
   - Present a first proposal on the command line interface;
   - Receive feedback from the community about the command line
     interface;

3. **June 2nd -- July 14th**: First coding round
   - Write data structures that correspond to the presented JSON schema;
   - Fill the data structures with data obtained from routines of the
     existing codebase;

4. **July 15th -- August 25th**: Second coding round
   - Implement the command line interface option handlers;
   - Write the JSON serializer.

### Availability

2025 is my last year in my master's degree. Currently, I'm not attending
any classes and I am more focused on developing the software of my
research, performing experiments and writing scientific articles and my
thesis. Since my advisor is aware that I'm proposing a GSoC project, it
will be possible to work on Git while working on my master's tasks.





