Lucas Seiki Oshiro <lucasseikioshiro@xxxxxxxxx> writes:

> Hello again!
>
> I'm sending this v3 which is basically the v2 with some polishing.
> You can skip the v2 and review this directly. So, the changes compared
> to v1 are:

Thanks, Lucas, for your proposal. I would recommend sending new versions
in the same thread; it makes them easier to review and makes it easier to
see which comments were left on previous versions.

That said, I just went through v1 and skipped v2 as you suggested.
Reading on.

>
> - Detailing the status of my patches;
> - Focusing on `rev-parse` features, listing all that Patrick suggested and
>   searching for a few more that may be useful;
> - Given it is more focused on rev-parse, the parts about surveying users
>   and projects for features were reduced but not entirely suppressed, as I'll
>   need to find __where__ and __why__ this new command will be important. But
>   now this will not be as important as I thought it would be;
> - Changed the expected output, which becomes much smaller compared to what I
>   thought it would be;
> - Making the name of the new command explicit;
> - Making it clear that the JSON-related functionality will be placed in a
>   separate file;
> - Detailing a little bit more some examples of functions that I expect to use
>   for retrieving data to populate the JSON that will be outputted;
> - Breaking the schedule into 6 steps, bringing part of the development of the
>   JSON serializer to the beginning.
>
> ---
>
> # Machine-Readable Repository Information Query Tool
>
> ## Contact info
>
> - Name: Lucas Seiki Oshiro
> - Timezone: GMT-3 (America/São Paulo)
> - IRC: lucasoshiro
> - Personal page: https://lucasoshiro.github.io/en/
> - GitHub: https://github.com/lucasoshiro
> - LinkedIn: https://www.linkedin.com/in/lucasseikioshiro/
>
> ## About me
>
> My name is Lucas Oshiro, I'm a developer and CS bachelor from São Paulo,
> Brazil. Currently I'm pursuing a master's degree in CS at University of São
> Paulo. My interest in Git dates from years ago and I even submitted a
> patch to its codebase in the past, though I couldn't complete it due to
> scheduling conflicts with my capstone project.
>
> Having experience in academia, industry and FLOSS, I highly value
> code quality, code legibility, well-maintained Git histories, unit tests
> and documentation.
>
> ### Previous experience with Git
>
> Before this year, I hadn't been directly involved with the Git community;
> however, I kept my interest in Git alive by:
>
> - Translating the "Git Internals" chapter of Pro Git to Brazilian
>   Portuguese: https://github.com/progit/progit2-pt-br/pull/81;
>
> - Writing some blog posts about Git, for example:
>   - one explaining how Git can be used as a debugging tool:
>     https://lucasoshiro.github.io/posts-en/2023-02-13-git-debug/;
>

This was a good read! Really nice!

>   - another explaining how Git merges submodules:
>     https://lucasoshiro.github.io/posts-en/2022-03-12-merge-submodule/;
>
> - Writing a compatible subset of Git in Haskell from scratch:
>   https://github.com/lucasoshiro/oshit;
>
> - Helping organize a Git Introductory Workshop at my University:
>   https://flusp.ime.usp.br/events/git-introductory-workshop/;
>
> - Presenting some lectures about Git at a company where I worked some
>   years ago, covering the Git internals (objects, references, packfile)
>   and debugging and archaeology related Git tools (blame, bisect,
>   pickaxe, ls-files, etc).
>
> ### Previous experience with C and open-source
>
> I also have experience with C and some C++. During my CS
> course, C was one of the primary languages that I used. I also
> worked with C/C++, for example, in:
>
> - Writing an AMQP message broker from scratch:
>   https://github.com/lucasoshiro/amqp_broker;
>
> - Contributing with simple patches to the IIO subsystem of the Linux
>   kernel: https://lucasoshiro.github.io/floss-en/2020-06-06-kernel_linux/;
>
> - Contributing to the Marlin firmware for 3D printers:
>   https://lucasoshiro.github.io/floss-en/2020-06-30-Marlin/;
>
> - Writing a module for the ns-3 network simulator, dealing with both C
>   and C++ codebases (currently under development, I plan to write a
>   paper and make the code available soon);
>
> During my CS course I was also a member of FLUSP
> (https://flusp.ime.usp.br), a group in my university focused on FLOSS
> contributions, and of Hardware Livre USP
> (https://hardwarelivreusp.org), another group that was focused on
> working with open-source hardware.
>
> As a master's student, I'm one of the Open Science Ambassadors of my
> University (https://cienciaaberta.usp.br/sobre-o-projeto/, in
> Portuguese), promoting the Open Science principles, which include
> open-source software, in the unit where I study.
>
> I also contributed to some other free/open-source software, which I list
> here: https://lucasoshiro.github.io/floss-en/
>
> ### Activity in the Git community in 2025
>
> Since I decided to submit a proposal for GSoC, I sent some patches
> to the Git codebase and git.github.io:
>
> - My microproject, replacing some `test -f` by `test_path_is_file`:
>   https://lore.kernel.org/git/20250208165731.78804-1-lucasseikioshiro@xxxxxxxxx/,
>   merged to master;
>
> - Adding a paragraph to the merge-strategies documentation describing how
>   Git merges submodules (based on the blog post that I mentioned
>   before):
>   https://lore.kernel.org/git/20250227014406.20527-1-lucasseikioshiro@xxxxxxxxx/,
>   merged to master;
>
> - A patchset adding a new `--subject-extra-prefix` flag for `git format-patch`,
>   allowing the user to quickly prepend tags like [GSoC], [Newbie] or [Outreachy]
>   to the beginning of the subject:
>   https://lore.kernel.org/git/20250303220029.10716-1-lucasseikioshiro@xxxxxxxxx/.
>   This patchset was rejected in favor of just using `--subject-prefix='GSoC
>   PATCH'` or similar;
>
> - Given the feedback on the previous rejected patchset, I opened a Pull
>   Request on git.github.io replacing the occurrences of `[GSoC][PATCH]`
>   by `[GSoC PATCH]`, merged to master;
>
> - Adding a new userdiff driver for INI files, initially targeted at
>   gitconfig files:
>   https://lore.kernel.org/git/20250331031309.94682-1-lucasseikioshiro@xxxxxxxxx/.
>   Currently it is still under review.
>
> Beyond contributions, I also helped people on the mailing list that
> needed assistance on Git features and documentation.
>
> ## Project Proposal
>
> Based on the information provided in
> https://git.github.io/SoC-2025-Ideas/, the goal of this project is to
> create a new Git command for querying information from a repository and
> returning it in a semi-structured data format, as JSON output.
>
> A first idea on how this command would be named is `git metadata`.

One thing to keep in mind is that we already have:
- git status
- git describe

Both of them are used to provide a summary of the repository in some
sense. How do we differentiate the new command from these two?

Rhetorical:
- Does 'git metadata' differentiate itself enough to imply what it does?
- Does it convey that it should be used to retrieve repository
  information?
- Should we consider 'git repo-info', 'git info', 'git context',
  'git repo-query'?

It would be nice to add some thought here, perhaps justifying the chosen
name.

>
> In the scope of this project, the JSON output will only include data
> that can currently be retrieved through existing Git commands. The main
> idea is to centralize data from `git rev-parse`, which currently is
> overloaded with features that don't fit its main purpose.
>

We know that 'git rev-parse' outputs in a human-readable format; would
that be the default here too?

> Some of the data that we expect to retrieve and centralize are:
>

There are a lot more options under the 'Options for Files' section of
the 'git rev-parse' manpage; it would be nice to highlight that this is
mostly what we're looking at.

> - The hashing algorithm (i.e. `sha1` or `sha256`), which currently can
>   be retrieved using `git rev-parse --show-object-format`;
>
> - The Git directory of the repository, currently retrieved by running
>   `git rev-parse --git-dir`;
>
> - The common Git directory, currently retrieved by running
>   `git rev-parse --git-common-dir`;
>
> - The top level directory of the repository, currently retrieved by using
>   `git rev-parse --show-toplevel`;
>
> - The reference database format (i.e. currently, `files` or `reftable`),
>   currently retrieved by running `git rev-parse --show-ref-format`;
>
> - The absolute path of the superproject, currently retrieved by running
>   `git rev-parse --show-superproject-working-tree`;
>
> - Whether this is a bare repository, currently retrieved by running
>   `git rev-parse --is-bare-repository`;
>
> - Whether this is a shallow repository, currently retrieved by running
>   `git rev-parse --is-shallow-repository`.
>
> Given that the information that we want to compile is currently
> accessible with different sets of flags, the user that wants to read
> it needs to have an advanced knowledge of Git. Once the repository
> details are consolidated in a single command, the user will be
> able to quickly retrieve what they want without navigating a complex
> combination of commands and flags.
>
> A side effect is decreasing the responsibility of `rev-parse`, like
> `git switch` and `git restore` did for `git checkout`.
>
> ### Use cases
>
> Some use cases that will benefit from `git metadata` are:
>
> - CLI tools that display formatted information about a Git repository,
>   for example, OneFetch (https://github.com/o2sh/onefetch);
>
> - Text editors, IDEs and plugins that have front-ends for Git, such as
>   Magit (https://magit.vc) or GitLens
>   (https://www.gitkraken.com/gitlens);
>
> - FLOSS repository tracking software, for example,
>   kworkflow (https://github.com/kworkflow),
>   ctracker (https://github.com/quic/contribution-tracker);
>
> - Any other tool that integrates with Git and currently relies on
>   `rev-parse` to get information about the repository.
>
> ### Planned features
>
> `git metadata` consists of one big feature: produce the JSON with
> repository metadata. At first, this should be populated with the data
> that currently can only be retrieved through `git rev-parse`, like the
> ones listed before.
>
> Other data may be added depending on the demands of the Git community
> and the user base.
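
To make the "one big feature" a bit more concrete, here is a rough,
standalone sketch of a flat, string-only JSON object being serialized.
It is not Git code: `struct kv`, `struct buf`, `buf_add`,
`buf_add_json_string` and `write_flat_object` are all made-up names,
the fixed-size buffer is only there to keep the example short, and the
values would of course come from the repository rather than from the
caller.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical key/value pair; not a structure from the Git codebase. */
struct kv {
	const char *key;
	const char *value;
};

/* Hypothetical fixed-size output buffer, to keep the sketch self-contained. */
struct buf {
	char data[1024];
	size_t len;
};

/* Append a NUL-terminated string, truncating instead of overflowing. */
static void buf_add(struct buf *b, const char *s)
{
	size_t n = strlen(s);

	assert(b->len < sizeof(b->data));
	if (b->len + n >= sizeof(b->data))
		n = sizeof(b->data) - 1 - b->len;
	memcpy(b->data + b->len, s, n);
	b->len += n;
	b->data[b->len] = '\0';
}

/* Append `s` as a quoted JSON string, escaping '"', '\\' and control bytes. */
static void buf_add_json_string(struct buf *b, const char *s)
{
	char tmp[8];

	buf_add(b, "\"");
	for (; *s; s++) {
		unsigned char c = (unsigned char)*s;

		if (c == '"' || c == '\\') {
			tmp[0] = '\\';
			tmp[1] = (char)c;
			tmp[2] = '\0';
		} else if (c < 0x20) {
			snprintf(tmp, sizeof(tmp), "\\u%04x", (unsigned)c);
		} else {
			tmp[0] = (char)c;
			tmp[1] = '\0';
		}
		buf_add(b, tmp);
	}
	buf_add(b, "\"");
}

/* Serialize `nr` pairs as one top-level JSON object with string values. */
static void write_flat_object(struct buf *b, const struct kv *items, size_t nr)
{
	size_t i;

	buf_add(b, "{");
	for (i = 0; i < nr; i++) {
		if (i)
			buf_add(b, ",");
		buf_add_json_string(b, items[i].key);
		buf_add(b, ":");
		buf_add_json_string(b, items[i].value);
	}
	buf_add(b, "}");
}
```

Calling `write_flat_object()` with the two pairs
`{ "object-format", "sha1" }` and `{ "ref-format", "files" }` on an
empty buffer leaves `{"object-format":"sha1","ref-format":"files"}` in
`data`. Booleans and nested objects would need more than this, which is
part of why the schema deserves discussion early.
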
>
> By now, it's not possible to decide precisely how this command would
> work without it being more discussed. But a first draft on how it would
> be invoked and the output that it will produce is:
>
> ~~~
> $ git metadata
>
> {
>   "object-format": "sha1",
>   "git-dir": ".git",
>   "common-dir": ".",
>   "toplevel": "/home/user/my_repo",
>   "ref-format": "files",
>   "superproject-working-tree": "/home/user/my_super_repo",
>   "bare-repository": true,
>   "shallow-repository": false
> }

It would be nice to add subsections maybe:
'{"refs": {...}, ..., "objects": {...}}'

> ~~~
>
> This first draft will be sent to the mailing list in order to get
> feedback from the Git community.
>
> ### Development plan
>
> Since this is a new command that is not directly related to any specific
> existing command, its main code will probably be placed in a new file
> `builtin/metadata.c`.
>
> Given that this project will give JSON-related functionality to Git, a
> new `json.c` file in the top level directory of the codebase will be
> created and it will be available for being used by `git metadata` and
> any other command that wants to reuse the JSON features introduced here.
>
> The functionality of `git metadata` can be divided into two categories:
>
> 1. **Data gathering**: retrieving data from different sources, calling
>    existing functions and reading data structures declared in other
>    files;
>
> 2. **Data serialization**: formatting the gathered data in a JSON
>    format. This represents two challenges: generating the JSON itself
>    and designing the schema for how the desired data will be presented.
>
> Since the exported data is already provided by other Git commands, it
> won't be difficult to implement the data gathering side of the
> functionality. The main task for gathering data will be to inspect the
> existing codebase and find the functions and data structures that will
> feed our output. For example, the already mentioned data:
>
> - Hashing algorithm: `git rev-parse` reads from the field
>   `the_repository->hash_algo->name` (aka `the_hash_algo->name`);
>
> - The Git directory, the common Git directory and top level directory:
>   `git rev-parse` formats those paths using `print_path`, based on the
>   `prefix` parameter and fields such as `commondir` from
>   `the_repository`;
>
> - Reference database format: `git rev-parse` retrieves it from
>   `ref_storage_format_to_name`;
>
> - Absolute path of the superproject: `git rev-parse` retrieves it from
>   the function `get_superproject_working_tree`;
>
> - Whether the repository is bare: `git rev-parse` retrieves it from the
>   function `is_bare_repository`;
> - Whether the repository is shallow: `git rev-parse` retrieves it from
>   the function `is_repository_shallow`.
>
> Designing the schema, however, requires special planning, as the
> flexibility of semi-structured data like JSON may lead to early
> bad decisions. A solution may emerge by analysing other software that
> exports JSON as metadata.
>
> ### Schedule
>
> 1. **Now -- May 5th**: Requirements gathering
>    - Present to the Git community the proposal, asking which
>      features it needs;

Nice, an RFC before adding the command would go a long way!

>    - Inspect codebases that use Git (especially `rev-parse`) as data
>      source;
>    - Studying more deeply the codebase, especially the `rev-parse` source
>      code;
>
> 2. **May 6th -- June 1st**: Community bonding
>    - Getting in touch with the mentors;
>    - Present to the community a first proposal of the JSON schema;
>    - Receive feedback from the community about the schema;
>    - Present a first proposal on the command line interface;
>    - Receive feedback from the community about the command line
>      interface;
>
> 3. **June 2nd -- June 30th**: First coding round
>    - Decide how the CLI should behave, that is, what should be exported
>      by default, what should be outputted only by using flags and what
>      should not be outputted by using disabling flags;
>    - Write a minimal JSON serializer, focusing on exporting only the
>      top-level object with only string values;
>    - Define a simple data structure that holds only one field of the
>      data that we want to export;
>    - Introduce a first version of the command, outputting the first data
>      structure using the first JSON serializer;
>    - Write test cases in different scenarios comparing the output of the
>      new command with the outputs of the existing commands;
>    - Write a minimal documentation file for the new command, making it
>      explicit that the command is experimental and it's still under
>      development;
>
> 4. **July 1st -- July 14th**: Second coding round
>    - Fill the data structure with all other string fields that should be
>      exported;
>    - Add flags for filtering the data output;
>    - Write tests for each one of the new fields and flags;
>    - Improve the documentation file, explaining what the command does,
>      its output format and what the flags that were implemented so far
>      do;
>
> 5. **July 15th -- August 15th**: Third coding round
>    - Improve the JSON serializer, allowing other data types: array,
>      number, boolean, null and nested objects;
>    - Fill the data structure with the remaining non-string fields;
>    - Write flags, tests and documentation for the types that were
>      implemented in this round;
>
> 6. **August 16th -- August 25th**: Fourth coding round
>    - Polish the documentation, providing examples on how to use it and
>      notes on why this command was created;
>    - Finish remaining work from the previous rounds that still needs to
>      be concluded;
>
> ### Availability
>
> 2025 is my last year in my master's degree. Currently, I'm not attending
> any classes and I am more focused on developing the software of my
> research, performing experiments and writing scientific articles and my
> thesis. Since my advisor is aware that I'm proposing a GSoC project, it
> will be possible to work on Git while working on my master's tasks.
>

One thing I found missing is the project size and what you think about it.

Thanks
- Karthik