[GSOC] [Proposal v1] Machine-Readable Repository Information Query Tool

# Proposal for GSoC 2025 to Git
**Machine-Readable Repository Information Query Tool**

## Contact Details
* **Name**: K Jayatheerth
* **Email**: jayatheerthkulkarni2005@xxxxxxxxx
* **Blog**: [Blog](https://jayatheerthkulkarni.github.io/gsoc_blog/index.html)
* **GitHub**: [GitHub](https://github.com/jayatheerthkulkarni)

## **Synopsis**
This project aims to develop a dedicated Git command that interfaces
with Git’s internal APIs to produce structured JSON output,
particularly for repository metadata. By offering a clean,
machine-readable format, this tool will improve automation, scripting,
and integration with other developer tools.

## **Benefits to the Community**
### **1. Simplifies Automation and Scripting**
- Many Git commands output **human-readable text**, making automation
**error-prone** and **dependent on fragile parsing**.
- This project introduces **structured JSON output**, allowing scripts
and tools to consume repository metadata **directly and reliably**.
- No more **awkward text parsing**, `grep` hacks, or brittle `awk/sed`
pipelines—just **clean, structured data**.

### **2. Eliminates the Overuse of `git rev-parse`**
- `git rev-parse` is widely misused for extracting metadata, despite
being intended primarily for **parsing revisions**.
- Developers often **repurpose** it because there’s **no dedicated
alternative** for metadata queries.
- This project **corrects that gap** by introducing a **purpose-built
command** that is **cleaner, more intuitive, and extensible**.

### **3. Optimizes CI/CD Pipelines**
- CI/CD systems currently need **multiple Git commands** and
associated parsing logic to fetch basic metadata:

```bash
# Example: Gathering just a few common pieces of info
BRANCH=$(git rev-parse --abbrev-ref HEAD 2>/dev/null || echo "DETACHED")
COMMIT=$(git rev-parse HEAD)
REMOTE_URL=$(git remote get-url origin 2>/dev/null || echo "no-origin")
# ... often requiring more commands and error handling logic.
```
- The proposed command aims to **replace these multiple calls** with a
**single, efficient query** returning comprehensive, structured JSON
data.
- This **simplifies pipeline scripts**, reduces process overhead, and
makes CI/CD configurations **cleaner and more robust**.

## Deliverables

This project will introduce a new Git command, tentatively named `git
metadata`, to provide reliable, machine-readable repository
information.

The key deliverables for this GSoC project include:

1. **Core `git metadata` Command:**
* A new `builtin/metadata.c` command integrated into the Git source code.
* Implementation primarily in C, utilizing existing internal Git APIs
for retrieving repository information efficiently and accurately.

2. **Default JSON Output:**
* The command will output a structured JSON object by default.
* **Initial Core Fields:**
* `repository`: Path to `.git` directory, worktree root, `is_bare` status.
* `head`: Current commit SHA (full), current reference
(`refs/heads/main`, `refs/tags/v1.0`, or detached HEAD commit), short
symbolic name (`main`, `v1.0`, or `DETACHED`).
* `remotes`: A map of remote names to their fetch and push URLs.
* *(Stretch Goal):* Basic `is_dirty` flag based on a quick index/HEAD
check (not full worktree scan).

3. **Basic Output Control:**
* *(If time permits / Stretch Goal)* Implement simple flags to control
output, e.g.:
* `--remotes-only`: Output only the `remotes` section of the JSON.
* `--head-only`: Output only the `head` section.
* `--json-errors`: Ensure that errors encountered during execution
(e.g., not in a Git repository) are reported in a structured JSON
format.

4. **Extensible Design:**
* The internal structure and JSON schema will be designed with future
extensions in mind (e.g., adding submodule info, specific config
values, tags later).

5. **Comprehensive Documentation:**
* A clear man page (`git-metadata.txt`) explaining the command's
purpose, usage, options, and JSON output format.
* Comments within the code explaining implementation details.

6. **Robust Test Suite:**
* A new test script (`t/tXXXX-metadata.sh`) using Git's test framework.
* Tests covering various repository states: standard repo, bare repo,
detached HEAD, unborn branch, repo with no remotes, etc.
* Tests validating the JSON output structure and content.
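
To make the proposed default output more concrete, the sketch below assembles a document with the `repository`, `head`, and `remotes` sections listed above. It is only an illustration in standalone C. The struct names, the nested key names (`git_dir`, `worktree`, `oid`, `ref`, `short_name`, `fetch`, `push`), and the `print_metadata()` helper are assumptions made for this example, not a settled schema, and the real command would obtain these values through Git's internal APIs rather than from a caller-filled struct. String escaping is deliberately left out here to keep the shape of the output visible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified view of one configured remote. */
struct remote_info {
	const char *name;
	const char *fetch_url;
	const char *push_url; /* NULL when no separate push URL is configured */
};

/* Hypothetical container for the fields named in the deliverables. */
struct metadata {
	const char *git_dir;
	const char *worktree; /* NULL for a bare repository */
	int is_bare;
	const char *head_oid;
	const char *head_ref;
	const char *head_short;
	const struct remote_info *remotes;
	size_t nr_remotes;
};

/* Print a quoted string, or JSON null when the value is absent. */
static void print_str_or_null(FILE *out, const char *s)
{
	if (s)
		fprintf(out, "\"%s\"", s);
	else
		fputs("null", out);
}

/* Emit the whole document on one line. Values are written verbatim (no escaping). */
void print_metadata(FILE *out, const struct metadata *m)
{
	size_t i;

	assert(out && m);
	fprintf(out, "{\"repository\":{\"git_dir\":\"%s\",\"worktree\":", m->git_dir);
	print_str_or_null(out, m->worktree);
	fprintf(out, ",\"is_bare\":%s},", m->is_bare ? "true" : "false");
	fprintf(out, "\"head\":{\"oid\":\"%s\",\"ref\":\"%s\",\"short_name\":\"%s\"},",
		m->head_oid, m->head_ref, m->head_short);
	fputs("\"remotes\":{", out);
	for (i = 0; i < m->nr_remotes; i++) {
		const struct remote_info *r = &m->remotes[i];
		/* Fall back to the fetch URL when no push URL is set. */
		const char *push = r->push_url ? r->push_url : r->fetch_url;

		fprintf(out, "%s\"%s\":{\"fetch\":\"%s\",\"push\":\"%s\"}",
			i ? "," : "", r->name, r->fetch_url, push);
	}
	fputs("}}\n", out);
}
```

Calling `print_metadata(stdout, &m)` with a filled-in `struct metadata` writes the object to standard output. For a remote without a separate push URL, the `push` value repeats the fetch URL, which matches what `git remote -v` reports in that case.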

**Out of Scope for GSoC (Potential Future Work):**
* Complex status reporting (full `git status` equivalent, detailed
submodule status).
* Real-time monitoring (`--watch`).
* Comparing metadata between revisions (`--diff`).
* Alternative output formats (`--format=shell`).
* Querying arbitrary configuration values or extensive commit details
beyond HEAD.

## Technical Details

This section outlines the proposed technical approach for implementing
the core deliverables:

1. **Core `git metadata` Command & Default JSON Output:**
* **Entry Point:** Implement the command logic within a new
`builtin/metadata.c` file, defining the `cmd_metadata(...)` function
as the entry point, following Git's builtin command structure.
* **Repository Access:** The `cmd_metadata` function will operate on
the `struct repository*` provided by the command invocation
infrastructure.
* **Repository Info:**
* Retrieve the path to the `.git` directory using `repo->gitdir` (or
`get_git_dir()` if needed).
* Determine if the repository is bare using `is_bare_repository()`.
* **HEAD Info:**
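* Resolve the `HEAD` reference by calling `refs_resolve_ref_unsafe()`
on the repository's ref store (`get_main_ref_store(repo)`) with
`"HEAD"` and `RESOLVE_REF_READING`. The function returns the full
reference name (e.g., `"refs/heads/main"`) and fills in the commit OID
(`head_oid`) and the resolution flags (`head_flags`), which show
whether `HEAD` is a symbolic ref (`REF_ISSYMREF`) or detached.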
* Determine the conventional short symbolic name (e.g., `"main"`,
`"v1.0"`, or `"(HEAD detached at <sha>)"`) by investigating and
utilizing existing Git functions like `refs_shorten_unambiguous_ref()`
or similar logic found in commands like `git status` or `git branch`.
Using low-level string functions like `strchr` will be avoided for
robustness.
* **Remotes Info:**
* Utilize functions from `remote.h`/`remote.c` (e.g., `remote_get()`,
or `for_each_remote()` to iterate over configured remotes) to get the
list of remote names.
* For each remote, query its fetch and push URLs using Git's
configuration API (e.g., `git_config_get_string` for keys like
`remote.<name>.url` and `remote.<name>.pushurl`). Handle cases where
push URL is not explicitly set.
* **JSON Generation:**
* *(Primary Strategy):* Investigate integrating a minimal,
dependency-free, GPLv2-compatible C JSON library (e.g., cJSON, subject
to community approval) for robust JSON construction and escaping.
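* *(Fallback Strategy):* If a library is not feasible, manually
construct the JSON string using Git's `strbuf` API (`strbuf_addf`,
`strbuf_addch`, etc.), paying careful attention to correct JSON syntax
and proper escaping of string values. Git's existing `json-writer.h`
helpers (`jw_object_begin()`, `jw_object_string()`, and related
functions), which are built on `strbuf`, should be evaluated first.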

2. **Documentation:**
* Create `Documentation/git-metadata.txt` following the structure and
style of existing Git man pages (e.g., `git-rev-parse.txt`,
`git-branch.txt`).
* Clearly document the command's purpose, all options (including
stretch goals if implemented), and provide a detailed description of
the default JSON output schema with examples.

3. **Testing:**
* Create a new test script `t/tXXXX-metadata.sh` using Git's
shell-based test framework (`test-lib.sh`).
* Include test cases covering:
* Standard repositories.
* Bare repositories.
* Repositories with detached HEAD state.
* Repositories on an unborn branch.
* Repositories with no remotes, one remote, multiple remotes.
* Remotes with different fetch/push URL configurations.
* Validation of the JSON output structure and specific field values
using tools like `jq` or simple `grep` checks within the tests.
* Testing of error conditions and the `--json-errors` flag output (if
implemented).
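
The fallback JSON strategy described under **JSON Generation** above depends on escaping string values correctly, and the `--json-errors` flag needs a predictable error shape. The following standalone C sketch shows one way both could look. It uses a fixed-size buffer in place of Git's growable `strbuf` so that it compiles on its own. The helper names (`put`, `json_quote`, `json_error`) and the `{"error":{"message":...}}` layout are assumptions for illustration, not existing Git functions or an agreed format.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Append s to dst if it fits (cap includes the trailing NUL). */
static int put(char *dst, size_t cap, size_t *len, const char *s)
{
	size_t n = strlen(s);

	if (*len + n >= cap)
		return -1;
	memcpy(dst + *len, s, n + 1); /* copies the NUL as well */
	*len += n;
	return 0;
}

/*
 * Write the JSON-quoted form of src into dst.
 * Returns 0 on success, -1 if it does not fit (dst is then unspecified).
 * Bytes >= 0x80 pass through unchanged, so src is assumed to be valid UTF-8.
 */
int json_quote(char *dst, size_t cap, const char *src)
{
	size_t len = 0;
	const unsigned char *p;
	char tmp[8];

	assert(dst && src);
	if (put(dst, cap, &len, "\""))
		return -1;
	for (p = (const unsigned char *)src; *p; p++) {
		switch (*p) {
		case '"':  strcpy(tmp, "\\\""); break;
		case '\\': strcpy(tmp, "\\\\"); break;
		case '\n': strcpy(tmp, "\\n"); break;
		case '\r': strcpy(tmp, "\\r"); break;
		case '\t': strcpy(tmp, "\\t"); break;
		default:
			if (*p < 0x20) /* remaining control characters */
				snprintf(tmp, sizeof(tmp), "\\u%04x", (unsigned)*p);
			else
				snprintf(tmp, sizeof(tmp), "%c", *p);
		}
		if (put(dst, cap, &len, tmp))
			return -1;
	}
	return put(dst, cap, &len, "\"");
}

/* Format a structured error object (hypothetical layout). */
int json_error(char *dst, size_t cap, const char *message)
{
	char quoted[256];
	int n;

	if (json_quote(quoted, sizeof(quoted), message))
		return -1;
	n = snprintf(dst, cap, "{\"error\":{\"message\":%s}}", quoted);
	return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}
```

With these helpers, a path or URL containing quotes, backslashes, or control characters still yields valid JSON, and a failure such as running outside a repository can be reported as a single object that scripts can parse instead of free-form text on stderr.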

## Detailed Project Timeline


**Phase 0: Pre-Acceptance Preparation (April 9 - May 7, 2025)**

* **Focus:** Demonstrate continued interest and deepen understanding
while awaiting results.
* **Official GSoC Milestone:** April 8, 2025 - Proposal Deadline.
* **Activities:**
* **(April 9 - April 21):** Deep dive into Git's source code
structure, focusing specifically on areas identified in the proposal's
Technical Details:
* `builtin/` directory structure and command handling.
* `repository.h`, `refs.h`, `remote.h`, `config.c`, `strbuf.h`.
* How existing commands like `git status`, `git branch`, `git
rev-parse`, `git remote -v` access underlying data.
* **(April 22 - May 7):**
* Monitor the Git mailing list for discussions related to repository
information, command output formats, or JSON usage.
* Refine my understanding of Git's testing framework
(`t/test-lib.sh`), which I have not yet studied in depth. Run and read
existing tests relevant to refs, remotes, or configuration.
* Review Git's contribution guidelines (`SubmittingPatches`, coding
style) again, since most of my microproject work was on
documentation.
* Start additional microprojects or take an active part in discussions on other patches.

**Phase 1: Finalize the requirements (May 8 - May 26, 2025 Approx.)**

* **Focus:** Finalize plans with mentors, setup, deep dive into specifics.
* **Official GSoC Milestone:** May 8, 2025 - Accepted Projects Announced.
* **Activities:**
* **(Week 1: May 8 - May 12):**
* Discuss the project proposal in detail, clarifying scope,
priorities, and mentor expectations.
* Finalize the decision on the JSON generation strategy (library vs.
`strbuf`) based on mentor feedback and feasibility assessment.
* Confirm the initial target JSON schema.
* **(Week 2: May 13 - May 19):**
* Perform a deep dive into the *specific* functions identified for use
(e.g., `refs_resolve_ref_unsafe`, `refs_shorten_unambiguous_ref`,
remote access functions, config API, chosen JSON method).
* Start outlining the structure of `builtin/metadata.c`.
* **(Week 3: May 20 - May 26):**
* Begin writing the basic skeleton of `builtin/metadata.c` and the
initial test file `t/tXXXX-metadata.sh`.
* Post the first blog update summarizing the initial plan.

**Phase 2: Core Implementation & Setup (Coding Weeks 1-4: May 27 -
June 23, 2025 Approx.)**

* **Focus:** Implement the basic command structure and retrieve core
repository/HEAD information.
* **Activities:**
* **(Week 1: May 27 - June 2):** Implement `cmd_metadata` skeleton,
argument parsing (if any initially), repository struct access.
Implement retrieval of `.git` path and `is_bare` status. Integrate
chosen JSON generation approach (setup library or `strbuf` helpers).
* **(Week 2: June 3 - June 9):** Implement HEAD resolution (commit
SHA, full ref name). Implement logic for determining the short
symbolic name using appropriate Git functions. Integrate HEAD info
into JSON output.
* **(Week 3: June 10 - June 16):** Write initial test cases in
`t/tXXXX-metadata.sh` covering basic invocation, bare repos, and
detached HEAD states. Refine JSON output structure.
* **(Week 4: June 17 - June 23):** Prepare and submit the first set of
patches covering core repo/HEAD functionality to the mailing list.
Address initial feedback. Write blog post update.

**Phase 3: Adding Remotes & Refinement (Coding Weeks 5-8: June 24 -
July 21, 2025 Approx.)**

* **Focus:** Add remote information retrieval and expand testing
significantly. Aim for demonstrable core functionality by Midterm.
* **GSoC Milestone:** Midterm Evaluations.
* **Activities:**
* **(Week 5: June 24 - June 30):** Research and implement logic to
list remote names. Implement logic to query fetch/push URLs for each
remote using the config API.
* **(Week 6: July 1 - July 7):** Integrate remote information into the
JSON output structure. Handle edge cases (no remotes, missing push
URL).
* **(Week 7: July 8 - July 14):** Significantly expand the test suite:
add tests for various remote configurations, unborn branches. Refine
existing tests based on feedback. Start drafting the man page
(`Documentation/git-metadata.txt`).
* **(Week 8: July 15 - July 21):** Prepare and submit patches for
remote functionality. Ensure core command (`repo`, `head`, `remotes`
info) is stable and well-tested for Midterm Evaluation. Code cleanup
based on reviews. Write blog post update and prepare Midterm
Evaluation submission.

**Phase 4: Documentation, Polish & Stretch Goals (Coding Weeks 9-12:
July 22 - Aug 18, 2025 Approx.)**

* **Focus:** Finalize documentation, implement error handling, address
feedback, attempt stretch goals if feasible.
* **Activities:**
* **(Week 9: July 22 - July 28):** Complete the first draft of the man
page, detailing usage, JSON schema, and options. Implement the
`--json-errors` functionality for structured error reporting. Add
tests for error cases.
* **(Week 10: July 29 - Aug 4):** *Begin Stretch Goals (Conditional):*
If core work is stable and time permits, start implementing
`--head-only` / `--remotes-only` flags or the basic `is_dirty` check.
Add tests for any implemented stretch goals.
* **(Week 11: Aug 5 - Aug 11):** Thorough code cleanup, address all
outstanding review comments on submitted patches. Ensure documentation
is comprehensive and accurate. Final pass on test suite coverage.
* **(Week 12: Aug 12 - Aug 18):** Prepare and submit final patches
incorporating documentation, error handling, and any completed stretch
goals. Final code freeze for GSoC evaluation purposes. Write blog post
update summarizing final phase.

**Phase 5: Final Evaluation & Wrap-up (Aug 19 - Nov 19, 2025)**

* **Focus:** Final submissions, respond to late feedback, ensure
project completion.
* **GSoC Milestone:** Final Evaluations likely occur early in this period.
* **Official GSoC Milestone:** November 19, 2025 - Program End Date.
* **Activities:**
* **(Late Aug - Sept):** Finish any incomplete work and follow up on
the next set of tasks (stretch goals).
* **(Oct - Nov 19):** Monitor mailing list for patch status. Write
final GSoC project summary blog post. Continue engaging with the
community if interested in further contributions beyond GSoC.

## Past Communication and Microproject
* **Blog**: [Blog](https://jayatheerthkulkarni.github.io/gsoc_blog/index.html)
This blog documents my communication with the community and my
microproject experience in detail.
* First Introduction to the Git Mailing list: [first
Mail](https://lore.kernel.org/git/CA+rGoLc69R8qgbkYQiKoc2uweDwD10mxZXYFSY8xFs5eKSRVkA@xxxxxxxxxxxxxx/t/#u)
* First patch to the git mailing list: [First
Patch](https://lore.kernel.org/git/20250312081534.75536-1-jayatheerthkulkarni2005@xxxxxxxxx/t/#u)
* Most recent patch series and the feedback discussion around it:
[Main mail thread](https://lore.kernel.org/git/xmqqa59evffd.fsf@gitster.g/T/#t)

I've been maintaining the blog and will continue to record all of my
communication with the Git mailing list there.


Thank You,
Jayatheerth




