# Proposal for GSoC 2025 to Git

**Machine-Readable Repository Information Query Tool**

## Contact Details

* **Name**: K Jayatheerth
* **Email**: jayatheerthkulkarni2005@xxxxxxxxx
* **Blog**: [Blog](https://jayatheerthkulkarni.github.io/gsoc_blog/index.html)
* **GitHub**: [GitHub](https://github.com/jayatheerthkulkarni)

## **Synopsis**

This project aims to develop a dedicated Git command that interfaces with Git’s internal APIs to produce structured JSON output, particularly for repository metadata. By offering a clean, machine-readable format, this tool will improve automation, scripting, and integration with other developer tools.

## **Benefits to the Community**

### **1. Simplifies Automation and Scripting**

- Many Git commands output **human-readable text**, making automation **error-prone** and **dependent on fragile parsing**.
- This project introduces **structured JSON output**, allowing scripts and tools to consume repository metadata **directly and reliably**.
- No more **awkward text parsing**, `grep` hacks, or brittle `awk/sed` pipelines: just **clean, structured data**.

### **2. Eliminates the Overuse of `git rev-parse`**

- `git rev-parse` is widely misused for extracting metadata, despite being intended primarily for **parsing revisions**.
- Developers often **repurpose** it because there’s **no dedicated alternative** for metadata queries.
- This project **closes that gap** by introducing a **purpose-built command** that is **cleaner, more intuitive, and extensible**.

### **3. Optimizes CI/CD Pipelines**

- CI/CD systems currently need **multiple Git commands** and associated parsing logic to fetch basic metadata:

```bash
# Example: Gathering just a few common pieces of info
# (`git rev-parse --abbrev-ref HEAD` prints "HEAD" and succeeds on a detached
# HEAD, so `git symbolic-ref` is used here to detect that case.)
BRANCH=$(git symbolic-ref --short -q HEAD || echo "DETACHED")
COMMIT=$(git rev-parse HEAD)
REMOTE_URL=$(git remote get-url origin 2>/dev/null || echo "no-origin")
# ... often requiring more commands and error handling logic.
```

- The proposed command aims to **replace these multiple calls** with a **single, efficient query** returning comprehensive, structured JSON data.
- This **simplifies pipeline scripts**, reduces process overhead, and makes CI/CD configurations **cleaner and more robust**.

## Deliverables

This project will introduce a new Git command, tentatively named `git metadata`, to provide reliable, machine-readable repository information. The key deliverables for this GSoC project include:

1. **Core `git metadata` Command:**
    * A new `builtin/metadata.c` command integrated into the Git source code.
    * Implementation primarily in C, utilizing existing internal Git APIs for retrieving repository information efficiently and accurately.
2. **Default JSON Output:**
    * The command will output a structured JSON object by default.
    * **Initial Core Fields:**
        * `repository`: Path to `.git` directory, worktree root, `is_bare` status.
        * `head`: Current commit SHA (full), current reference (`refs/heads/main`, `refs/tags/v1.0`, or detached HEAD commit), short symbolic name (`main`, `v1.0`, or `DETACHED`).
        * `remotes`: A map of remote names to their fetch and push URLs.
        * *(Stretch Goal):* Basic `is_dirty` flag based on a quick index/HEAD check (not full worktree scan).
3. **Basic Output Control:**
    * *(If time permits / Stretch Goal)* Implement simple flags to control output, e.g.:
        * `--remotes-only`: Output only the `remotes` section of the JSON.
        * `--head-only`: Output only the `head` section.
    * `--json-errors`: Ensure that errors encountered during execution (e.g., not in a Git repository) are reported in a structured JSON format.
4. **Extensible Design:**
    * The internal structure and JSON schema will be designed with future extensions in mind (e.g., adding submodule info, specific config values, tags later).
5. **Comprehensive Documentation:**
    * A clear man page (`git-metadata.txt`) explaining the command's purpose, usage, options, and JSON output format.
    * Comments within the code explaining implementation details.
6. **Robust Test Suite:**
    * A new test script (`t/tXXXX-metadata.sh`) using Git's test framework.
    * Tests covering various repository states: standard repo, bare repo, detached HEAD, unborn branch, repo with no remotes, etc.
    * Tests validating the JSON output structure and content.

**Out of Scope for GSoC (Potential Future Work):**

* Complex status reporting (full `git status` equivalent, detailed submodule status).
* Real-time monitoring (`--watch`).
* Comparing metadata between revisions (`--diff`).
* Alternative output formats (`--format=shell`).
* Querying arbitrary configuration values or extensive commit details beyond HEAD.

## Technical Details

This section outlines the proposed technical approach for implementing the core deliverables:

1. **Core `git metadata` Command & Default JSON Output:**
    * **Entry Point:** Implement the command logic within a new `builtin/metadata.c` file, defining the `cmd_metadata(...)` function as the entry point, following Git's builtin command structure.
    * **Repository Access:** The `cmd_metadata` function will operate on the `struct repository*` provided by the command invocation infrastructure.
    * **Repository Info:**
        * Retrieve the path to the `.git` directory using `repo->gitdir` (or `get_git_dir()` if needed).
        * Determine if the repository is bare using `repo->is_bare`.
    * **HEAD Info:**
        * Resolve the `HEAD` reference using `refs_resolve_ref_unsafe()` on the repository's ref store, passing `"HEAD"` and `RESOLVE_REF_READING`. The call returns the full reference name (`head_ref_name`, e.g., `"refs/heads/main"`) and fills in the full commit OID (`head_oid`) and the flags (`head_flags`) through its output parameters.
        * Determine the conventional short symbolic name (e.g., `"main"`, `"v1.0"`, or `"(HEAD detached at <sha>)"`) by investigating and utilizing existing Git functions like `refs_shorten_unambiguous_ref()` or similar logic found in commands like `git status` or `git branch`. Hand-rolled parsing with low-level string functions like `strchr` will be avoided for robustness.
    * **Remotes Info:**
        * Utilize functions from `remote.h`/`remote.c` (e.g., `remote_get`, iterating through configured remotes) to get the list of remote names.
        * For each remote, query its fetch and push URLs using Git's configuration API (e.g., `git_config_get_string` for keys like `remote.<name>.url` and `remote.<name>.pushurl`). Handle cases where the push URL is not explicitly set.
    * **JSON Generation:**
        * *(Primary Strategy):* Investigate integrating a minimal, dependency-free, GPLv2-compatible C JSON library (e.g., cJSON, subject to community approval) for robust JSON construction and escaping.
        * *(Fallback Strategy):* If a library is not feasible, manually construct the JSON string using Git's `strbuf` API (`strbuf_addf`, `strbuf_addch`, `strbuf_add_json_string`, etc.), paying careful attention to correct JSON syntax and proper escaping of string values.
2. **Documentation:**
    * Create `Documentation/git-metadata.txt` following the structure and style of existing Git man pages (e.g., `git-rev-parse.txt`, `git-branch.txt`).
    * Clearly document the command's purpose, all options (including stretch goals if implemented), and provide a detailed description of the default JSON output schema with examples.
3. **Testing:**
    * Create a new test script `t/tXXXX-metadata.sh` using Git's shell-based test framework (`test-lib.sh`).
    * Include test cases covering:
        * Standard repositories.
        * Bare repositories.
        * Repositories with detached HEAD state.
        * Repositories on an unborn branch.
        * Repositories with no remotes, one remote, multiple remotes.
        * Remotes with different fetch/push URL configurations.
    * Validation of the JSON output structure and specific field values using tools like `jq` or simple `grep` checks within the tests.
    * Testing of error conditions and the `--json-errors` flag output (if implemented).

## Detailed Project Timeline

**Phase 0: Pre-Acceptance Preparation (April 9 - May 7, 2025)**

* **Focus:** Demonstrate continued interest and deepen understanding while awaiting results.
* **Official GSoC Milestone:** April 8, 2025 - Proposal Deadline.
* **Activities:**
    * **(April 9 - April 21):** Deep dive into Git's source code structure, focusing specifically on areas identified in the proposal's Technical Details:
        * `builtin/` directory structure and command handling.
        * `repository.h`, `refs.h`, `remote.h`, `config.c`, `strbuf.h`.
        * How existing commands like `git status`, `git branch`, `git rev-parse`, `git remote -v` access underlying data.
    * **(April 22 - May 7):**
        * Monitor the Git mailing list for discussions related to repository information, command output formats, or JSON usage.
        * Deepen my understanding of Git's testing framework (`t/test-lib.sh`), which I have not yet studied in depth. Run and read existing tests relevant to refs, remotes, or configuration.
        * Review Git's contribution guidelines (`SubmittingPatches`, coding style) again, since most of my microproject time was spent on documentation.
        * Take on more microprojects or actively participate in discussions on other patches.

**Phase 1: Finalize the requirements (May 8 - May 26, 2025 Approx.)**

* **Focus:** Finalize plans with mentors, setup, deep dive into specifics.
* **Official GSoC Milestone:** May 8, 2025 - Accepted Projects Announced.
* **Activities:**
    * **(Week 1: May 8 - May 12):**
        * Discuss the project proposal in detail, clarifying scope, priorities, and mentor expectations.
        * Finalize the decision on the JSON generation strategy (library vs. `strbuf`) based on mentor feedback and feasibility assessment.
        * Confirm the initial target JSON schema.
    * **(Week 2: May 13 - May 19):**
        * Perform a deep dive into the *specific* functions identified for use (e.g., `resolve_ref_unsafe`, `shorten_unambiguous_ref`, remote access functions, config API, chosen JSON method).
        * Start outlining the structure of `builtin/metadata.c`.
    * **(Week 3: May 20 - May 26):**
        * Begin writing the basic skeleton of `builtin/metadata.c` and the initial test file `t/tXXXX-metadata.sh`.
        * Post the first blog update summarizing the initial plan.

**Phase 2: Core Implementation & Setup (Coding Weeks 1-4: May 27 - June 23, 2025 Approx.)**

* **Focus:** Implement the basic command structure and retrieve core repository/HEAD information.
* **Activities:**
    * **(Week 1: May 27 - June 2):** Implement `cmd_metadata` skeleton, argument parsing (if any initially), repository struct access. Implement retrieval of `.git` path and `is_bare` status. Integrate chosen JSON generation approach (setup library or `strbuf` helpers).
    * **(Week 2: June 3 - June 9):** Implement HEAD resolution (commit SHA, full ref name). Implement logic for determining the short symbolic name using appropriate Git functions. Integrate HEAD info into JSON output.
    * **(Week 3: June 10 - June 16):** Write initial test cases in `t/tXXXX-metadata.sh` covering basic invocation, bare repos, and detached HEAD states. Refine JSON output structure.
    * **(Week 4: June 17 - June 23):** Prepare and submit the first set of patches covering core repo/HEAD functionality to the mailing list. Address initial feedback. Write blog post update.

**Phase 3: Adding Remotes & Refinement (Coding Weeks 5-8: June 24 - July 21, 2025 Approx.)**

* **Focus:** Add remote information retrieval and expand testing significantly. Aim for demonstrable core functionality by Midterm.
* **GSoC Milestone:** Midterm Evaluations.
* **Activities:**
    * **(Week 5: June 24 - June 30):** Research and implement logic to list remote names. Implement logic to query fetch/push URLs for each remote using the config API.
    * **(Week 6: July 1 - July 7):** Integrate remote information into the JSON output structure. Handle edge cases (no remotes, missing push URL).
    * **(Week 7: July 8 - July 14):** Significantly expand the test suite: add tests for various remote configurations, unborn branches. Refine existing tests based on feedback. Start drafting the man page (`Documentation/git-metadata.txt`).
    * **(Week 8: July 15 - July 21):** Prepare and submit patches for remote functionality. Ensure core command (`repo`, `head`, `remotes` info) is stable and well-tested for Midterm Evaluation. Code cleanup based on reviews. Write blog post update and prepare Midterm Evaluation submission.

**Phase 4: Documentation, Polish & Stretch Goals (Coding Weeks 9-12: July 22 - Aug 18, 2025 Approx.)**

* **Focus:** Finalize documentation, implement error handling, address feedback, attempt stretch goals if feasible.
* **Activities:**
    * **(Week 9: July 22 - July 28):** Complete the first draft of the man page, detailing usage, JSON schema, and options. Implement the `--json-errors` functionality for structured error reporting. Add tests for error cases.
    * **(Week 10: July 29 - Aug 4):** *Begin Stretch Goals (Conditional):* If core work is stable and time permits, start implementing `--head-only` / `--remotes-only` flags or the basic `is_dirty` check. Add tests for any implemented stretch goals.
    * **(Week 11: Aug 5 - Aug 11):** Thorough code cleanup, address all outstanding review comments on submitted patches. Ensure documentation is comprehensive and accurate. Final pass on test suite coverage.
    * **(Week 12: Aug 12 - Aug 18):** Prepare and submit final patches incorporating documentation, error handling, and any completed stretch goals. Final code freeze for GSoC evaluation purposes. Write blog post update summarizing final phase.

**Phase 5: Final Evaluation & Wrap-up (Aug 19 - Nov 19, 2025)**

* **Focus:** Final submissions, respond to late feedback, ensure project completion.
* **GSoC Milestone:** Final Evaluations likely occur early in this period.
* **Official GSoC Milestone:** November 19, 2025 - Program End Date.
* **Activities:**
    * **(Late Aug - Sept):** Complete any unfinished work and follow up on the next set of tasks (stretch goals).
    * **(Oct - Nov 19):** Monitor mailing list for patch status. Write final GSoC project summary blog post. Continue engaging with the community if interested in further contributions beyond GSoC.

## Past Communication and Microproject

* **Blog**: [Blog](https://jayatheerthkulkarni.github.io/gsoc_blog/index.html)
  This blog contains a detailed account of my communication with the community and of my microproject experience.
* First introduction to the Git mailing list: [First Mail](https://lore.kernel.org/git/CA+rGoLc69R8qgbkYQiKoc2uweDwD10mxZXYFSY8xFs5eKSRVkA@xxxxxxxxxxxxxx/t/#u)
* First patch to the Git mailing list: [First Patch](https://lore.kernel.org/git/20250312081534.75536-1-jayatheerthkulkarni2005@xxxxxxxxx/t/#u)
* Most recent patch series and the back-and-forth on feedback: [Main mail thread](https://lore.kernel.org/git/xmqqa59evffd.fsf@gitster.g/T/#t)

I have been maintaining the blog and will continue to document all of my communication with the Git mailing list there.

Thank You,
Jayatheerth