Re: [GSOC] [Proposal v1] Machine-Readable Repository Information Query Tool

On Sun, Apr 6, 2025 at 1:12 AM Karthik Nayak <karthik.188@xxxxxxxxx> wrote:
>
> JAYATHEERTH K <jayatheerthkulkarni2005@xxxxxxxxx> writes:
>
> > # Proposal for GSOC 2025 to Git
> > **Refactoring `git rev-parse`: A Dedicated Command for Repository Information**
> >
> > ## Contact Details
> > * **Name**: K Jayatheerth
> > * **Email**: jayatheerthkulkarni2005@xxxxxxxxx
> > * **Blog**: [Blog](https://jayatheerthkulkarni.github.io/gsoc_blog/index.html)
> > * **GitHub**: [GitHub](https://github.com/jayatheerthkulkarni)
> >
> > ## Prerequisites & Experience
> >
> >
> > As part of the GSoC application prerequisites, I have engaged with the
> > Git community with a microproject involving documentation changes.
> > This provided valuable experience with Git's codebase, contribution
> > workflow (patch submission, feedback cycles), and communication via
> > the mailing list.
> >
> > * **Microproject Patch Series:** [Main mail
> > thread](https://lore.kernel.org/git/xmqqa59evffd.fsf@gitster.g/T/#t)
> > (Link to the most relevant thread demonstrating interaction and
> > successful patch refinement)
> > * **Initial Patch:** [First
> > Patch](https://lore.kernel.org/git/20250312081534.75536-1-jayatheerthkulkarni2005@xxxxxxxxx/t/#u)
> > * **Mailing List Introduction:** [First
> > Mail](https://lore.kernel.org/git/CA+rGoLc69R8qgbkYQiKoc2uweDwD10mxZXYFSY8xFs5eKSRVkA@xxxxxxxxxxxxxx/t/#u)
> > * **Blog:** My GSoC blog details these interactions:
> > [Blog](https://jayatheerthkulkarni.github.io/gsoc_blog/index.html)
> >
>
> It would be nice to give a small brief about your microproject here and
> perhaps its current status.
>

Sure, I will add that.

> > ## **Synopsis**
> >
> > This project focuses on **refactoring Git by creating a dedicated
> > command (tentatively named `git repo-info`) to house the low-level
>
> I wonder if 'git info' itself would be a good name, we could default the
> command to list all prominent information about a repository. This would
> be useful instead of scripts invoking 'git rev-parse

Yes, and as far as I can tell `git info` doesn't collide with any
builtin or popular third-party command.

> --is-bare-repository' followed by a 'git rev-parse --is-inside-git-dir'
> and so on. But we can discuss this later.
>

True. For the proposal, though, I listed some of the commands and
technical details explicitly, with the timeline divided into batches as
required.

> > repository, path, and format-related query options currently misplaced
> > under the "OPTIONS FOR FILES" section of `git-rev-parse(1)`**. This
> > new command will provide a more logical and maintainable location for
> > this functionality, allowing `git rev-parse` to better focus on its
> > core purpose of parsing revisions, thus improving Git's internal
> > organization and command structure clarity.
> >
> > ## **Benefits to the Community**
> >
> > ### **1. Improves `git rev-parse` Clarity and Maintainability**
> > - `git rev-parse` has accumulated various options unrelated to its
> > primary purpose of parsing revisions, particularly those for querying
> > low-level repository state and paths.
> > - This project **directly addresses this issue** by migrating these
> > options to a dedicated command, making `git rev-parse` cleaner and
> > easier to understand and maintain.
> > - Provides a **clearer separation of concerns** within Git's command suite.
> >
> > ### **2. Provides Reliable Access for Automation and Scripting**
> > - Scripts often need fundamental repository information like the
> > top-level directory path (`--show-toplevel`), the `.git` directory
> > location (`--git-dir`), or repository state (`--is-bare-repository`).
> > - Currently, scripts rely on `git rev-parse` for this, mixing
> > low-level repo queries with revision parsing calls.
> > - The new `git repo-info` command will offer a **stable, dedicated
> > interface** for retrieving this specific low-level information, making
> > scripts **cleaner and more robust** by calling the command designed
> > explicitly for these tasks.
> > - The default output will mimic the **existing, simple text format**
> > of the `rev-parse` options, ensuring compatibility for scripts
> > migrating to the new command.
> >
> > ### **3. Enhances CI/CD Pipeline Foundations**
> > - CI/CD pipelines frequently need to establish context by determining
> > the repository root or `.git` directory location early in their
> > execution.
> > - Using the dedicated `git repo-info` command for these foundational
> > queries **simplifies the initial setup steps** in pipeline scripts
> > compared to using the overloaded `git rev-parse`.
> >
>
> We must note that using 'git rev-parse' isn't sub-optimal. Matter of
> fact, if you look at 'builtin/rev-parse.c', you'll see that each of
> these flags are under an if..else clause. The goal of this project is
> more to provide a clean interface and a home for such information
> queries.
>

I've gone through the code in rev-parse.c and yes, I found the if..else
blocks. I take it you mean that git rev-parse was not meant to do these
tasks, but that it does them well enough.
If so, I think that makes the project easier: we mostly have to port
similar code, then test and document it carefully.
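
To make the porting idea concrete, here is a rough sketch of how the
if..else chain could turn into a lookup table in the new command.
Everything in it (`struct repo_state`, `run_query`, the handler names)
is made up for illustration and is not Git's actual code; a real
implementation would rely on Git's own setup and parse-options
machinery. The "true"/"false" strings follow what rev-parse prints for
these queries today.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the state Git's setup code would provide. */
struct repo_state {
	int is_bare;
	int inside_git_dir;
	const char *toplevel; /* NULL when there is no work tree */
};

typedef int (*query_fn)(const struct repo_state *rs, char *out, size_t len);

static int q_is_bare(const struct repo_state *rs, char *out, size_t len)
{
	return snprintf(out, len, "%s", rs->is_bare ? "true" : "false");
}

static int q_inside_git_dir(const struct repo_state *rs, char *out, size_t len)
{
	return snprintf(out, len, "%s", rs->inside_git_dir ? "true" : "false");
}

static int q_toplevel(const struct repo_state *rs, char *out, size_t len)
{
	if (!rs->toplevel)
		return -1; /* e.g. a bare repository has no work tree */
	return snprintf(out, len, "%s", rs->toplevel);
}

/* One table entry per ported option instead of one if..else branch. */
static const struct query {
	const char *name;
	query_fn fn;
} queries[] = {
	{ "--is-bare-repository", q_is_bare },
	{ "--is-inside-git-dir", q_inside_git_dir },
	{ "--show-toplevel", q_toplevel },
};

/* Returns 0 on success, -1 for an unknown option, no answer, or truncation. */
static int run_query(const struct repo_state *rs, const char *opt,
		     char *out, size_t len)
{
	size_t i;
	int w;

	assert(rs && opt && out && len > 0);
	for (i = 0; i < sizeof(queries) / sizeof(queries[0]); i++) {
		if (strcmp(opt, queries[i].name))
			continue;
		w = queries[i].fn(rs, out, len);
		return (w < 0 || (size_t)w >= len) ? -1 : 0;
	}
	return -1;
}
```

With a layout like this, porting one more option would mean adding one
handler and one table row, which might also fit the batch-wise patch
series nicely.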

> As such the main goals of the project, are very design oriented, as
> you've also mentioned, I would say:
> 1. What should the command be called?
> 2. What sub-commands should it support?
> 3. What options from 'git-rev-parse(1)' do we need to port, do they need
>    to be renamed?
> 4. What other options can we possibly provide?
>

Yup, I haven't really explored point 3 yet (do they need to be renamed?).
That is partly because the list of options to port is still a bit
ambiguous. We can discuss the naming once we have a detailed list of
what needs to be ported into the `git info` command.

> > ## Deliverables
> >
> > Acknowledging the needs that the project scope is focused on
> > refactoring `git rev-parse`, this project will introduce a new Git
> > command, tentatively named `git repo-info`, serving as the designated
> > home for specific low-level query options.
> >
> > The key deliverables for this GSoC project include:
> >
> > 1. **New Core Command: `git repo-info`**
> > * A new `builtin/repo-info.c` command integrated into the Git source code.
> > * Implementation primarily in C, leveraging existing internal Git APIs
> > and logic currently within `rev-parse.c` to implement the relocated
> > options.
> >
> > 2. **Relocated `rev-parse` Options:**
> > * Implementation of the core functionality behind the following
> > options from `git-rev-parse(1)`'s "OPTIONS FOR FILES" section within
> > the new `git repo-info` command:
> > * **Path Queries:** `--show-cdup`, `--show-prefix`, `--show-toplevel`,
> > `--show-superproject-working-tree`
> > * **Directory Queries:** `--git-dir`, `--git-common-dir`,
> > `--resolve-git-dir <path>`
> > * **State/Format Queries:** `--is-inside-git-dir`,
> > `--is-inside-work-tree`, `--is-bare-repository`,
> > `--is-shallow-repository`
> > * **Index File Query:** `--shared-index-path`
> >
>
> Perhaps we want to breakdown so we can have:
>
> git info path [--cdup | --prefix | --toplevel ... ]
> git info repo [--is-bare | --is-shallow]
>
> and so on...

OK, I will group these by usage.
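
As a first idea of what that grouping could look like internally, here
is a small sketch that maps each subcommand to the options it accepts.
The subcommand and option names are the ones from your example; the
table layout and `check_option()` are only hypothetical, and the real
command would presumably use Git's parse-options API instead.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Option groups taken from the example above; NULL terminates each list. */
static const char *const path_opts[] = { "--cdup", "--prefix", "--toplevel", NULL };
static const char *const repo_opts[] = { "--is-bare", "--is-shallow", NULL };

static const struct subcommand {
	const char *name;
	const char *const *opts;
} subcommands[] = {
	{ "path", path_opts },
	{ "repo", repo_opts },
};

/*
 * Returns 1 if `opt` belongs to subcommand `sub`, 0 if the subcommand
 * exists but does not take that option, and -1 for an unknown subcommand.
 */
static int check_option(const char *sub, const char *opt)
{
	size_t i, j;

	assert(sub && opt);
	for (i = 0; i < sizeof(subcommands) / sizeof(subcommands[0]); i++) {
		if (strcmp(sub, subcommands[i].name))
			continue;
		for (j = 0; subcommands[i].opts[j]; j++)
			if (!strcmp(opt, subcommands[i].opts[j]))
				return 1;
		return 0;
	}
	return -1;
}
```

For example, `check_option("path", "--toplevel")` accepts the option,
while `check_option("repo", "--toplevel")` rejects it because the
option is in the wrong group.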

>
> > 3. **Default Output Format (Text-Based):**
> > * The command's default output for each implemented option will
> > **match the current plain text output** produced by `git rev-parse`
> > for that same option, ensuring backward compatibility for scripts
> > migrating to the new command. Output will primarily be via standard C
> > functions like `printf` or `puts`.
>
> Since we'll keep the existing options within 'git-rev-parse(1)', we'll
> stay backwards compatible. Scripts which want to move to the new
> command, would anyway need to change the command, so there is no
> backward compatibility there. But, having the default to be human
> readable like the current implementation does make sense.
>

OK, this raises a new question. If backward compatibility is not a
constraint and machine readability is a goal, I think using JSON from
the start would keep things flexible. Do correct me if I'm wrong.

Once people build scripts around the human-readable text, they probably
won't switch to JSON later, and JSON might avoid some parsing errors.
So I would appreciate feedback on this.
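
One kind of error I have in mind: a path that contains a newline
breaks line-based parsing, while a JSON string can carry it escaped.
Below is a minimal sketch of such escaping. `json_quote()` is a
hypothetical helper written just for this mail, not an existing Git
function, and it passes bytes >= 0x80 through unchanged, so paths that
are not valid UTF-8 would still need a decision.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Writes `s` as a JSON string literal into `out` (NUL-terminated).
 * Returns the number of bytes written, not counting the NUL, or -1 if
 * `out` is too small.
 */
static int json_quote(const char *s, char *out, size_t len)
{
	size_t n = 0;
	const unsigned char *p;

	assert(s && out);
	if (len < 3)
		return -1; /* need at least two quotes and the NUL */
	out[n++] = '"';
	for (p = (const unsigned char *)s; *p; p++) {
		char esc[7]; /* longest escape is \uXXXX plus NUL */
		int w;

		if (*p == '"' || *p == '\\')
			w = snprintf(esc, sizeof(esc), "\\%c", *p);
		else if (*p == '\n')
			w = snprintf(esc, sizeof(esc), "\\n");
		else if (*p == '\t')
			w = snprintf(esc, sizeof(esc), "\\t");
		else if (*p < 0x20)
			w = snprintf(esc, sizeof(esc), "\\u%04x", (unsigned)*p);
		else
			w = snprintf(esc, sizeof(esc), "%c", *p);
		/* keep room for the closing quote and the NUL */
		if (n + (size_t)w + 2 > len)
			return -1;
		memcpy(out + n, esc, (size_t)w);
		n += (size_t)w;
	}
	out[n++] = '"';
	out[n] = '\0';
	return (int)n;
}
```

So a top-level directory named "a<newline>b" would be emitted as
"a\nb" on a single line, and a consumer gets the original name back
after decoding.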

> >
> > 4. **Comprehensive Documentation:**
> > * A clear man page (`git-repo-info.adoc`) explaining the new command's
> > purpose and detailing the usage and output of each implemented option.
> > * Updates to `git-rev-parse.adoc` to clearly **deprecate** the
> > relocated options (or mark them as aliases for compatibility) and
> > point users to the new `git repo-info` command.
> >
> > 5. **Robust Test Suite:**
> > * A new test script (`t/tXXXX-repo-info.sh`) using Git's test
> > framework (`test-lib.sh`).
> > * Tests specifically validating the output of `git repo-info --option`
> > against the output of `git rev-parse --option` across various
> > repository states (standard repo, bare repo, inside `.git`, inside
> > worktree, submodules, shallow clone etc.) to ensure functional parity.
> >
> > 6. **(Stretch Goal / Potential Future Work): Structured Output**
> > * If time permits after successfully implementing, documenting, and
> > testing the core text-based functionality, investigate adding a
> > `--format=json` option to provide a structured JSON output containing
> > the results of the requested queries. This is explicitly a secondary
> > goal, contingent on completing the primary refactoring task.
>
> Many of the plumbing commands in Git provide NUL-terminated output, I'm
> curious whether we should consider that over JSON.
>

I think in the long term we can use a flag to select the output format.
Both formats have their own pros and cons. As I said, I would like some
feedback on the format.
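
For the flag idea, this is roughly what I picture: the query logic
stays the same and only the terminator changes with the chosen format.
The enum and `emit_value()` are hypothetical names for this sketch; a
JSON mode could become a third enum value later.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical output modes selected by a command-line flag. */
enum output_format {
	FORMAT_TEXT, /* value followed by a newline, as rev-parse does today */
	FORMAT_NUL   /* value followed by a NUL byte, as with -z elsewhere */
};

/*
 * Copies `value` plus its terminator into `out`.
 * Returns the number of bytes written including the terminator,
 * or -1 if `out` is too small.
 */
static int emit_value(enum output_format fmt, const char *value,
		      char *out, size_t len)
{
	size_t n;

	assert(value && out);
	n = strlen(value);
	if (n + 1 > len)
		return -1;
	memcpy(out, value, n);
	out[n] = (fmt == FORMAT_NUL) ? '\0' : '\n';
	return (int)(n + 1);
}
```

With FORMAT_NUL a path containing a newline stays unambiguous without
any escaping, which is one of the pros of that format; the con is that
the output is harder to read by eye.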

> > **Out of Scope for GSoC (Based on Refined Goal):**
> > * Querying high-level metadata like current branch name, HEAD commit
> > details (beyond `--is-shallow-repository`), remote URLs, tags, or
> > arbitrary configuration values.
>
> We should make sure that we don't overload this new command too.
>

Agreed. For this GSoC I will largely focus on porting the existing
rev-parse flags to a new command.

> > * Complex status reporting (worktree dirtiness).
> > * Real-time monitoring or comparing metadata between revisions.
> > * Implementing JSON output as the *primary* feature.
> >

I think the overloading concern above applies here as well, even to the
revision-related items. I will remove this part from the proposal too,
since it would also make the command cluttered.

> > ## Technical Details
> >
> > This section outlines the proposed technical approach for implementing
> > the `git repo-info` command and relocating the specified options:
> >
> > 1. **Core `git repo-info` Command Implementation:**
> > * **Entry Point:** Create `builtin/repo-info.c` with
> > `cmd_repo_info(...)` function. Parse options using Git's
> > `parse-options` API.
> > * **Repository Context:** Utilize the standard `repo` structure and
> > `startup_info` provided by Git's infrastructure. Setup the repository
> > context similar to how `cmd_rev_parse` does it if needed (e.g., using
> > `setup_git_directory_gently`).
> > * **Reusing Logic:** Analyze the implementation of the target options
> > within `builtin/rev-parse.c`. Extract and adapt the relevant C
> > functions and logic (related to path manipulation using `prefix_path`,
> > `real_pathcmp`; repository state checks using
> > `is_bare_repository_cfg`, `is_inside_git_dir`, `is_inside_work_tree`;
> > accessing `startup_info`, `git_path`, etc.) into `builtin/repo-info.c`
> > or potentially shared helper functions if appropriate.
> > * **Specific Option Implementation:**
> > * `--show-toplevel`, `--show-cdup`, `--show-prefix`: Rely on the
> > `prefix` calculated during setup and path manipulation functions.
> > * `--git-dir`, `--git-common-dir`: Access `repo->gitdir`,
> > `repo->commondir` or use functions like `get_git_dir()`,
> > `get_common_dir()`. `--resolve-git-dir` will involve path resolution
> > relative to the provided argument.
> > * `--is-*` flags: Call existing helper functions like
> > `is_bare_repository_cfg()`, `is_inside_git_dir()`,
> > `is_inside_work_tree()`. `--is-shallow-repository` involves checking
> > `repo->is_shallow`.
> > * `--shared-index-path`: Access path information related to split
> > indexes if enabled.
> > * **Output Generation:** Use standard C `printf("%s\n", ...)` or
> > `puts(...)` to print the resulting string (path, "true"/"false", etc.)
> > to standard output, matching `rev-parse`'s current behavior. Boolean
> > flags typically exit `0` for true and `1` for false without output,
> > this behavior should be preserved.
> >
> > 2. **Documentation:**
> > * Create `Documentation/git-repo-info.adoc` using AsciiDoc format,
> > modeling it after existing man pages. Detail each option, its purpose,
> > and expected output.
> > * Modify `Documentation/git-rev-parse.adoc`, adding notes to the
> > relevant options indicating they are better handled by `git repo-info`
> > and potentially marking them for deprecation in a future Git version.
> >
> > 3. **Testing:**
> > * Create `t/tXXXX-repo-info.sh` using `test-lib.sh`.
> > * Structure tests using `test_expect_success` blocks.
> > * Utilize helper functions like `test_create_repo`, `cd repo`,
> > `test_cmp` to compare the output of `git repo-info --option` directly
> > against `git rev-parse --option` (for options producing output) or
> > against expected exit codes (for boolean flags).
> > * Cover edge cases like running outside a repository, in a bare
> > repository, deep within a worktree, within the `.git` directory, and
> > in repositories with submodules or worktrees.
> >
> > 4. **(Stretch Goal) JSON Output Implementation:**
> > * If attempted, add a `--format=json` option using `parse-options`.
> > * Collect results from the requested options internally.
> > * Use either an approved embedded C JSON library or Git's `strbuf` API
> > (with helpers like `strbuf_add_json_string`) to construct a JSON
> > object mapping option names (or descriptive keys) to their
> > corresponding values. Print the final JSON string to standard output.
> > Add specific tests for JSON output validation.
> >
> > ## Detailed Project Timeline
> >
> > **Phase 0: Pre-Acceptance Preparation (April 9 - May 7, 2025)**
> >
> > * **Focus:** Demonstrate continued interest and deepen understanding
> > *specifically of `rev-parse`'s internals* while awaiting results.
> > * **Activities:**
> > * **(April 9 - April 21):** Deep dive into `builtin/rev-parse.c`,
> > identifying the exact code blocks implementing the "OPTIONS FOR
> > FILES". Trace how they use `startup_info`, `prefix`, path functions,
> > and repository flags.
> > * **(April 22 - May 7):** Continue monitoring the mailing list. Refine
> > understanding of Git's testing framework, specifically focusing on
> > tests for `rev-parse` options (e.g., `t1006-cat-file.sh`,
> > `t5601-clone.sh` might use some flags). Review contribution
> > guidelines.
> >
> > **Phase 1: Final Planning (May 8 - May 26, 2025 Approx.)**
> >
> > * **Focus:** Formal introductions, confirm final scope & plan, setup.
> > * **Activities:**
> > * **(Week 1: May 8 - May 12):** Introduction with mentor(s). Confirm
> > the exact list of `rev-parse` options to be migrated. Discuss the
> > preferred approach for handling deprecation in `rev-parse` docs/code.
> > Discuss potential for shared helper functions vs. direct code
> > migration.
> > * **(Week 2: May 13 - May 19):** Set up dev environment. Deep dive
> > into the agreed-upon functions/code blocks within `rev-parse.c`.
> > Outline the basic structure for `builtin/repo-info.c` and the test
> > script `t/tXXXX-repo-info.sh`.
> > * **(Week 3: May 20 - May 26):** Implement the basic `cmd_repo_info`
> > skeleton, option parsing setup, and repository setup boilerplate.
> > Write initial "no-op" tests. Post first blog update.
> >
> > **Phase 2: Implementation in Batches (Coding Weeks 1-8: May 27 - July
> > 21, 2025 Approx.)**
> >
> > * **Focus:** Implement options in logical groups, test thoroughly,
> > submit patches early and often.
> > * **GSoC Milestone:** Midterm Evaluations occur around Week 8.
> > * **Activities:**
> > * **(Batch 1 / Weeks 1-2: May 27 - June 9):** Implement basic path
> > queries: `--show-toplevel`, `--show-prefix`, `--show-cdup`. Add tests
> > comparing output with `rev-parse`. **Submit Patch Series 1**.
> > * **(Batch 2 / Weeks 3-4: June 10 - June 23):** Implement directory
> > queries: `--git-dir`, `--git-common-dir`, `--resolve-git-dir <path>`.
> > Add tests. **Submit Patch Series 2**. Write blog post update.
> > * **(Batch 3 / Weeks 5-6: June 24 - July 7):** Implement boolean state
> > queries: `--is-bare-repository`, `--is-inside-git-dir`,
> > `--is-inside-work-tree`. Add tests checking exit codes and behavior in
> > various locations. **Submit Patch Series 3**.
> > * **(Batch 4 / Weeks 7-8: July 8 - July 21):** Implement remaining
> > queries: `--is-shallow-repository`, `--shared-index-path`,
> > `--show-superproject-working-tree`. Add comprehensive tests covering
> > interactions (e.g., in submodules, shallow clones). **Submit Patch
> > Series 4**. Prepare for Midterm evaluation; ensure submitted batches
> > demonstrate core progress. Write blog post update.
> >
>
> It is nice to see breakdown in batches.
>

Thank you.

> > **Phase 3: Documentation & Final Polish (Coding Weeks 9-12: July 22 -
> > Aug 18, 2025 Approx.)**
> >
> > * **Focus:** Create documentation, address feedback on all patches,
> > refine implementation, potentially attempt stretch goal.
>
> I would say documentation should go hand-in-hand with each patch series
> that you send. Ideally every patch series should leave the code base in
> a usable state. Pushing documentation to the end, would mean that if the
> project is stopped midway, we'd have a running command in Git with no
> documentation about what it does.
>

Good point. I will write the documentation in parallel with each of the
batches above.

> > * **Activities:**
> > * **(Week 9: July 22 - July 28):** Write the first complete draft of
> > the man page for `git-repo-info`. Draft the necessary updates for
> > `git-rev-parse.adoc` (deprecation notices). **Submit Patch Series 5
> > (Documentation)**.
> > * **(Week 10: July 29 - Aug 4):** Focus on addressing review comments
> > on **all** previous patch series. Refactor code based on feedback.
> > Ensure test suite is robust and covers feedback points.
> > * **(Week 11: Aug 5 - Aug 11):** *Stretch Goal (Conditional):* If core
> > functionality and docs are stable and reviewed positively, begin
> > investigating/implementing `--format=json`. Add specific JSON tests if
> > implemented. Otherwise, focus on further code cleanup and test
> > hardening.
> > * **(Week 12: Aug 12 - Aug 18):** Prepare and submit final versions of
> > all patch series, incorporating all feedback. Final testing pass.
> > Write blog post update summarizing progress and final state. Code
> > freeze for final evaluation.
> >
> > **Phase 4: Final Evaluation & Wrap-up (Aug 19 - Nov 19, 2025)**
> >
> > * **Focus:** Final submissions, respond to late feedback, ensure
> > project completion.
> > * **Official GSoC Milestone:** November 19, 2025 - Program End Date.
> > * **Activities:**
> > * **(Late Aug - Sept):** Submit final GSoC evaluations. Actively
> > respond to any further comments on submitted patches from the
> > community/maintainers, aiming for merge readiness.
> > * **(Oct - Nov 19):** Monitor mailing list for patch status. Write
> > final GSoC project summary blog post. Continue engaging with the
> > community if interested in further contributions beyond GSoC.
> >
> >
> >
> > Thank You,
> > Jayatheerth
>
> Thanks for your proposal!
>
> - Karthik

Thank you again, Karthik. I will send the updated proposal in this
thread soon.

-Jayatheerth




