Verifying data integrity of two git repositories

I have a bare repository of size 5.7 GB on my local disk.

I need to push this to Azure DevOps. I usually do it with the command "git push --mirror" but unfortunately, Azure DevOps has a single push size limit of 5GB.

So I have to push repos larger than 5GB in chunks.

I used this Stack Overflow answer (https://stackoverflow.com/questions/79167276/splitt-git-push-to-azure-devops) as my basis and created a script that pushes each branch in batches of commits.

I pushed my repository in batches to a remote repo, which I will call "A".

I then did a "git clone --bare" from remote repo A to my local disk and checked the size of this bare clone: it is only about 5 GB.

	i) I counted the objects in both repos with "git rev-list --objects --all | wc -l"; the counts are the same.

	ii) Both repos have a single branch, master, and the tip commit IDs of the two master branches match. (I read in an article that this is also a data integrity check, since git chains hashes much like a blockchain.)

	iii) I ran "git fsck --full" in both repos; both gave the same output:

		Checking object directories: 100% (256/256), done.
		Checking objects: 100% (10793794/10793794), done.
		Checking connectivity: 10793794, done.

		However, the original repo on disk printed this extra line at the end (the clone of the remote did not):

		Verifying commits in commit graph: 100% (1351940/1351940), done.
	
	iv) I created a bundle of the original repo with "git bundle create repo.bundle --all", then ran "git bundle verify ../repo.bundle" in the clone of the remote. Output:

		The bundle contains these 883 refs:
		<All Refs>
		The bundle records a complete history.
		The bundle uses this hash algorithm: sha1
		/home/repo.bundle is okay

	v) I checked the repo size with "git count-objects -vH"; size-pack differs (5.62 GB for the original repo, 4.93 GB for the clone of the remote).
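
Checks i) and ii) compare a count and a single tip, so a stricter variant would compare the actual ref names and reachable object IDs. The following is only a sketch under assumptions: repo_fingerprint and repos_match are hypothetical helper names, and only branches and tags are considered, because a "git clone --bare" copies those but not other ref namespaces such as refs/remotes.

```shell
#!/bin/bash
# Sketch: compare two repositories by content instead of by size or count.
# repo_fingerprint and repos_match are hypothetical helper names.

# Print one hash that summarises every branch/tag ref and every object
# reachable from them.
repo_fingerprint() {
    local repo="$1"
    {
        # Each branch and tag name with the object it points at.
        git -C "$repo" for-each-ref \
            --format='%(objectname) %(refname)' refs/heads refs/tags
        # IDs of all objects reachable from branches and tags; the path
        # that rev-list prints after each tree/blob ID is cut away.
        git -C "$repo" rev-list --objects --branches --tags |
            cut -d' ' -f1 | LC_ALL=C sort -u
    } | git -C "$repo" hash-object --stdin
}

# Succeed only when both repositories produce the same fingerprint.
repos_match() {
    [ "$(repo_fingerprint "$1")" = "$(repo_fingerprint "$2")" ]
}
```

Called as 'repos_match /root/linux /path/to/clone.git && echo same' (the second path is a placeholder), this succeeds only if both repos have identical branch and tag refs and the same set of reachable object IDs. Since an object ID is a hash of the object's content, equal ID sets imply equal content as long as "git fsck" passes on both sides.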

Note: My repository has no lfs/objects directory, so there are no LFS objects involved and they cannot explain the difference.

Why do the sizes differ? And how can I validate that the two repositories are the same?
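
One thing that could be checked for the size question is whether the original repo stores more objects than are reachable from its refs, since a fresh clone only receives reachable objects and packs them anew. This is a sketch, not a diagnosis: stored_vs_reachable is a hypothetical helper name, and the "stored" figure is an approximation because an object present in several packs is counted once per pack.

```shell
#!/bin/bash
# Sketch: compare how many objects a repo stores with how many are
# reachable from its refs. stored_vs_reachable is a hypothetical name.
stored_vs_reachable() {
    local repo="$1" stored reachable
    # "count" is the number of loose objects, "in-pack" the number packed.
    stored=$(git -C "$repo" count-objects -v |
        awk '$1 == "count:" || $1 == "in-pack:" { s += $2 } END { print s + 0 }')
    # Unique object IDs reachable from any ref; $(( )) strips any padding
    # that wc may add.
    reachable=$(( $(git -C "$repo" rev-list --objects --all |
        cut -d' ' -f1 | LC_ALL=C sort -u | wc -l) ))
    echo "stored=$stored reachable=$reachable"
}
```

If "stored" is clearly larger than "reachable" in the original repo but not in the clone, part of the size gap would come from unreachable or duplicated objects rather than from missing data. Differences in delta compression between the two packs can also change size-pack without any difference in content.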

Script being used to push in batches of commits:

#!/bin/bash
set -e

# === CONFIGURATION ===
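RepositoryFolderPathForBareCloneBAK="/root/linux"
BackupRepositoryHttpsURL="<REMOTE_URL>"   # placeholder; no trailing space inside the quotes
remoteName="origin"
maxPushSizeInMB=$((4 * 1024)) # 4 GiB expressed in MiB, below the 5 GB server limit
splitPushCommitsCount=35000   # initial value; recalculated per branch below
splitPush=false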

ALocation=$(pwd)

if [ ! -d "$RepositoryFolderPathForBareCloneBAK" ]; then
    echo "Error: Bare clone folder not found at $RepositoryFolderPathForBareCloneBAK"
    exit 1
fi

cd "$RepositoryFolderPathForBareCloneBAK"
git config http.postBuffer 524288000

doSplitPush=$splitPush

# Always measure the pack size: the batch-size calculation in the split push
# needs repositorySize even when splitPush is forced to true.
echo "Checking repository size..."
repositorySize=0
while read -r line; do
    echo "$line"
    if [[ "$line" =~ ^size-pack:\ ([0-9]+(\.[0-9]+)?)\ ([A-Za-z]+) ]]; then
        value=${BASH_REMATCH[1]}
        unit=${BASH_REMATCH[3]}
        # Convert the human-readable value to MiB
        case "$unit" in
            bytes) repositorySize=$(echo "$value / 1024 / 1024" | bc) ;;
            KiB)   repositorySize=$(echo "$value / 1024" | bc) ;;
            MiB)   repositorySize=$(echo "$value" | bc) ;;
            GiB)   repositorySize=$(echo "$value * 1024" | bc) ;;
            *)     repositorySize=$(echo "$value" | bc) ;;
        esac
    fi
done < <(git count-objects -vH)

# Round down to an integer; bc prints values below 1 as ".5", which would
# leave an empty string, so default to 0 in that case
repositorySize=${repositorySize%.*}
repositorySize=${repositorySize:-0}

echo "Repo size: $repositorySize MiB"

if [ "$repositorySize" -ge "$maxPushSizeInMB" ]; then
    doSplitPush=true
fi

# Unset mirror config to allow partial pushes if needed
if git config --get "remote.$remoteName.mirror" >/dev/null; then
    git config --unset "remote.$remoteName.mirror"
fi

# Setup remote
NewREMOTE="push_remote"
# -x matches the whole line, so a remote such as "push_remote_old" is not mistaken for it
if git remote | grep -qx "$NewREMOTE"; then
    git remote remove "$NewREMOTE"
fi
git remote add "$NewREMOTE" "$BackupRepositoryHttpsURL"

if [ "$doSplitPush" = false ]; then
    echo "Performing full push to $BackupRepositoryHttpsURL"
    git push "$NewREMOTE" --mirror
else
    echo "Performing split push to $BackupRepositoryHttpsURL"

    git for-each-ref --format="%(refname)" --sort='authordate' | while read -r ref; do
        if [[ "$ref" == refs/heads/* ]]; then
            BRANCH="${ref#refs/heads/}"
            echo "Processing branch: $BRANCH"

            git symbolic-ref HEAD "$ref"

            if git show-ref --quiet --verify "refs/remotes/$NewREMOTE/$BRANCH"; then
                range="$NewREMOTE/$BRANCH..HEAD"
            else
                range="HEAD"
            fi

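            # "format:" omits the newline after the last entry, so "wc -l" would
            # report one commit too few; let git count the commits itself.
            n=$(git rev-list --first-parent --count "$range")
            echo "$n commits to push"

            # Scale the batch size to the repo size. Keep the previous value if
            # the size is unknown, and never let the batch size drop below 1,
            # which would cause a division by zero further down.
            if [ "${repositorySize:-0}" -gt 0 ]; then
                splitPushCommitsCount=$(( (maxPushSizeInMB * n) / repositorySize ))
            fi
            if [ "$splitPushCommitsCount" -gt 20000 ]; then splitPushCommitsCount=20000; fi
            if [ "$splitPushCommitsCount" -lt 1 ]; then splitPushCommitsCount=1; fi

            echo "Calculated splitPushCommitsCount: $splitPushCommitsCount"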

            if [ "$n" -gt 0 ]; then
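                loopCount=$((n / splitPushCommitsCount))
                for ((i=1; i<=loopCount; i++)); do
                    # Pick the commit that sits i batches above the oldest unpushed
                    # commit: skip the newer commits on the first-parent chain of
                    # HEAD and take exactly one.
                    h=$(git log --first-parent --format=format:%H --skip $((n - (i * splitPushCommitsCount))) -n1)
                    echo "Batch commit: $h"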
                    git push "$NewREMOTE" --force "$h:refs/heads/$BRANCH"
                    echo "sleeping for 5 minutes"
                    sleep 300
                done
                echo "Final push: HEAD:refs/heads/$BRANCH"
                git push "$NewREMOTE" --force "HEAD:refs/heads/$BRANCH"
            else
                echo "No commits to push for $BRANCH"
            fi
        fi
    done

    echo "Pushing tags"
    git push "$NewREMOTE" --force 'refs/tags/*'

    echo "Pushing replace refs (if any)"
    git push "$NewREMOTE" --force 'refs/replace/*'
fi

# === LFS Push ===
echo "Pushing Git LFS objects..."
Get_LFS_Objects() {
    lfs_objects_dir="$1/lfs/objects"
    if [ -d "$lfs_objects_dir" ]; then
        lfs_objects=$(find "$lfs_objects_dir" -type f -printf "%f ")
        if [ -z "$lfs_objects" ]; then
            lfs_objects="NO_OBJECTS"
        fi
    else
        lfs_objects="NO_OBJECTS"
    fi
}
Get_LFS_Objects "$RepositoryFolderPathForBareCloneBAK"
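if [[ "$lfs_objects" != "NO_OBJECTS" ]]; then
    LFS_SPECIFIER="--object-id $lfs_objects"
    echo "Running lfs"

    # With "set -e" a failing command aborts the script before $? can be read,
    # so capture the exit code through "||" instead.
    retCode=0
    git lfs push "$NewREMOTE" $LFS_SPECIFIER || retCode=$?
    echo "LFS push exited with code: $retCode"
    [ "$retCode" -eq 0 ] || exit "$retCode"
else
    echo "No LFS objects to push."
fi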

cd "$ALocation"
echo "All done! Git and LFS data pushed successfully."







