RE: Incremental Backup of repositories using Git

<rsbecker@xxxxxxxxxxxxx> · Thu, 8 May 2025 16:06:08 -0400

On May 8, 2025 3:48 PM, Jeff King wrote:
>On Thu, May 08, 2025 at 08:47:47PM +0200, Michal Suchánek wrote:
>
>> If you have one of those filesystems that support deduplication on
>> filesystem level you could make each snapshot as a full repository
>> with all objects unpacked, and the filesystem would deduplicate the
>> objects for you.
>>
>> The downside is that you have no way to do multiple full backups this
>> way, and you would have to use something else for that (such as those
>> bundles, or plain archiving of the repository as files in a tar
>> archive).
>
>This is tempting, but I suspect that storing the objects unpacked will become
>unfeasibly large, because you are missing out on delta compression in the packfiles.
>You can compare the on-disk and uncompressed sizes of objects in a repo like this:
>
>  git cat-file --batch-all-objects --unordered \
>               --batch-check='%(objectsize:disk) %(objectsize)' |
>  perl -alne '
>    $disk += $F[0];
>    $true += $F[1];
>    END {
>      print "$true / $disk = ", int($true / $disk);
>    }
>  '
>
>It's not entirely fair because the "true" size is missing out on zlib compression that
>loose objects would get. But that's at best going to be about 4:1 (and in practice
>worse, since trees are full of sha1 hashes that don't compress very well).
>
>In my copy of linux.git, that yields ~135G versus ~2.4G, for a factor of 56. Even if we
>grant 4:1 compression from zlib, that's still inflating your on-disk repository by a
>factor of 14.
>
>If you have the patience, you can run:
>
>  git cat-file --batch-all-objects --unordered --batch | gzip | wc -c
>
>to get a better sense of what it looks like with the extra deflate (this is cheating a bit,
>because it will find cross-object compression opportunities which would not be
>there in loose objects storage, but should get you in the right ballpark).
>
>You're probably also paying some inode costs with loose objects (1K trees at the
>root of linux.git all pay 4K or whatever as individual loose objects).
>
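
To put a rough number on the per-file overhead mentioned above, here is a
minimal sketch. It assumes every loose object occupies whole filesystem
blocks (4096 bytes by default) and it works from uncompressed sizes, so it
ignores zlib and inode metadata. The helper name block_rounded_total is
made up for illustration and is not part of Git.

```shell
# Hypothetical helper: read one object size per line on stdin and print the
# total number of bytes used if each object occupies whole blocks.
# $1 is the assumed block size in bytes (defaults to 4096).
block_rounded_total() {
    awk -v bs="${1:-4096}" '
        # Round each size up to a whole number of blocks.
        { blocks += int(($1 + bs - 1) / bs) }
        # %.0f avoids exponent notation for very large totals.
        END { printf "%.0f\n", blocks * bs }
    '
}

# Example use inside a repository (a rough estimate only):
#   git cat-file --batch-all-objects --unordered \
#                --batch-check='%(objectsize)' | block_rounded_total 4096
```

Comparing that total with the plain sum of %(objectsize) gives a feel for
how much is lost to block rounding alone.
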
>So you're probably much better off with some strategy based on .keep files. I.e., make
>a good big pack and mark it with .keep, so that it is retained forever.
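
A minimal sketch of the .keep idea, under the assumption that the backup
copies each kept pack once and afterwards only copies the newer, smaller
packs. The helper name keep_all_packs is hypothetical and not part of Git.
It relies on the documented behavior that repacking leaves packs with a
matching .keep file alone by default.

```shell
# Hypothetical helper: create a .keep file next to every packfile in the
# given pack directory and print the path of each .keep file.
keep_all_packs() {
    pack_dir=$1
    for pack in "$pack_dir"/pack-*.pack; do
        # If the glob matched nothing, the literal pattern remains; skip it.
        [ -e "$pack" ] || continue
        keep="${pack%.pack}.keep"
        # Create an empty .keep file unless one already exists.
        [ -e "$keep" ] || : >"$keep"
        echo "$keep"
    done
}

# Example use inside a repository:
#   git repack -a -d      # consolidate everything into one big pack
#   keep_all_packs "$(git rev-parse --git-path objects/pack)"
# A later "git repack -a -d" then packs only objects outside the kept packs.
```
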

As a possible alternative, would some kind of information presented via the proposed
git blame-tree series (or call it git annotate-tree perhaps) be useful for this enhancement?
I am not sure what the results will look like, but they might be useful and could then
be cached by the backup strategy. I'm grasping at straws, though.

--Randall