On May 8, 2025 3:48 PM, Jeff King wrote:
>On Thu, May 08, 2025 at 08:47:47PM +0200, Michal Suchánek wrote:
>
>> If you have one of those filesystems that support deduplication on
>> filesystem level you could make each snapshot as a full repository
>> with all objects unpacked, and the filesystem would deduplicate the
>> objects for you.
>>
>> The downside is that you have no way to do multiple full backups this
>> way, and you would have to use something else for that (such as those
>> bundles, or plain archiving the repository as files in a tar archive
>> or such.
>
>This is tempting, but I suspect that storing the objects unpacked will become
>unfeasibly large, because you are missing out on delta compression in the packfiles.
>You can compare the on-disk and uncompressed sizes of objects in a repo like this:
>
>    git cat-file --batch-all-objects --unordered \
>      --batch-check='%(objectsize:disk) %(objectsize)' |
>    perl -alne '
>      $disk += $F[0];
>      $true += $F[1];
>      END {
>        print "$true / $disk = ", int($true / $disk);
>      }
>    '
>
>It's not entirely fair because the "true" size is missing out on zlib compression that
>loose objects would get. But that's at best going to be about 4:1 (and in practice
>worse, since trees are full of sha1 hashes that don't compress very well).
>
>In my copy of linux.git, that yields ~135G versus ~2.4G, for a factor of 56. Even if we
>grant 4:1 compression from zlib, that's still inflating your on-disk repository by a
>factor of 14.
>
>If you have the patience, you can run:
>
>    git cat-file --batch-all-objects --unordered --batch | gzip | wc -c
>
>to get a better sense of what it looks like with the extra deflate (this is cheating a bit,
>because it will find cross-object compression opportunities which would not be
>there in loose objects storage, but should get you in the right ballpark).
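For anyone following along, the arithmetic above and the "cheating" caveat
about gzip can be illustrated with a small Python sketch. This is only a
rough illustration under stated assumptions: the function names are mine
(not part of git), the sample objects are made-up stand-ins for successive
versions of one file, and the loose-object header that git prepends before
deflating is ignored.

```python
import zlib


def inflation_factor(true_size, disk_size, zlib_ratio=1.0):
    """Estimate how much larger unpacked storage is than packed storage.

    true_size and disk_size are in the same unit; zlib_ratio is the
    assumed per-object zlib compression ratio (4.0 means 4:1).
    """
    return true_size / disk_size / zlib_ratio


def loose_vs_stream(objects):
    """Return (total size deflating each object alone, size of one stream).

    The first number models loose objects, which are compressed one at a
    time. The second models piping everything through a single gzip,
    which can reuse matches across object boundaries.
    """
    per_object = sum(len(zlib.compress(obj)) for obj in objects)
    stream = len(zlib.compress(b"".join(objects)))
    return per_object, stream


# Figures quoted above for linux.git: ~135G uncompressed, ~2.4G on disk.
print(int(inflation_factor(135, 2.4)))       # 56
print(int(inflation_factor(135, 2.4, 4.0)))  # 14

# Hypothetical, highly similar objects (not real repository data).
objects = [b"int main(void) { return %d; }\n" % i * 20 for i in range(50)]
per_object, stream = loose_vs_stream(objects)
print(per_object, stream)
```

With similar objects like these, the single stream comes out much smaller
than the per-object total, which is why the gzip pipeline is a lower bound
on loose-object storage rather than an exact figure.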
>
>You're probably also paying some inode costs with loose objects (1K trees at the
>root of linux.git all pay 4K or whatever as individual loose objects).
>
>So you're probably much better off with some strategy .keep files. I.e., make a good
>big pack and mark it with .keep, so that it is retained forever.

As a possible alternative, would some kind of information presented via the
proposed git blame-tree series (or call it git annotate-tree, perhaps) be
useful for this enhancement? I am not sure what the results would look like,
but the output might be useful and could then be cached by the backup
strategy. I'm grasping at straws, though.

--Randall