Kyle Lippincott <spectral@xxxxxxxxxx> writes:

>> Can anyone explain what is causing the irreproducibility?  Running
>> diffoscope is not helpful, since the bundle is compressed and diffoscope
>> doesn't seem to know how to untangle it.
>
> Spent some time on this, and when I followed the instructions, the
> diffs were in the pack file portion of the bundle file, different
> "tree" objects were produced at different points in the pack file. But
> it produces identical bundles if I run `git bundle create` multiple
> times in the same clone. My guess is that the non-determinism is
> coming from the clone process being multi-threaded, meaning that the
> order things are created in the filesystem during the clone,
> presumably due to multithreading happening during the clone process,
> or maybe during gc? The contents of .git/objects/pack have different
> hashes across my two clones, and I haven't investigated why.

Yes, my perception is also that the reproducibility problem happens
during 'git clone'.  Within the same git clone, it is no problem to
create a bit-by-bit reproducible git bundle.  But if you work in two
different clones, I haven't been able to find any set of commands that
leads to identical results.

FWIW, some other ways to do the clone that I have tried but didn't get
to work (of course I may have made some mistake in my attempts):

# dumb protocol doesn't repack the objects
GIT_SMART_HTTP=0 git clone https://git.savannah.gnu.org/git/gnulib.git

# using rsync fetches a .git identical to upstream
rsync -av git.savannah.gnu.org::git/gnulib.git/ gnulib

>> If this is not possible today, what do you think about changes to make
>> this work?
>
> What is your end goal with being able to reproduce the bundles?

Good question - I should have made that clear.  The end goal is for
someone other than me, as uploader of the gnulib git bundle, to be able
to re-create it bit-by-bit identical.  This pursuit is in the name of
improved software supply-chain security.  Compare efforts to make gzip
and tarball files reproducible by others:

https://www.gnu.org/software/tar/manual/html_node/Reproducibility.html
https://www.gnu.org/software/gzip/manual/html_node/Environment.html

> Producing an identical bit-for-bit bundle might be doable by doing
> some form of sorting of the objects in the pack file, but this would
> only get us closer to bit-for-bit reproducibility *on the same machine
> and versions of everything*. There could be some changes to git, zlib,
> machine architecture, etc. that causes deterministic but different
> values to be produced. As an example, maybe future versions of zlib
> compress better, producing an equal result when decompressed, but a
> different compressed result.

That would still be an improvement compared to today's situation, where
nobody can reproduce the git bundle at all.  Being able to reproduce it
using the same environment (toolchain) is better than that.  This is
similar to reproducible builds of binaries: typically you need to
reproduce a similar environment to get reproducible results.

/Simon
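
PS. For anyone who wants to repeat the same-clone observation above
without cloning gnulib, here is a minimal sketch.  It uses a throwaway
local repository; the file contents, the commit identity and the
a.bundle/b.bundle names are made up for illustration, and a toy
repository like this may not exercise whatever differs between two
real clones.

```shell
#!/bin/sh
# Sketch: create two bundles from the SAME repository and compare them
# byte for byte.  Repository contents and names are illustrative only.
set -eu

work=$(mktemp -d)
git init -q "$work/repo" 2>/dev/null
cd "$work/repo"

echo hello > file.txt
git add file.txt
# Identity is passed inline so the sketch does not depend on global config.
git -c user.name=Example -c user.email=example@example.invalid \
    commit -q -m "initial commit"

# Two bundles of all refs, created back to back in the same repository.
git bundle create ../a.bundle --all 2>/dev/null
git bundle create ../b.bundle --all 2>/dev/null

# cmp -s is silent and exits 0 only if the files are byte-identical.
if cmp -s ../a.bundle ../b.bundle; then
    echo "bundles are identical"
else
    echo "bundles differ"
fi
```

Running the same comparison on bundles made in two separate clones of a
real repository is where I see the differences.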