On Mon, Mar 10, 2025 at 10:28:22AM -0700, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:
>
> > ... deltas across path boundaries. This second pass is much faster than a fresh
> > pass since the existing deltas are used as a limit for the size of
> > potentially new deltas, short-circuiting the checks when the delta size
> > exceeds the current-best.
>
> Very nice.
>
> > The microsoft/fluentui is a public Javascript repo that suffers from many of
> > the name hash collisions as internal repositories I've worked with. Here is
> > a comparison of the compressed size and end-to-end time of the repack:
> >
> > Repack Method    Pack Size       Time
> > ---------------------------------------
> > Hash v1             439.4M     87.24s
> > Hash v2             161.7M     21.51s
> > Path Walk           142.5M     28.16s

OK, so microsoft/fluentui benefits from the path-walk approach in the
size of the resulting pack, but at the cost of additional time to
generate it.

> > Less dramatic, but perhaps more standardly structured is the nodejs/node
> > repository, with these stats:
> >
> > Repack Method       Pack Size       Time
> > ------------------------------------------
> > Hash v1                739.9M     71.18s
> > Hash v2                764.6M     67.82s
> > Path Walk              698.0M     75.10s

Same here.

> > Even the Linux kernel repository gains some benefits, even though the number
> > of hash collisions is relatively low due to a preference for short
> > filenames:
> >
> > Repack Method       Pack Size       Time
> > ------------------------------------------
> > Hash v1                  2.5G    554.41s
> > Hash v2                  2.5G    549.62s
> > Path Walk                2.2G    559.00s

OK, so here the savings are a little more substantial, and the
performance hit isn't too bad.

> This third one, v2 not performing much better than v1, is quite
> surprising.

I'm not sure... I think Stolee's "the number of hash collisions is
relatively low due to preference for short filenames" is why v2 behaves
so similarly to v1 here.

> > The drawbacks of the --path-walk feature is that it will be harder to
> > integrate it with bitmap features, specifically delta islands. This is not
> > insurmountable, but would require more work, such as a revision walk to
> > paint objects with reachability information before using that during delta
> > computations.
> >
> > However, there should still be significant benefits to Git clients trying to
> > save space and improve local performance.
>
> Sure. More experiments and more approaches will eventually give us
> overall improvement. I am hoping that we will be able to condense
> the result of these different approaches and their combinations into
> easy-to-choose-from canned choices (as opposed to a myriad of little
> knobs the users need to futz with without really understanding what
> they are tweaking).

In the above three examples we see some trade-offs between pack size
and the time it took to generate it. I think it's worth discussing
whether or not the potential benefit of such a trade-off is worth the
significant complexity and code that this feature will introduce. (To
be clear, I don't have a strong opinion here one way or the other, but
I do think that it's at least worth discussing).

I wonder how much of the benefit of path-walk over the hash v2
approach could be had by simply widening the pack.window during delta
selection?

I tried to run an experiment similar to the one you ran above on the
microsoft/fluentui repository and got the following:

    Repack Method         Pack Size       Time
    ------------------------------------------
    Hash v1                447.2MiB    932.41s
    Hash v2                154.1MiB    404.35s
    Hash v2 (window=20)    146.7MiB    472.66s
    Hash v2 (window=50)    138.3MiB    622.13s
    Path Walk              140.8MiB    168.86s

In your experiment above on the same repository, the path walk feature
represents an 11.873% reduction in pack size relative to hash v2, but
at the cost of a 30.9% regression in runtime.

When I set pack.window to "50" (over the default value of "10"), I get
a ~10.3% reduction in pack size at the cost of a 54% increase in
runtime (relative to just --name-hash-version=2 with the default
pack.window settings).

But when I set the pack.window to "20", the relative values (again
comparing against --name-hash-version=2 with the default pack.window)
are a 4.8% reduction in pack size and a 16.9% increase in runtime.

But these numbers are pretty confusing to me, TBH. The reduction in
pack sizes makes sense, and here I see numbers that are on par with
what you noted above for the same repository. But the runtimes are
wildly different (e.g., hash v1 takes you just 87s while mine takes
932s).

There must be something in our environments that is different. I'm
starting with a bare clone of microsoft/fluentui from GitHub, and made
several 'cp -al' copies of it for the different experiments. In the
penultimate one, I ran:

    $ time git.compile -c pack.window=50 repack --name-hash-version=2 \
        -adF --no-write-bitmap-index

, and similarly for the other experiments with appropriate values for
pack.window, --name-hash-version, and --path-walk, when applicable. All
of this was done on a -O2 build of Git with your patches on top.

So I'm not sure what to make of these results. Clearly on my machine
something is different that makes path-walk much faster than hash v2.
But on your machine it's slower, so I don't know how much I trust the
timing results from either machine.

In any event, it seems like at least in this example we can get a pack
size that is on par with path-walk by simply widening the pack.window
when using hash v2. On my machine that seems to cost more time than it
does for you, to the point where it's slower than my path-walk. But I
think I need to understand what the differences are here before we can
draw any conclusions on the size or timing.

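For anyone who wants to double-check the relative figures quoted above,
here is a minimal Python sketch that recomputes them from the two
fluentui tables. The helper names (pct_change, report) are made up for
illustration; the inputs are copied from the tables, and each
comparison stays within a single machine's numbers, so the "M" vs "MiB"
units do not matter.

```python
def pct_change(baseline: float, value: float) -> float:
    """Signed percentage change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100.0


def report(label: str, baseline: tuple, result: tuple) -> str:
    """Format size and time changes for one (pack size, seconds) pair."""
    size = pct_change(baseline[0], result[0])
    secs = pct_change(baseline[1], result[1])
    return f"{label}: size {size:+.1f}%, time {secs:+.1f}%"


# (pack size, seconds) for hash v2 with the default pack.window.
stolee_v2 = (161.7, 21.51)
taylor_v2 = (154.1, 404.35)

# Rows compared against the hash v2 baseline from the same machine.
stolee_rows = {"Path Walk": (142.5, 28.16)}
taylor_rows = {
    "Hash v2 (window=20)": (146.7, 472.66),
    "Hash v2 (window=50)": (138.3, 622.13),
    "Path Walk": (140.8, 168.86),
}

for label, row in stolee_rows.items():
    print("Stolee,", report(label, stolee_v2, row))
for label, row in taylor_rows.items():
    print("Taylor,", report(label, taylor_v2, row))
```

Negative values mean a smaller pack or a shorter run than hash v2 with
the default window on the same machine.
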
If, in the overwhelming majority of cases where the --path-walk feature
presents a significant benefit over hash v2, we could get approximately
the same reduction in pack size with approximately the same end-to-end
runtime of 'git repack' by using hash v2 at various pack.window sizes,
then I feel we might want to reconsider whether or not the complexity
of this feature is worthwhile.

But if the --path-walk feature gives us a significant size benefit
that we can't get with hash v2 and a wider pack.window without paying a
significant runtime cost (or vice-versa), then this feature would
indeed be worthwhile.

I also have no idea how representative the above is of your intended
use-case, which seems much more oriented around pushes than
from-scratch repacks, which would also affect our conclusions here.

Thanks,
Taylor