Re: [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects'

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Mar 10, 2025 at 10:28:22AM -0700, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:
>
> > ... deltas across path boundaries. This second pass is much faster than a fresh
> > pass since the existing deltas are used as a limit for the size of
> > potentially new deltas, short-circuiting the checks when the delta size
> > exceeds the current-best.
>
> Very nice.
>
> > The microsoft/fluentui is a public Javascript repo that suffers from many of
> > the name hash collisions as internal repositories I've worked with. Here is
> > a comparison of the compressed size and end-to-end time of the repack:
> >
> > Repack Method    Pack Size       Time
> > ---------------------------------------
> > Hash v1             439.4M      87.24s
> > Hash v2             161.7M      21.51s
> > Path Walk           142.5M      28.16s

OK, so microsoft/fluentui benefits from the path-walk approach in the
size of the resulting pack, but at the cost of additional time to
generate it.

> > Less dramatic, but perhaps more standardly structured is the nodejs/node
> > repository, with these stats:
> >
> > Repack Method       Pack Size       Time
> > ------------------------------------------
> > Hash v1                739.9M      71.18s
> > Hash v2                764.6M      67.82s
> > Path Walk              698.0M      75.10s

Same here.

> > Even the Linux kernel repository gains some benefits, even though the number
> > of hash collisions is relatively low due to a preference for short
> > filenames:
> >
> > Repack Method       Pack Size       Time
> > ------------------------------------------
> > Hash v1                  2.5G     554.41s
> > Hash v2                  2.5G     549.62s
> > Path Walk                2.2G     559.00s

OK, so here the savings are a little more substantial, and the
performance hit isn't too bad.

> This third one, v2 not performing much better than v1, is quite
> surprising.

I'm not sure... I think Stolee's "the number of hash collisions is
relatively low due to preference for short filenames" is why v2 behaves
so similarly to v1 here.

> > The drawbacks of the --path-walk feature is that it will be harder to
> > integrate it with bitmap features, specifically delta islands. This is not
> > insurmountable, but would require more work, such as a revision walk to
> > paint objects with reachability information before using that during delta
> > computations.
> >
> > However, there should still be significant benefits to Git clients trying to
> > save space and improve local performance.
>
> Sure.  More experiments and more approaches will eventually give us
> overall improvement.  I am hoping that we will be able to condense
> the result of these different approaches and their combinations into
> easy-to-choose-from canned choices (as opposed to a myriad of little
> knobs the users need to futz with without really understanding what
> they are tweaking).

In the above three examples we see some trade-offs between pack size and
the time it took to generate it. I think it's worth discussing whether
or not the potential benefit of such a trade-off is worth the
significant complexity and code that this feature will introduce. (To be
clear, I don't have a strong opinion here one way or the other, but I do
think that it's at least worth discussing).

I wonder how much of the benefits of path-walk over the hash v2 approach
could be had by simply widening the pack.window during delta selection?

I tried to run a similar experiment as you did above on the
microsoft/fluentui repository and got the following:

    Repack Method       Pack Size       Time
    ------------------------------------------
    Hash v1              447.2MiB      932.41s
    Hash v2              154.1MiB      404.35s
    Hash v2 (window=20)  146.7MiB      472.66s
    Hash v2 (window=50)  138.3MiB      622.13s
    Path Walk            140.8MiB      168.86s

In your experiment above on the same repository, the path walk feature
represents an 11.873% reduction in pack size, but at the cost of a 30.9%
regression in runtime.

When I set pack.window to "50" (over the default value of "10"), I get a
~10.3% reduction in pack size at the cost of a 54% increase in runtime
(relative to just --name-hash-version=2 with the default pack.window
settings).

But when I set the pack.window to "20", the relative values (again
comparing against --name-hash-version=2 with the default pack.window)
are 4.8% reduction in pack size and a 16.9% increase in runtime.

But these numbers are pretty confusing to me, TBH. The reduction in pack
sizes makes sense, and here I see numbers that are on-par with what you
noted above for the same repository. But the runtimes are wildly
different (e.g., hash v1 takes you just 87s while mine takes 932s).

There must be something in our environment that is different. I'm
starting with a bare clone of microsoft/fluentui from GitHub, and made
several 'cp -al' copies of it for the different experiments. In the
penultimate one, I ran:

    $ time git.compile -c pack.window=50 repack --name-hash-version=2 \
        -adF --no-write-bitmap-index

, and similarly for the other experiments with appropriate values for
pack.window, --name-hash-version, and --path-walk, when applicable. All
of this was done on a -O2 build of Git with your patches on top.

So I'm not sure what to make of these results. Clearly on my machine
something is different that makes path-walk much faster than hash v2.
But on your machine it's slower, so I don't know how much I trust the
timing results from either machine.

In any event, it seems like at least in this example we can get
performance that is on-par with path-walk by simply widening the
pack.window when using hash v2. On my machine that seems to cost more
time than it does for you to the point where it's slower than my
path-walk. But I think I need to understand what the differences are
here before we can draw any conclusions on the size or timing.

If the overwhelming majority of cases where the --path-walk feature
presents a significant benefit over hash v2 at various pack.window sizes
(where we could get approximately the same reduction in pack size with
approximately the same end-to-end runtime of 'git repack'), then I feel
we might want to reconsider whether or not the complexity of this feature
is worthwhile.

But if the --path-walk feature either gives us a significant size
benefit that we can't get with hash v2 and a wider pack.window without
paying a significant runtime cost (or vice-versa), then this feature
would indeed be worthwhile.

I also have no idea how representative the above is of your intended
use-case, which seems much more oriented around pushes than from-scratch
repacks, which would also affect our conclusions here.

Thanks,
Taylor




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux