Re: [PATCH v4 4/6] pack-objects: generate cruft packs at most one object over threshold

On Tue, Mar 11, 2025 at 02:59:10PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@xxxxxxxxxxxx> writes:
>
> > When generating multiple cruft packs with 'git repack --max-cruft-size',
> > we use 'git pack-objects --cruft --max-pack-size' (with many other
> > elided options), filling in the '--max-pack-size' value with whatever
> > was provided via the '--max-cruft-size' flag.
> >
> > This causes us to generate a pack that is smaller than the specified
> > threshold. This poses a problem since we will never be able to generate
> > a cruft pack that crosses the threshold.
>
> So far I see absolutely *NO* problem described in the above.  The
> user said "I want to chop them into 200MB pieces but do not exceed
> the threshold" and the system honored that wish.
>
> > In effect, this means that we
> > will try and repack its contents over and over again.
>
> The end effect, however, may be problematic, but isn't it due to the
> way we decide when to repack?  You see a 199MB cruft pack plus some
> other cruft data.  You have generated no new cruft and no existing
> cruft has expired out, but you do not know these facts until you
> try to repack.  Because 200MB is the limit, you include the 199MB
> pack among the ones to be recombined into the new cruft pack,
> because 199MB is smaller than 200MB, and you do not know that the
> reason it is 199MB is that the earlier repack operation found all
> remaining cruft material to be larger than 1MB; if there had been
> a 0.5MB cruft object, it might have come closer to 200MB.
>
> So would it be feasible to remember how the 199MB cruft pack came to
> be lying in the object store (i.e. earlier we packed as much as
> possible), and add logic that says "if there is nothing to expire
> out of this one, do not attempt to repack---this is fine as-is"?

I had a similar thought when first thinking about multi-cruft packs,
but that line of thinking is somewhat flawed. When we do a pruning GC,
the vast majority of cruft objects should be expired out of the
repository, leaving only the recent ones whose mtimes are newer than
the cutoff. So most of the cruft packs in this case should be removed
entirely, and the small amount of cruft data remaining can be repacked
into a small number of packs relatively quickly.

> > Instead, change the meaning of '--max-pack-size' in pack-objects when
> > combined with '--cruft'. When put together, '--max-pack-size' allows the
> > pack to grow larger than the specified threshold, but only by one
> > additional object.
>
> I do not think that would work well.  You have no control over the
> size of that one additional object---it may weigh more than 100MB,
> combining your 199MB cruft pack with something else to make a ~300MB
> cruft pack.  In other words, "just a little bit larger" sounds like
> wishful-thinking handwaving.

I think it is somewhat of a handwave, but I would note that our
current rules around --max-pack-size are not as strict as I
originally thought. If you have a single 100MB object and your pack
size limit is 50MB, then pack-objects will already generate a 100MB
pack today containing just that object. In other words, the limit is
already a soft bound when a single object exceeds it.
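
To spell out the splitting behavior I am describing, here is a tiny
toy model (my own sketch, not pack-objects' actual code; the exact
comparison and size accounting are simplified assumptions): the
boundary is only consulted once the pack-in-progress already holds an
object, so a lone object larger than the limit still gets a pack of
its own.

# Toy model of --max-pack-size splitting (not Git's real code).
def split(sizes_mb, limit_mb):
    packs, current = [], []
    for size in sizes_mb:
        # Only start a new pack if the current one already has an object.
        if current and sum(current) + size >= limit_mb:
            packs.append(current)
            current = []
        current.append(size)
    if current:
        packs.append(current)
    return packs

print(split([100], limit_mb=50))   # -> [[100]]: one 100MB pack, over the limit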

Here is the case that I am worried about:

Suppose you have a 100MB cruft limit, and there are two cruft packs in
the repository: one that is 99MB and another that is 1MB in size. Let's
suppose further that if you combine these two packs, the resulting pack
would be exactly 100MB in size.

Today, repack will say, "I have two packs whose sizes sum to the value
of --max-cruft-size", and mark them both to be removed (and
replaced with the combined pack generated by pack-objects). But if the
combined pack is exactly 100MB, then pack-objects will break the pack
into two just before the 100MB limit, and we'll end up with the same two
packs we started with.

Ideally we would combine those packs into a single pack that is at
most one object's size larger than the threshold, and in the steady
state we would avoid repacking it further. But in current Git we will
keep repacking the two together, only to regenerate the same two packs
we started with, forever.

So I think a reasonable stop-gap here is to let pack-objects generate
cruft packs that are allowed to grow *just* beyond the --max-pack-size
threshold, by at most one object. Yes, that one extra object can be
arbitrarily large, so it is possible to end up with a pack that is
significantly larger than the threshold.
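
To illustrate with the 99MB + 1MB example above, here is a sketch that
extends the toy model from earlier in this message (again my own
simplification, not the real pack-objects code; I am assuming a pack
that reaches the limit exactly triggers the split, per the scenario
above). The strict rule keeps reproducing the same two packs, while
allowing at most one object past the threshold settles into a single
pack:

# Toy contrast of the strict rule vs. the proposed "at most one object
# over the threshold" rule (not Git's actual code).
def split(sizes_mb, limit_mb, one_object_over=False):
    packs, current = [], []
    for size in sizes_mb:
        # Strict rule: split *before* the pack would reach the limit.
        if not one_object_over and current and sum(current) + size >= limit_mb:
            packs.append(current)
            current = []
        current.append(size)
        # Proposed rule: split only *after* the pack has crossed the limit.
        if one_object_over and sum(current) >= limit_mb:
            packs.append(current)
            current = []
    if current:
        packs.append(current)
    return packs

print(split([99, 1], 100))                        # -> [[99], [1]]: same two packs, forever
print(split([99, 1], 100, one_object_over=True))  # -> [[99, 1]]: one 100MB pack, steady state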

But --max-pack-size in conjunction with --cruft is not being used in
its original spirit, which was to work around filesystems that don't
handle large files well. Instead, its utility here is to bound the
amount of repacking we have to do when generating cruft packs in
repositories that have many unreachable objects.

In other words, if my --max-cruft-size is 1G and I have 20GB of cruft
data, I am less concerned about generating a pack that is 1.1G in size
than I am about repacking the same 20GB over and over again each time
I want to add a single unreachable object.

Thanks,
Taylor
