Re: [PATCH v4 4/6] pack-objects: generate cruft packs at most one object over threshold

On Wed, Mar 12, 2025 at 12:02 PM Taylor Blau <me@xxxxxxxxxxxx> wrote:
>
> On Wed, Mar 12, 2025 at 11:26:53AM -0700, Junio C Hamano wrote:
> > Taylor Blau <me@xxxxxxxxxxxx> writes:
> >
> > >> So would it be feasible to remember that a 199MB cruft pack is lying
> > >> in the object store (i.e. earlier we packed as much as possible), and
> > >> add logic that says "if there is nothing to expire out of this one,
> > >> do not attempt to repack---this is fine as-is"?
> > > .... So
> > > the majority of packs in this case should all be removed, and the small
> > > amount of cruft data remaining can be repacked into a small number of
> > > packs relatively quickly.
> >
> > Given the above ...
> >
> > > Suppose you have a 100MB cruft limit, and there are two cruft packs in
> > > the repository: one that is 99MB and another that is 1MB in size. Let's
> > > suppose further that if you combine these two packs, the resulting pack
> > > would be exactly 100MB in size.
> > >
> > > Today, repack will say, "I have two packs that sum together to be the
> > > value of --max-cruft-size", and mark them both to be removed (and
> > > replaced with the combined pack generated by pack-objects).
> >
> > ... yes, this logic to reach the above decision is exactly what I
> > said is broken.  Is there no way to fix that?
> >
> > > But if the
> > > combined pack is exactly 100MB, then pack-objects will break the pack
> > > into two just before the 100MB limit, and we'll end up with the same two
> > > packs we started with.
> >
> > If "the majority of packs should all be removed and remainder combined"
> > you stated earlier is true, then this case falls in a tiny minority
> > that we do not have to worry about, doesn't it?
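
(Butting in here to make the fixed point concrete: below is a tiny
standalone sketch of the two decisions involved.  The numbers come from
the example quoted above; the code is purely illustrative and is not
the actual repack.c or pack-objects logic.)

  /* Illustrative sketch, not the real repack.c / pack-objects code. */
  #include <stdio.h>

  int main(void)
  {
      long max_cruft = 100;      /* MB, the --max-cruft-size limit */
      long packs[] = { 99, 1 };  /* the two existing cruft packs */
      long combined = packs[0] + packs[1];

      /*
       * repack's side: the sum does not exceed the threshold, so both
       * packs are marked for removal and handed to pack-objects to be
       * rewritten as one pack.
       */
      if (combined <= max_cruft)
          printf("repack: rewrite both packs as one (%ldMB)\n", combined);

      /*
       * pack-objects' side: it flushes to a new pack just before
       * crossing the same 100MB limit, so the "one" pack comes out as
       * a 99MB pack plus a 1MB pack: exactly the packs we started
       * with, and the next repack repeats the whole exercise.
       */
      return 0;
  }
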
>
> Yeah, it is a niche case. But the more I think about it, the more I
> see it your way. I apologize for all of the back-and-forth here; this
> is quite a tricky problem to think through (at least for me), so I
> appreciate your patience.
>
> The original implementation in repack was designed to aggregate smaller
> cruft packs together first, for as long as the combined size stays
> within the threshold. So the above would all be true if no new unreachable objects
> were ever added to the repository, but if another 1MB cruft pack
> appears, then we would:
>
>   - See the first 1MB pack, and decide we can repack it as it's under
>     the 100MB threshold.
>
>   - See the second 1MB pack, and repack it for similar reasons (this
>     time because 1+1<100, not 1<100).
>
>   - See the 100MB pack, and refuse to repack it because the combined
>     size of 102MB would be over the threshold.
>
> So I think that if we keep the current behavior of repacking the
> smaller ones first, this case is niche enough that I feel OK not
> worrying about it too much.
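
(Again, as a standalone illustration: the walk Taylor describes above,
in miniature.  Same caveat as before, this is a sketch of the described
behavior with made-up structure, not the real repack.c code.)

  /* Sketch of the smaller-first aggregation walk described above. */
  #include <stdio.h>

  int main(void)
  {
      long max_cruft = 100;          /* MB */
      long packs[] = { 1, 1, 100 };  /* cruft packs, smallest first */
      long total = 0;

      for (int i = 0; i < 3; i++) {
          /* a pack joins only if the running sum stays within the cap */
          if (total + packs[i] <= max_cruft) {
              total += packs[i];
              printf("combine %ldMB pack (running total %ldMB)\n",
                     packs[i], total);
          } else {
              /* 1 + 1 + 100 = 102MB > 100MB, so leave it alone */
              printf("skip %ldMB pack (%ldMB would exceed the cap)\n",
                     packs[i], total + packs[i]);
          }
      }
      return 0;
  }
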
>
> And, yes, breaking --max-pack-size when given with --cruft is ugly.
>
> > > But in current Git we will keep repacking
> > > the two together, only to generate the same two packs we started with
> > > forever.
> >
> > Yes.  That is because the logic that decides these packs need to be
> > broken and recombined is flawed.  Maybe it does not have sufficient
> > information to decide that it is no use to attempt combining them; in
> > that case it would help for the initial repack to leave some extra
> > info that lets a later invocation of repack tell that combining these
> > packs would be useless, which was what I suggested.  You've thought
> > about the issue much longer than I did, and would be able to come up
> > with better ideas.
>
> I think in the short term I came up with a worse idea than you would
> have ;-).
>
> There is probably a way to improve this niche case as described above,
> but I think the solution space is complicated enough, and the case
> narrow enough, that it's not worth introducing that much complexity.

Would it make sense to break the assumption that --max-cruft-size ==
--max-pack-size and perhaps rename the former?  I think the problem is
that the two imply different things (one is a minimum, the other a
maximum), and thus they really should be different values.  E.g. a
--combine-cruft-below-size that is set to, say, half of
--max-pack-size: then you can keep combining cruft packs together
until they grow past the cruft threshold, while never actually
exceeding the pack size limit.
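
Here is a tiny sketch of the selection rule I have in mind.  To be
clear, the flag name and the "half of --max-pack-size" default are a
strawman, not anything that exists in git today:

  /* Strawman per-pack rule for a hypothetical --combine-cruft-below-size. */
  #include <stdio.h>

  int main(void)
  {
      long combine_below = 50;      /* MB, e.g. half of --max-pack-size */
      long packs[] = { 99, 1, 1 };  /* the packs from the example above */

      for (int i = 0; i < 3; i++) {
          /*
           * Judge each pack on its own size rather than on a running
           * sum: once a cruft pack has grown to the threshold it is
           * never selected for rewriting again.
           */
          if (packs[i] < combine_below)
              printf("combine %ldMB pack\n", packs[i]);
          else
              printf("leave %ldMB pack alone\n", packs[i]);
      }
      /*
       * --max-pack-size (a true maximum) would still cap whatever
       * pack-objects writes out, independently of the rule above.
       */
      return 0;
  }

The point being that a per-pack minimum cannot reproduce the 99MB+1MB
fixed point from earlier in the thread: the 99MB pack is simply never
selected again, so there is nothing left to repack forever.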
