On Wed, Mar 12, 2025 at 12:02 PM Taylor Blau <me@xxxxxxxxxxxx> wrote:
>
> On Wed, Mar 12, 2025 at 11:26:53AM -0700, Junio C Hamano wrote:
> > Taylor Blau <me@xxxxxxxxxxxx> writes:
> >
> > >> So would it be feasible to remember how the 199MB cruft pack is
> > >> lying in the object store (i.e. earlier we packed as much as
> > >> possible), and add a logic that says "if there is nothing to
> > >> expire out of this one, do not attempt to repack---this is fine
> > >> as-is"?
> > > .... So the majority of packs in this case should all be
> > > removed, and the small amount of cruft data remaining can be
> > > repacked into a small number of packs relatively quickly.
> >
> > Given the above ...
> >
> > > Suppose you have a 100MB cruft limit, and there are two cruft
> > > packs in the repository: one that is 99MB and another that is
> > > 1MB in size. Let's suppose further that if you combine these two
> > > packs, the resulting pack would be exactly 100MB in size.
> > >
> > > Today, repack will say, "I have two packs that sum together to
> > > be the value of --max-cruft-size", and mark them both to be
> > > removed (and replaced with the combined pack generated by
> > > pack-objects).
> >
> > ... yes, this logic to reach the above decision is exactly what I
> > said is broken. Is there no way to fix that?
> >
> > > But if the combined pack is exactly 100MB, then pack-objects
> > > will break the pack into two just before the 100MB limit, and
> > > we'll end up with the same two packs we started with.
> >
> > If "the majority of packs should all be removed and the remainder
> > combined", as you stated earlier, is true, then this case falls in
> > a tiny minority that we do not have to worry about, doesn't it?
>
> Yeah, it is a niche case. But the more I think about it, the more I
> see it your way. I apologize for all of the back-and-forth here;
> this is quite a tricky problem to think through (at least for me),
> so I appreciate your patience.
>
> The original implementation in repack was designed to aggregate
> smaller cruft packs together first, until the combined size exceeds
> the threshold. So the above would all be true if no new unreachable
> objects were ever added to the repository, but if another 1MB cruft
> pack appears, then we would:
>
> - See the first 1MB pack, and decide we can repack it as it's under
>   the 100MB threshold.
>
> - See the second 1MB pack, and repack it for similar reasons (this
>   time because 1+1<100, not 1<100).
>
> - See the 100MB pack, and refuse to repack it because the combined
>   size of 102MB would be over the threshold.
>
> So I think it's reasonable that, if we keep the current behavior of
> repacking the smaller ones first, this case is niche enough for me
> to feel OK not worrying about it too much.
>
> And, yes, breaking --max-pack-size when given with --cruft is ugly.
>
> > > But in current Git we will keep repacking the two together, only
> > > to generate the same two packs we started with, forever.
> >
> > Yes. That is because the logic that decides these packs need to be
> > broken and recombined is flawed. Maybe it does not have sufficient
> > information to decide that it is no use to attempt combining them,
> > in which case leaving some more info to help the later invocation
> > of repack tell that it would be useless to attempt combining these
> > packs when you do the initial repack would help, which was what I
> > suggested. You've thought about the issue much longer than I did,
> > and would be able to come up with better ideas.
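To make sure I follow the loop being described above: here is a toy
model (hand-written C with made-up numbers, not git's actual repack
code) of the greedy rule as I read it, where packs are marked for
combining as long as the running total stays within --max-cruft-size:

    #include <stdio.h>

    int main(void)
    {
            /* Hypothetical sizes from the example above, in MB. */
            unsigned long max_cruft_size = 100;
            unsigned long packs[] = { 1, 99 }; /* smallest first */
            unsigned long total = 0;
            size_t i;

            for (i = 0; i < sizeof(packs) / sizeof(*packs); i++) {
                    if (total + packs[i] <= max_cruft_size) {
                            total += packs[i];
                            printf("combine %luMB (total %luMB)\n",
                                   packs[i], total);
                    } else {
                            printf("leave %luMB alone\n", packs[i]);
                    }
            }
            /*
             * Both packs get marked (1 + 99 <= 100), but pack-objects
             * is also given --max-pack-size=100MB, so it splits the
             * combined output just under the limit and we end up with
             * the same 99MB and 1MB packs: a fixed point that gets
             * repacked forever.
             */
            return 0;
    }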
> I think in the short term I came up with a worse idea than you would
> have ;-).
>
> Probably there is a way to improve this niche case as described
> above, but I think the solution space is probably complicated
> enough, given how narrow the case is, that it's not worth
> introducing that much complexity.

Would it make sense to break the assumption that --max-cruft-size ==
--max-pack-size, and perhaps rename the former? I think the problem is
that the two imply different things (one is a minimum, the other a
maximum), and thus they really should be different values.

E.g. a --combine-cruft-below-size that is set to, say, half of
--max-pack-size: then you can continue combining cruft packs together
until they grow above the cruft threshold, while avoiding ever
actually exceeding the pack size threshold?
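To sketch what I mean (the flag name and the halving are hypothetical,
and this is a toy simulation, not a patch): only packs individually
smaller than the combine threshold are eligible to be rolled up, and
--max-pack-size alone caps the output.

    #include <stdio.h>

    int main(void)
    {
            /*
             * Hypothetical values: --max-pack-size=100MB and a
             * --combine-cruft-below-size of half that.
             */
            unsigned long combine_below = 50;
            unsigned long packs[] = { 1, 1, 100 }; /* MB */
            size_t n = sizeof(packs) / sizeof(*packs), i, kept = 0;
            unsigned long combined = 0;

            for (i = 0; i < n; i++) {
                    if (packs[i] < combine_below)
                            combined += packs[i]; /* roll it up */
                    else
                            packs[kept++] = packs[i]; /* leave alone */
            }
            if (combined)
                    packs[kept++] = combined;

            for (i = 0; i < kept; i++)
                    printf("resulting pack: %luMB\n", packs[i]);
            /*
             * Prints 100MB and 2MB. Later runs keep folding new small
             * cruft packs into the 2MB pack until it grows past
             * combine_below, after which it is left alone, so the
             * process converges instead of looping forever.
             */
            return 0;
    }

The point being that eligibility (a lower bound on what gets touched)
and output capping (an upper bound on what gets written) would be
controlled by two different knobs, so a pack that grows past the
eligibility bar simply drops out of consideration on later runs.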