On Tue, Mar 11, 2025 at 02:59:10PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@xxxxxxxxxxxx> writes:
>
> > When generating multiple cruft packs with 'git repack --max-cruft-size',
> > we use 'git pack-objects --cruft --max-pack-size' (with many other
> > elided options), filling in the '--max-pack-size' value with whatever
> > was provided via the '--max-cruft-size' flag.
> >
> > This causes us to generate a pack that is smaller than the specified
> > threshold. This poses a problem since we will never be able to generate
> > a cruft pack that crosses the threshold.
>
> So far I see absolutely *NO* problem described in the above. The
> user said "I want to chop them into 200MB pieces but do not exceed
> the threshold" and the system honored that wish.
>
> > In effect, this means that we
> > will try and repack its contents over and over again.
>
> The end effect however may be problematic, but isn't it due to the
> way when to repack is determined? You see 199MB piece of cruft pack
> plus some other cruft data. You have generated no new cruft and no
> existing cruft expired out, but you do not know these facts until
> you try to repack. Because 200MB is the limit, you include the
> 199MB one as part of the ones to be recombined into the new cruft
> pack because 199MB is smaller than 200MB and you do not know that
> the reason why it is 199MB is because the earlier repack operation
> found all remaining cruft material to be larger than 1MB; if there
> were a 0.5MB cruft, it may have made it closer to 200MB.
>
> So would it be feasible to remember how 199MB cruft pack is lying in
> the object store (i.e. earlier we packed as much as possible), and
> add a logic that says "if there is nothing to expire out of this
> one, do not attempt to repack---this is fine as-is"?

I had a similar thought when first thinking about multi-cruft packs,
but that line of thinking is somewhat flawed. When we do a pruning GC,
the vast majority of objects should be expired out of the repository,
leaving only the recent ones whose mtimes are newer than the cutoff.
So the majority of packs in this case should all be removed, and the
small amount of cruft data remaining can be repacked into a small
number of packs relatively quickly.

> > Instead, change the meaning of '--max-pack-size' in pack-objects when
> > combined with '--cruft'. When put together, '--max-pack-size' allows the
> > pack to grow larger than the specified threshold, but only by one
> > additional object.
>
> I do not think that would work well. You have no control over the
> size of that one additional object---it may weigh more than 100MB,
> combining your 199MB cruft pack with something else to make it ~300MB
> cruft. In other words, "just a little bit larger" sounds like a
> wishful thinking handwaving.

I think that it is somewhat of a handwave, but I would note that our
current rules around --max-pack-size are not quite as strict as I
originally thought. If you have a single object that is 100MB and your
pack limit is 50MB, then pack-objects will today generate a 100MB pack
containing just that object. So our --max-pack-size rules are already
not quite that strict.

Here is the case that I am worried about: suppose you have a 100MB
cruft limit, and there are two cruft packs in the repository: one that
is 99MB and another that is 1MB in size. Let's suppose further that if
you combine these two packs, the resulting pack would be exactly 100MB
in size.
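To make the splitting rules concrete, here is a small, self-contained C
simulation. This is only a sketch, not Git's actual pack-objects code:
the simulate() helper, the strict/relaxed flag, and the object sizes
are made up to illustrate the current "split before the limit" rule
versus the proposed "allow at most one object past the limit" rule on
the numbers above.

#include <stdio.h>
#include <stddef.h>

/*
 * Toy model of pack splitting (illustration only, not Git's code).
 *
 * strict == 1: start a new pack whenever adding the next object would
 *              reach or exceed the limit; an empty pack still accepts
 *              its first object, however large.
 * strict == 0: only start a new pack once the current one has already
 *              reached the limit, i.e. a pack may cross the limit by
 *              at most one object.
 */
static void simulate(const char *label, const size_t *sizes, size_t n,
                     size_t limit_mb, int strict)
{
    size_t pack = 0, objects = 0, packs = 0;

    printf("%s (limit %zuMB):\n", label, limit_mb);
    for (size_t i = 0; i < n; i++) {
        int split = strict ? (pack + sizes[i] >= limit_mb)
                           : (pack >= limit_mb);
        if (objects && split) {
            printf("  pack #%zu: %zuMB in %zu object(s)\n",
                   ++packs, pack, objects);
            pack = 0;
            objects = 0;
        }
        pack += sizes[i];
        objects++;
    }
    printf("  pack #%zu: %zuMB in %zu object(s)\n", ++packs, pack, objects);
}

int main(void)
{
    /* roughly the case above: a 99MB pack's worth of objects plus 1MB */
    const size_t cruft[] = { 40, 59, 1 };
    /* a single object larger than the limit */
    const size_t big[] = { 100 };

    simulate("strict split (today)", cruft, 3, 100, 1);
    simulate("one object over the limit (proposed)", cruft, 3, 100, 0);
    simulate("strict split, single oversized object", big, 1, 50, 1);
    return 0;
}

Under the strict rule, the 40MB + 59MB + 1MB input comes back as a
99MB pack plus a 1MB pack, which is the do-nothing loop described
below; under the relaxed rule it becomes a single 100MB pack, and a
lone 100MB object still gets its own pack even with a 50MB limit.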
Today, repack will say, "I have two packs whose sizes sum to the value
of --max-cruft-size", and mark them both to be removed (and replaced
with the combined pack generated by pack-objects). But if the combined
pack is exactly 100MB, then pack-objects will break the pack into two
just before the 100MB limit, and we'll end up with the same two packs
we started with.

Ideally we would combine those packs into one that is at most one
object's size larger than the threshold, reaching a steady state where
we avoid repacking it further. But in current Git we will keep
repacking the two together, only to regenerate the same two packs
forever.

So I think a reasonable stop-gap here is to let pack-objects generate
cruft packs with a --max-pack-size that are allowed to grow *just*
beyond the threshold by at most one object. Yes, that one extra object
can itself be large, so it is possible to end up with a pack that is
significantly larger than the threshold. But the point of
--max-pack-size in conjunction with --cruft is not in the original
spirit of --max-pack-size, which was to work around filesystems that
don't do well with large files. Instead, the utility here is to bound
the amount of repacking we have to do when generating cruft packs in
repositories with many unreachable objects.

In other words, if my --max-cruft-size is 1G and I have 20GB of cruft
data, I am less concerned about generating a pack that is 1.1G in size
than I am about repeatedly repacking the same 20GB over and over again
each time I want to add a single unreachable object.

Thanks,
Taylor