On Mon, Mar 24, 2025 at 03:22:46PM +0000, Derrick Stolee via GitGitGadget wrote: > The current implementation is not integrated with threads, but could be > done in a future update. I think that "in a future update" may be worth replacing with "the following commit", as the former suggests (to me) that it may be perused outside of this series. > Since we do not attempt to sort objects by size until after exploring > all trees, we can remove the previous change to t5530 due to a different > error message appearing first. Makes sense. > Signed-off-by: Derrick Stolee <stolee@xxxxxxxxx> > --- > builtin/pack-objects.c | 82 +++++++++++++++++++++++++----------- > pack-objects.h | 12 ++++++ > t/t5300-pack-object.sh | 8 +++- > t/t5530-upload-pack-error.sh | 6 --- > 4 files changed, 75 insertions(+), 33 deletions(-) > > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c > index 0ea85754c52..d4e05ca4434 100644 > --- a/builtin/pack-objects.c > +++ b/builtin/pack-objects.c > @@ -3236,6 +3236,51 @@ static int should_attempt_deltas(struct object_entry *entry) > return 1; > } > > +static void find_deltas_for_region(struct object_entry *list, > + struct packing_region *region, > + unsigned int *processed) > +{ > + struct object_entry **delta_list; > + uint32_t delta_list_nr = 0; > + > + ALLOC_ARRAY(delta_list, region->nr); > + for (uint32_t i = 0; i < region->nr; i++) { I know that these values are all unsigned, and very unlikely to get near the 32-bit maximum, but I do think we should use size_t here (and likewise when dealing with values that deal with how many entries in a list we've allocated and indexes into that list). > + struct object_entry *entry = list + region->start + i; > + if (should_attempt_deltas(entry)) > + delta_list[delta_list_nr++] = entry; > + } > + > + QSORT(delta_list, delta_list_nr, type_size_sort); > + find_deltas(delta_list, &delta_list_nr, window, depth, processed); > + free(delta_list); > +} The rest of this function (modulo the inline comment removal, which I wrote about below) appear to be a straightforward copy of the previous home of this function. > +static void find_deltas_by_region(struct object_entry *list, > + struct packing_region *regions, > + uint32_t start, uint32_t nr) I imagine that "start" here is setup for the following commit which will parallelize this task across multiple threads, with each thread starting at a different position. I wonder if (as an alternative) we could get away with passing in a (struct packing_region *, size_t) pair and drop "start". I think this can work if you pass in the result of adding whatever the value of "start" would be to to_pack.regions here. That somewhat hides the fact that this code is meant to be run across multiple threads, but in a way that I think is worth doing. It lets the function avoid having to do things like: for (size_t i = start; i < start + nr; i++) and instead do the simpler: for (size_t i = 0; i < nr; i++) since the caller already adjusted the regions pointer for us. As a side-effect, it also means that the call below in prepare_pack() can avoid passing a literal zero. > +{ > + unsigned int processed = 0; > + uint32_t progress_nr; This uses a mix of uint32_t and unsigned int types. Should these all be the same (and/or size_t's)? > - /* > - * Find delta bases among this list of objects that all match the same > - * path. This causes the delta compression to be interleaved in the > - * object walk, which can lead to confusing progress indicators. This is > - * also incompatible with threaded delta calculations. In the future, > - * consider creating a list of regions in the full to_pack.objects array > - * that could be picked up by the threaded delta computation. > - */ Nice, it is very satisfying to see this comment go away ;-). > - if (sub_list_size && window) { > - QSORT(delta_list, sub_list_size, type_size_sort); > - find_deltas(delta_list, &sub_list_size, window, depth, processed); > - } > + *processed += oids->nr; > + display_progress(progress_state, *processed); > > - free(delta_list); > return 0; > } > > diff --git a/pack-objects.h b/pack-objects.h > index d73e3843c92..7ba9f3448fe 100644 > --- a/pack-objects.h > +++ b/pack-objects.h > @@ -119,11 +119,23 @@ struct object_entry { > unsigned ext_base:1; /* delta_idx points outside packlist */ > }; > > +/** This is an extreme nitpick, but I think that "/**" (as opposed to "/*") is preferred, as the former is used for Doxygen-style comments. I feel like I have seen Junio comment on this before, but searching 'f:Junio "/**"' yields a measly 83,000+ results, so I am unlikely to find it ;-). (Not worth rerolling on its own, but if you are rerolling anyway, which is what I gathered from some of your earlier replies, it may be worth picking up along the way.) > + * A packing region is a section of the packing_data.objects array > + * as given by a starting index and a number of elements. > + */ > +struct packing_region { > + uint32_t start; > + uint32_t nr; > +}; Same note here about the uint32_t versus size_t. > + > struct packing_data { > struct repository *repo; > struct object_entry *objects; > uint32_t nr_objects, nr_alloc; > > + struct packing_region *regions; > + uint64_t nr_regions, nr_regions_alloc; Ditto, but this one jumps to uint64_t. What differentiates these two? Thanks, Taylor