Re: [PATCH v2 10/13] pack-objects: refactor path-walk delta phase

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Mar 24, 2025 at 03:22:46PM +0000, Derrick Stolee via GitGitGadget wrote:
> The current implementation is not integrated with threads, but could be
> done in a future update.

I think that "in a future update" may be worth replacing with "the
following commit", as the former suggests (to me) that it may be perused
outside of this series.

> Since we do not attempt to sort objects by size until after exploring
> all trees, we can remove the previous change to t5530 due to a different
> error message appearing first.

Makes sense.

> Signed-off-by: Derrick Stolee <stolee@xxxxxxxxx>
> ---
>  builtin/pack-objects.c       | 82 +++++++++++++++++++++++++-----------
>  pack-objects.h               | 12 ++++++
>  t/t5300-pack-object.sh       |  8 +++-
>  t/t5530-upload-pack-error.sh |  6 ---
>  4 files changed, 75 insertions(+), 33 deletions(-)
>
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 0ea85754c52..d4e05ca4434 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -3236,6 +3236,51 @@ static int should_attempt_deltas(struct object_entry *entry)
>  	return 1;
>  }
>
> +static void find_deltas_for_region(struct object_entry *list,
> +				   struct packing_region *region,
> +				   unsigned int *processed)
> +{
> +	struct object_entry **delta_list;
> +	uint32_t delta_list_nr = 0;
> +
> +	ALLOC_ARRAY(delta_list, region->nr);
> +	for (uint32_t i = 0; i < region->nr; i++) {

I know that these values are all unsigned, and very unlikely to get near
the 32-bit maximum, but I do think we should use size_t here (and
likewise when dealing with values that deal with how many entries in a
list we've allocated and indexes into that list).

> +		struct object_entry *entry = list + region->start + i;
> +		if (should_attempt_deltas(entry))
> +			delta_list[delta_list_nr++] = entry;
> +	}
> +
> +	QSORT(delta_list, delta_list_nr, type_size_sort);
> +	find_deltas(delta_list, &delta_list_nr, window, depth, processed);
> +	free(delta_list);
> +}

The rest of this function (modulo the inline comment removal, which I
wrote about below) appear to be a straightforward copy of the previous
home of this function.

> +static void find_deltas_by_region(struct object_entry *list,
> +				  struct packing_region *regions,
> +				  uint32_t start, uint32_t nr)

I imagine that "start" here is setup for the following commit which will
parallelize this task across multiple threads, with each thread starting
at a different position.

I wonder if (as an alternative) we could get away with passing in a
(struct packing_region *, size_t) pair and drop "start". I think this
can work if you pass in the result of adding whatever the value of
"start" would be to to_pack.regions here.

That somewhat hides the fact that this code is meant to be run across
multiple threads, but in a way that I think is worth doing. It lets the
function avoid having to do things like:

  for (size_t i = start; i < start + nr; i++)

and instead do the simpler:

  for (size_t i = 0; i < nr; i++)

since the caller already adjusted the regions pointer for us. As a
side-effect, it also means that the call below in prepare_pack() can
avoid passing a literal zero.

> +{
> +	unsigned int processed = 0;
> +	uint32_t progress_nr;

This uses a mix of uint32_t and unsigned int types. Should these all be
the same (and/or size_t's)?

> -	/*
> -	 * Find delta bases among this list of objects that all match the same
> -	 * path. This causes the delta compression to be interleaved in the
> -	 * object walk, which can lead to confusing progress indicators. This is
> -	 * also incompatible with threaded delta calculations. In the future,
> -	 * consider creating a list of regions in the full to_pack.objects array
> -	 * that could be picked up by the threaded delta computation.
> -	 */

Nice, it is very satisfying to see this comment go away ;-).

> -	if (sub_list_size && window) {
> -		QSORT(delta_list, sub_list_size, type_size_sort);
> -		find_deltas(delta_list, &sub_list_size, window, depth, processed);
> -	}
> +	*processed += oids->nr;
> +	display_progress(progress_state, *processed);
>
> -	free(delta_list);
>  	return 0;
>  }
>
> diff --git a/pack-objects.h b/pack-objects.h
> index d73e3843c92..7ba9f3448fe 100644
> --- a/pack-objects.h
> +++ b/pack-objects.h
> @@ -119,11 +119,23 @@ struct object_entry {
>  	unsigned ext_base:1; /* delta_idx points outside packlist */
>  };
>
> +/**

This is an extreme nitpick, but I think that "/**" (as opposed to "/*")
is preferred, as the former is used for Doxygen-style comments. I feel
like I have seen Junio comment on this before, but searching 'f:Junio
"/**"' yields a measly 83,000+ results, so I am unlikely to find it ;-).

(Not worth rerolling on its own, but if you are rerolling anyway, which
is what I gathered from some of your earlier replies, it may be worth
picking up along the way.)

> + * A packing region is a section of the packing_data.objects array
> + * as given by a starting index and a number of elements.
> + */
> +struct packing_region {
> +	uint32_t start;
> +	uint32_t nr;
> +};

Same note here about the uint32_t versus size_t.
> +
>  struct packing_data {
>  	struct repository *repo;
>  	struct object_entry *objects;
>  	uint32_t nr_objects, nr_alloc;
>
> +	struct packing_region *regions;
> +	uint64_t nr_regions, nr_regions_alloc;

Ditto, but this one jumps to uint64_t. What differentiates these two?

Thanks,
Taylor




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux