Re: [PATCH v2 02/13] pack-objects: add --path-walk option

Derrick Stolee <stolee@xxxxxxxxx> · Tue, 6 May 2025 15:39:31 -0400




On 5/2/25 7:21 PM, Taylor Blau wrote:
On Mon, Mar 24, 2025 at 03:22:38PM +0000, Derrick Stolee via GitGitGadget wrote:


+--path-walk::
+	By default, `git pack-objects` walks objects in an order that
+	presents trees and blobs in an order unrelated to the path they
+	appear relative to a commit's root tree. The `--path-walk` option
+	enables a different walking algorithm that organizes trees and
+	blobs by path. This has the potential to improve delta compression
+	especially in the presence of filenames that cause collisions in
+	Git's default name-hash algorithm. Due to changing how the objects
+	are walked, this option is not compatible with `--delta-islands`,
+	`--shallow`, or `--filter`. The `--use-bitmap-index` option will
+	be ignored in the presence of `--path-walk.`
+

This is perhaps a stylistic difference, but I don't think in general it
is necessary to say "by default [...]" to explain some existing
behavior, unless the alternative behavior enabled by the flag is
impossible to understand without contrasting.

In this particular instance, I think that "--path-walk" can be
explained in isolation rather than in contrast to the default behavior.

I would probably write something like:

     --path-walk::
     Walk objects in batches grouped by a common path. This can improve
     delta compression, especially in the case where filenames cause
     collisions in the default name-hash algorithm.

     Incompatible with `--delta-islands`, `--shallow`, and `--filter`. If
     provided, `--use-bitmap-index` will be ignored.

Sure. I'll give this a shot.

+static inline int is_oid_interesting(struct repository *repo,
+				     struct object_id *oid)
+{
+	struct object *o = lookup_object(repo, oid);

This function and its caller tripped me up a little bit while reading,
but I was able to un-confuse myself. Here are some notes that I took
while reading, which perhaps you can validate to tell me whether or not
I'm on the right track.

Here we allow lookup_object() to return NULL, and then immediately
return 0 if it does. But...

+	return o && !(o->flags & UNINTERESTING);
+}
+
+static int add_objects_by_path(const char *path,
+			       struct oid_array *oids,
+			       enum object_type type,
+			       void *data)
+{
+	struct object_entry **delta_list;
+	size_t oe_start = to_pack.nr_objects;
+	size_t oe_end;
+	unsigned int sub_list_size;

Could you help me understand why sub_list_size is an unsigned int and
not a size_t here?

This is based on a similar "list_size" used in ll_find_deltas(), but
since we are starting from scratch here it would be better to use
size_t.

+	unsigned int *processed = data;
+
+	/*
+	 * First, add all objects to the packing data, including the ones
+	 * marked UNINTERESTING (translated to 'exclude') as they can be
+	 * used as delta bases.
+	 */
+	for (size_t i = 0; i < oids->nr; i++) {
+		int exclude;
+		struct object_info oi = OBJECT_INFO_INIT;
+		struct object_id *oid = &oids->oid[i];
+
+		/* Skip objects that do not exist locally. */
+		if (exclude_promisor_objects &&
+		    oid_object_info_extended(the_repository, oid, &oi,
+					     OBJECT_INFO_FOR_PREFETCH) < 0)
+			continue;
+
+		exclude = !is_oid_interesting(the_repository, oid);
+
+		if (exclude && !thin)
+			continue;

...here we only skip calling add_object_entry() if the object is
excluded (which lookup_object() returning NULL would cause above) *and*
we're not using thin packs. That makes sense, since if we have a missing
object, but we're using thin packs, then it's OK to skip.

So I think all of this looks good. However, I find the interface here a
little awkward. The function is called "is_oid_interesting()", but it
determines that fact by checking whether or not the object has the
UNINTERSTING bit set on its flag, and then negating the result. But then
we negate it again here at the caller to obtain whether or not we should
"exclude" the object.

I was wondering if there were other callers of is_oid_interesting() that
don't negate its result. But it doesn't look like there are (at least in
this patch). I wonder if it would be cleaner just to inline this instead
of having the result negated at multiple levels.



+		add_object_entry(oid, type, path, exclude);
+	}
+
+	oe_end = to_pack.nr_objects;
+
+	/* We can skip delta calculations if it is a no-op. */
+	if (oe_end == oe_start || !window)
+		return 0;
+
+	sub_list_size = 0;

I'm nit-picking, but I would imagine that sub_list_nr is probably more
conventional, but this is textbook bike-shedding ;-).

Sure.

+	ALLOC_ARRAY(delta_list, oe_end - oe_start);

ALLOC_ARRAY() will barf if oe_start > oe_end and we wrap around the
range of size_t, but it might be nice to have a sanity check here just
their equality.

I'll look for similar examples.

+	/*
+	 * Find delta bases among this list of objects that all match the same
+	 * path. This causes the delta compression to be interleaved in the
+	 * object walk, which can lead to confusing progress indicators. This is
+	 * also incompatible with threaded delta calculations. In the future,
+	 * consider creating a list of regions in the full to_pack.objects array
+	 * that could be picked up by the threaded delta computation.
+	 */
+	if (sub_list_size && window) {

Do we need to check that window is non-zero here? It looks like that
check already happens above.

If we've already checked, then

+		QSORT(delta_list, sub_list_size, type_size_sort);
+		find_deltas(delta_list, &sub_list_size, window, depth, processed);
+	}
+
+	free(delta_list);

Is there a return path through this function that doesn't allocate
delta_list? I think the answer is "no", but it might be a nice guard
against future refactorings to initialize this field to NULL just in
case.

OK.

+	return 0;
+}
+
+static void get_object_list_path_walk(struct rev_info *revs)
+{
+	struct path_walk_info info = PATH_WALK_INFO_INIT;
+	unsigned int processed = 0;
+
+	info.revs = revs;
+	info.path_fn = add_objects_by_path;
+	info.path_fn_data = &processed;
+	revs->tag_objects = 1;

What is the purpose of setting revs->tag_objects to 1 here? Do we need
to restore it back to its original value before returning? A comment
here would be useful, I think.



@@ -4237,7 +4340,7 @@ static void get_object_list(struct rev_info *revs, int ac, const char **av)

  	warn_on_object_refname_ambiguity = save_warning;

-	if (use_bitmap_index && !get_object_list_from_bitmap(revs))
+	if (use_bitmap_index && !path_walk && !get_object_list_from_bitmap(revs))

I was going to suggest something like:

     if (path_walk && use_bitmap_index) {
         warning(_("cannot use --use-bitmap-index with --path-walk, ignoring"));
         use_bitmap_index = 0;
     }

, which would avoid the need for this check here and insulate us against
having to make similar changes to future conditionals that care about
use_bitmap_index.

But it looks like such a check *does* already exist below, so I am not
sure I understand why we need to check for path_walk here.

This is probably cruft from an earlier version of the check. Thanks.

+	if (path_walk && filter_options.choice) {
+		warning(_("cannot use --filter with --path-walk"));
+		path_walk = 0;
+	}
+	if (path_walk && use_delta_islands) {
+		warning(_("cannot use delta islands with --path-walk"));
+		path_walk = 0;
+	}
+	if (path_walk && shallow) {
+		warning(_("cannot use --shallow with --path-walk"));
+		path_walk = 0;
+	}

Here's a tiny idea I had while reading this code.

     if (path_walk) {
         const char *option = NULL;
         if (filter_options.choice)
             option = "--filter";
         else if (use_delta_islands)
             option = "--delta-islands";
         else if (shallow)
             option = "--shallow";

         if (option) {
             warning(_("cannot use %s with --path-walk"), option);
             path_walk = 0;
         }
     }

This may be too DRY for your taste, and it does have the disadvantage of
reducing the grep-ability of the error messages. I don't have a strong
feeling about it one way or another, but I figured I'd at least write it
down.
I like the reduction of work on the translators.

Thanks,
-Stolee