Re: [PATCH 06/13] fsck: stop using object_info->type_name strbuf

Patrick Steinhardt <ps@xxxxxx> · Fri, 16 May 2025 11:52:26 +0200

On Fri, May 16, 2025 at 12:49:53AM -0400, Jeff King wrote:
> When fsck-ing a loose object, we use object_info's type_name strbuf to
> record the parsed object type as a string. For most objects this is
> redundant with the object_type enum, but it does let us report the
> string when we encounter an object with an unknown type (for which there
> is no matching enum value).
> 
> There are a few downsides, though:
> 
>   1. The code to report these cases is not actually robust. Since we did
>      not pass a strbuf to unpack_loose_header(), we only retrieved types
>      from headers up to 32 bytes. In longer cases, we'd simply say
>      "object corrupt or missing".
> 
>   2. This is the last caller that uses object_info's type_name strbuf
>      support. It would be nice to refactor it so that we can simplify
>      that code.
> 
>   3. Likewise, we'll check the hash of the object using its unknown type
>      (again, as long as that type is short enough). That depends on the
>      hash_object_file_literally() code, which we'd eventually like to
>      get rid of.

Oh, I'd very much welcome if this code path went away completely.

> So we can simplify things by bailing immediately in read_loose_object()
> when we encounter an unknown type. This has a few user-visible effects:
> 
>   a. Instead of producing a single line of error output like this:
> 
>        error: 26ed13ce3564fbbb44e35bde42c7da717ea004a6: object is of unknown type 'bogus': .git/objects/26/ed13ce3564fbbb44e35bde42c7da717ea004a6
> 
>      we'll now issue two lines (the first from read_loose_object() when
>      we see the unparsable header, and the second from the fsck code,
>      since we couldn't read the object):
> 
>        error: unable to parse type from header 'bogus 4' of .git/objects/26/ed13ce3564fbbb44e35bde42c7da717ea004a6
>        error: 26ed13ce3564fbbb44e35bde42c7da717ea004a6: object corrupt or missing: .git/objects/26/ed13ce3564fbbb44e35bde42c7da717ea004a6
> 
>      This is a little more verbose, but this sort of error should be
>      rare (such objects are almost impossible to work with, and cannot
>      be transferred between repositories as they are not representable
>      in packfiles). And as a bonus, reporting the broken header in full
>      could help with debugging other cases (e.g., a header like "blob
>      xyzzy\0" would fail in parsing the size, but previously we'd not
>      have showed the offending bytes).

Yup, I would claim this is an improvement, as well.

>   b. An object with an unknown type will be reported as corrupt, without
>      actually doing a hash check. Again, I think this is unlikely to
>      matter in practice since such objects are totally unusable.

Agreed.

> We'll update one fsck test to match the new error strings. And we can
> remove another test that covered the case of an object with an unknown
> type _and_ a hash corruption. Since we'll skip the hash check now in
> this case, the test is no longer interesting.
> 
> Signed-off-by: Jeff King <peff@xxxxxxxx>
> ---
>  builtin/fsck.c  | 13 ++-----------
>  object-file.c   | 12 +++++++++---
>  t/t1450-fsck.sh | 29 +++--------------------------
>  3 files changed, 14 insertions(+), 40 deletions(-)
> 
> diff --git a/builtin/fsck.c b/builtin/fsck.c
> index 6cac28356c..e7d96a9c8e 100644
> --- a/builtin/fsck.c
> +++ b/builtin/fsck.c
> @@ -614,12 +614,11 @@ static void get_default_heads(void)
>  struct for_each_loose_cb
>  {
>  	struct progress *progress;
> -	struct strbuf obj_type;
>  };
>  
> -static int fsck_loose(const struct object_id *oid, const char *path, void *data)
> +static int fsck_loose(const struct object_id *oid, const char *path,
> +		      void *data UNUSED)
>  {
> -	struct for_each_loose_cb *cb_data = data;
>  	struct object *obj;
>  	enum object_type type = OBJ_NONE;
>  	unsigned long size;
> @@ -629,8 +628,6 @@ static int fsck_loose(const struct object_id *oid, const char *path, void *data)
>  	struct object_id real_oid = *null_oid(the_hash_algo);
>  	int err = 0;
>  
> -	strbuf_reset(&cb_data->obj_type);
> -	oi.type_name = &cb_data->obj_type;
>  	oi.sizep = &size;
>  	oi.typep = &type;
>  
> @@ -642,10 +639,6 @@ static int fsck_loose(const struct object_id *oid, const char *path, void *data)
>  			err = error(_("%s: object corrupt or missing: %s"),
>  				    oid_to_hex(oid), path);
>  	}
> -	if (type != OBJ_NONE && type < 0)
> -		err = error(_("%s: object is of unknown type '%s': %s"),
> -			    oid_to_hex(&real_oid), cb_data->obj_type.buf,
> -			    path);

This one is a bit curious. But because we know that we have reported
this error in `read_loose_object()` already we don't need to print the
error over here anymore.

> diff --git a/object-file.c b/object-file.c
> index 1127e154f6..7a35bde96e 100644
> --- a/object-file.c
> +++ b/object-file.c
> @@ -1662,6 +1662,12 @@ int read_loose_object(const char *path,
>  		goto out_inflate;
>  	}
>  
> +	if (*oi->typep < 0) {
> +		error(_("unable to parse type from header '%s' of %s"),
> +		      hdr, path);
> +		goto out_inflate;
> +	}
> +
>  	if (*oi->typep == OBJ_BLOB &&
>  	    *size > repo_settings_get_big_file_threshold(the_repository)) {
>  		if (check_stream_oid(&stream, hdr, *size, path, expected_oid) < 0)

So this is where we report the new error now. Makes sense.

Patrick