Re: [PATCH v4 2/2] pack-bitmap: add load corrupt bitmap test

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



2025年5月30日 05:20,Taylor Blau <me@xxxxxxxxxxxx> 写道:
> 
> On Sun, May 25, 2025 at 02:43:03AM +0000, Lidong Yan via GitGitGadget wrote:
>> diff --git a/pack-bitmap.c b/pack-bitmap.c
>> index fd19c2255163..39c1c1bc4ce1 100644
>> --- a/pack-bitmap.c
>> +++ b/pack-bitmap.c
>> @@ -34,6 +34,11 @@ struct stored_bitmap {
>> int flags;
>> };
>> 
>> +struct stored_bitmap_tag_pos {
>> + struct stored_bitmap stored;
>> + size_t map_pos;
>> +};
>> +
> 
> Hmm. I was expecting you to add a new member to the stored_bitmap
> structure, not a new structure entirely. Let's read on...
> 
>> /*
>>  * The active bitmap index for a repository. By design, repositories only have
>>  * a single bitmap index available (the index for the biggest packfile in
>> @@ -148,6 +153,7 @@ static int existing_bitmaps_hits_nr;
>> static int existing_bitmaps_misses_nr;
>> static int roots_with_bitmaps_nr;
>> static int roots_without_bitmaps_nr;
>> +static int tag_pos_on_bitmap;
> 
> Why are we only sometimes tagging bitmaps with their position?
> 
>> 
>> static struct ewah_bitmap *lookup_stored_bitmap(struct stored_bitmap *st)
>> {
>> @@ -314,13 +320,18 @@ static struct stored_bitmap *store_bitmap(struct bitmap_index *index,
>>  struct ewah_bitmap *root,
>>  const struct object_id *oid,
>>  struct stored_bitmap *xor_with,
>> -  int flags)
>> +  int flags, size_t map_pos)
>> {
>> struct stored_bitmap *stored;
>> + struct stored_bitmap_tag_pos *tagged;
> 
> OK.
> 
>> khiter_t hash_pos;
>> int ret;
>> 
>> - stored = xmalloc(sizeof(struct stored_bitmap));
>> + tagged = xmalloc(tag_pos_on_bitmap ? sizeof(struct stored_bitmap_tag_pos) :
>> +     sizeof(struct stored_bitmap));
>> + stored = &tagged->stored;
>> + if (tag_pos_on_bitmap)
>> + tagged->map_pos = map_pos;
> 
> I am quite worried about this portion of the diff.
> 
> Here you allocate memory for "tagged" which is a stored_bitmap_tag_pos.
> But the amount of bytes you allocate depends on whether the global
> variable tag_pos_on_bitmap is set or not. If it isn't, then you don't
> allocate enough memory here to hold an entire stored_bitmap_tag_pos
> structure.
> 
> I think within this function you're OK, since you only write into that
> field when tag_pos_on_bitmap is set. But this seems like a recipe for
> disaster if you ever try to read or write into the tagged->map_pos field
> when tag_pos_on_bitmap *isn't* set.
> 
> This happens to work because of where the pointer to the stored_bitmap
> structure lives within the stored_bitmap_tag_pos structure. But this
> seems *extremely* fragile to only save 4 bytes of allocated memory per
> bitmap. Even on a repository with ~1,000 bitmaps (which is rare from my
> experience), you're only saving ~3.91 KiB.

Interesting calculation indeed — I hadn't thought about the actual memory
savings that way. I agree it’s not worth the added fragility just to save ~4 KiB
in such rare cases. I’ll go ahead and remove the struct stored_bitmap_tagged_pos.

> 
> I would expect something more like the following (based on top of your
> patch here):
> 
> --- 8< ---
> diff --git a/pack-bitmap.c b/pack-bitmap.c
> index 39c1c1bc4c..4c3829dba9 100644
> --- a/pack-bitmap.c
> +++ b/pack-bitmap.c
> @@ -31,12 +31,8 @@ struct stored_bitmap {
> struct object_id oid;
> struct ewah_bitmap *root;
> struct stored_bitmap *xor;
> - int flags;
> -};
> -
> -struct stored_bitmap_tag_pos {
> - struct stored_bitmap stored;
> size_t map_pos;
> + int flags;
> };
> 
> /*
> @@ -153,7 +149,6 @@ static int existing_bitmaps_hits_nr;
> static int existing_bitmaps_misses_nr;
> static int roots_with_bitmaps_nr;
> static int roots_without_bitmaps_nr;
> -static int tag_pos_on_bitmap;
> 
> static struct ewah_bitmap *lookup_stored_bitmap(struct stored_bitmap *st)
> {
> @@ -323,17 +318,13 @@ static struct stored_bitmap *store_bitmap(struct bitmap_index *index,
>  int flags, size_t map_pos)
> {
> struct stored_bitmap *stored;
> - struct stored_bitmap_tag_pos *tagged;
> khiter_t hash_pos;
> int ret;
> 
> - tagged = xmalloc(tag_pos_on_bitmap ? sizeof(struct stored_bitmap_tag_pos) :
> -     sizeof(struct stored_bitmap));
> - stored = &tagged->stored;
> - if (tag_pos_on_bitmap)
> - tagged->map_pos = map_pos;
> + stored = xmalloc(sizeof(struct stored_bitmap));
> stored->root = root;
> stored->xor = xor_with;
> + stored->map_pos = map_pos;
> stored->flags = flags;
> oidcpy(&stored->oid, oid);
> 
> @@ -2878,12 +2869,11 @@ int test_bitmap_commits(struct repository *r)
> int test_bitmap_commits_offset(struct repository *r)
> {
> struct object_id oid;
> - struct stored_bitmap_tag_pos *tagged;
> + struct stored_bitmap *bitmap;
> struct bitmap_index *bitmap_git;
> size_t commit_idx_pos_map_pos, xor_offset_map_pos, flag_map_pos,
> ewah_bitmap_map_pos;
> 
> - tag_pos_on_bitmap = 1;
> bitmap_git = prepare_bitmap_git(r);
> if (!bitmap_git)
> die(_("failed to load bitmap indexes"));
> @@ -2897,9 +2887,9 @@ int test_bitmap_commits_offset(struct repository *r)
> die(_("failed to load bitmap indexes"));
> }
> 
> - kh_foreach (bitmap_git->bitmaps, oid, tagged, {
> - commit_idx_pos_map_pos = tagged->map_pos;
> - xor_offset_map_pos = tagged->map_pos + sizeof(uint32_t);
> + kh_foreach (bitmap_git->bitmaps, oid, bitmap, {
> + commit_idx_pos_map_pos = bitmap->map_pos;
> + xor_offset_map_pos = bitmap->map_pos + sizeof(uint32_t);
> flag_map_pos = xor_offset_map_pos + sizeof(uint8_t);
> ewah_bitmap_map_pos = flag_map_pos + sizeof(uint8_t);
> --- >8 ---
> 
>> stored->root = root;
>> stored->xor = xor_with;
>> stored->flags = flags;
>> @@ -376,10 +387,12 @@ static int load_bitmap_entries_v1(struct bitmap_index *index)
>> struct stored_bitmap *xor_bitmap = NULL;
>> uint32_t commit_idx_pos;
>> struct object_id oid;
>> + size_t entry_map_pos;
>> 
>> if (index->map_size - index->map_pos < 6)
>> return error(_("corrupt ewah bitmap: truncated header for entry %d"), i);
>> 
>> + entry_map_pos = index->map_pos;
> 
> Good. This is important since the read_be32() and read_u8() calls below
> both adjust the value of index->map_pos past the beginning of the bitmap.
> 
>> @@ -869,6 +883,7 @@ static struct stored_bitmap *lazy_bitmap_for_commit(struct bitmap_index *bitmap_
>> int xor_flags;
>> khiter_t hash_pos;
>> struct bitmap_lookup_table_xor_item *xor_item;
>> + size_t entry_map_pos;
>> 
>> if (is_corrupt)
>> return NULL;
>> @@ -928,6 +943,7 @@ static struct stored_bitmap *lazy_bitmap_for_commit(struct bitmap_index *bitmap_
>> goto corrupt;
>> }
>> 
>> + entry_map_pos = bitmap_git->map_pos;
> 
> Same here.
> 
>> @@ -969,6 +986,7 @@ static struct stored_bitmap *lazy_bitmap_for_commit(struct bitmap_index *bitmap_
>> * Instead, we can skip ahead and immediately read the flags and
>> * ewah bitmap.
>> */
>> + entry_map_pos = bitmap_git->map_pos;
> 
> And here.
> 
>> +int test_bitmap_commits_offset(struct repository *r)
>> +{
>> + struct object_id oid;
>> + struct stored_bitmap_tag_pos *tagged;
>> + struct bitmap_index *bitmap_git;
>> + size_t commit_idx_pos_map_pos, xor_offset_map_pos, flag_map_pos,
>> + ewah_bitmap_map_pos;
>> +
>> + tag_pos_on_bitmap = 1;
>> + bitmap_git = prepare_bitmap_git(r);
>> + if (!bitmap_git)
>> + die(_("failed to load bitmap indexes"));
>> +
> 
> If we either forgot to set this variable here or did so after calling
> prepare_bitmap_git(), then we wouldn't allocate enough memory to store
> the map_pos field in the stored_bitmap_tag_pos structure. When we then
> would try and read that field below, we'd read garbage heap data outside
> of our structure.
> 
>> + /*
>> + * As this function is only used to print bitmap selected
>> + * commits, we don't have to read the commit table.
>> + */
>> + if (bitmap_git->table_lookup) {
>> + if (load_bitmap_entries_v1(bitmap_git) < 0)
>> + die(_("failed to load bitmap indexes"));
>> + }
> 
> This comment suggests that we can avoid reading the commit table
> altogether. Indeed, calling load_bitmap_entries_v1() here does that,
> since it is not called when loading a bitmap that has a lookup table.
> 
> So I think the behavior here is correct, but the comment is misleading.
> I suspect that the confusion would be resolved by instead writing:
> 
>    /*
>     * Since this function needs to know the position of each individual
>     * bitmap, bypass the commit lookup table (if one exists) by forcing
>     * the bitmap to eagerly load its entries.
>     */
> 
> I think this is copy-and-paste from 28cd730680 (pack-bitmap: prepare to
> read lookup table extension, 2022-08-14) via the 'test_bitmap_commits()'
> function immediately above this one. I think both would benefit from
> some clean-up, since this comment is equally misleading in that
> function.
> 
> For your purposes, I would either:
> 
> - remove or (preferably) reword the comment in your new function,
>   leaving the one in test_bitmap_commits() as-is, or
> 
> - reword the comment in test_bitmap_commits() to be more like the one
>   above, via a preparatory commit, and then introduce the new function
>   using the same wording.
> 
> Between the two, I think the latter is preferable.

I will add a new commit reword the comment before the last addt-test-case commit.

> 
> As an aside, I think that for bitmaps that do have a commit lookup
> table, you could go slightly faster here by walking over that portion of
> the *.bitmap file, since it directly encodes the information you're
> interested in here. But I would avoid doing that, since it too seems
> brittle and I would like to avoid having two separate spots that each
> implement reading the commit table format.
> 
>> + kh_foreach (bitmap_git->bitmaps, oid, tagged, {
>> + commit_idx_pos_map_pos = tagged->map_pos;
> 
> OK, and here's where we pull out the actual position of the selected
> commit's bitmap.
> 
>> + xor_offset_map_pos = tagged->map_pos + sizeof(uint32_t);
>> + flag_map_pos = xor_offset_map_pos + sizeof(uint8_t);
>> + ewah_bitmap_map_pos = flag_map_pos + sizeof(uint8_t);
>> +
>> + printf_ln("%s %"PRIuMAX" %"PRIuMAX" %"PRIuMAX" %"PRIuMAX,
>> +  oid_to_hex(&oid),
>> +  (uintmax_t)commit_idx_pos_map_pos,
>> +  (uintmax_t)xor_offset_map_pos,
>> +  (uintmax_t)flag_map_pos,
>> +  (uintmax_t)ewah_bitmap_map_pos);
> 
> Hmm. We print more information here than just the map_pos. This is
> brittle if the on-disk format changes (e.g., to store the XOR offsets in
> some other part of the bitmap). But hopefully future updates to the
> bitmap format will come with updates to this function as well ;-).
> 
> It feels somewhat unsatisfying to print output like:
> 
>    $COMMIT_OID <map_pos> <map_pos+4> <map_pos+5> <map_pos+6>
> 
> , but I think it makes sense here for a couple of reasons:
> 
> - If we just print the <map_pos>, then the test is responsible for
>   knowing the distance between that and the XOR offset, which extends
>   the brittleness to the test code
> 
> - likewise, if we print out just the map_pos and the positions of the
>   XOR offset, it feels strange to omit the others.
> 
>> diff --git a/pack-bitmap.h b/pack-bitmap.h
>> index 382d39499af2..96880ba3d72d 100644
>> --- a/pack-bitmap.h
>> +++ b/pack-bitmap.h
>> @@ -81,6 +81,7 @@ void traverse_bitmap_commit_list(struct bitmap_index *,
>> show_reachable_fn show_reachable);
>> void test_bitmap_walk(struct rev_info *revs);
>> int test_bitmap_commits(struct repository *r);
>> +int test_bitmap_commits_offset(struct repository *r);
>> int test_bitmap_hashes(struct repository *r);
>> int test_bitmap_pseudo_merges(struct repository *r);
>> int test_bitmap_pseudo_merge_commits(struct repository *r, uint32_t n);
>> diff --git a/t/helper/test-bitmap.c b/t/helper/test-bitmap.c
>> index 3f23f2107268..65a1ab29192b 100644
>> --- a/t/helper/test-bitmap.c
>> +++ b/t/helper/test-bitmap.c
>> @@ -10,6 +10,11 @@ static int bitmap_list_commits(void)
>> return test_bitmap_commits(the_repository);
>> }
>> 
>> +static int bitmap_list_commits_offset(void)
>> +{
>> + return test_bitmap_commits_offset(the_repository);
>> +}
>> +
>> static int bitmap_dump_hashes(void)
>> {
>> return test_bitmap_hashes(the_repository);
>> @@ -36,6 +41,8 @@ int cmd__bitmap(int argc, const char **argv)
>> 
>> if (argc == 2 && !strcmp(argv[1], "list-commits"))
>> return bitmap_list_commits();
>> + if (argc == 2 && !strcmp(argv[1], "list-commits-offset"))
>> + return bitmap_list_commits_offset();
> 
> All of the scaffolding here looks good.
> 
> This new mode reads a little awkwardly to me (but may not to others, in
> which case I am happy to back away from the following suggestion). Can
> we either call this 'list-commits-with-offset' or 'list-commit-offsets'?
> I have a vague preference towards the former since the new mode has a
> string prefix matching the existing mode.
> 
>> + test_expect_success 'load corrupt bitmap' '
>> + rm -fr repo &&
>> + git init repo &&
>> + test_when_finished "rm -fr repo" &&
>> + (
>> + cd repo &&
>> + git config pack.writeBitmapLookupTable '"$writeLookupTable"' &&
>> +
>> + test_commit base &&
>> +
>> + git repack -adb &&
>> + bitmap="$(ls .git/objects/pack/pack-*.bitmap)" &&
>> + chmod +w $bitmap &&
>> +
>> + read oid commit_off xor_off flag_off ewah_off <<-EOF &&
>> + $(test-tool bitmap list-commits-offset | head -n 1)
> 
> We avoid putting 'git' or 'test-tool' on the left-hand side of a pipe,
> since we want to avoid squelching any errors / segfaults from our code.
> 
> How about (assuming the rename above):
> 
>    test-tool bitmap list-commit-offsets >offsets &&
>    xor_off="$(head -n1 offsets | awk '{print $3}')" &&
>    ...
> 
> ?
> 
>> + printf '\161' |
>> + dd of=$bitmap count=1 bs=1 conv=notrunc seek=$xor_off &&
>> +
>> +
>> + git rev-list --count HEAD > expect &&
>> + git rev-list --use-bitmap-index --count HEAD > actual &&
> 
> Using --count can mask failures in the bitmap code since it only guards
> you against getting the wrong number of objects, but doesn't guard you
> against getting the right amount of objects in a permuted order. I think
> we'd want just 'git rev-list --objects' here, but note that you have to
> use '--no-object-names' on the non-bitmap side, since rev-list does not
> print out paths when using bitmaps[^1].
> 
> How about:
> 
>    git rev-list --objects --no-object-names HEAD >expect.raw &&
>    git rev-list --objects --use-bitmap-index --no-object-names HEAD \
>> actual.raw &&
> 
>    sort expect.raw >expect &&
>    sort actual.raw >actual &&
> 
>    test_cmp expect actual
> 
> (Note that you also have to sort the output here since rev-list does not
> output objects in a meaningful order when using bitmaps.)
> 
> Thanks,
> Taylor
> 
> [^1]: Not that it matters here since we don't expect to load the bitmap
>  anyway, but it's worth doing regardless.
> 

Good advice, Thanks
Lidong




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux