2025年5月30日 05:20,Taylor Blau <me@xxxxxxxxxxxx> 写道: > > On Sun, May 25, 2025 at 02:43:03AM +0000, Lidong Yan via GitGitGadget wrote: >> diff --git a/pack-bitmap.c b/pack-bitmap.c >> index fd19c2255163..39c1c1bc4ce1 100644 >> --- a/pack-bitmap.c >> +++ b/pack-bitmap.c >> @@ -34,6 +34,11 @@ struct stored_bitmap { >> int flags; >> }; >> >> +struct stored_bitmap_tag_pos { >> + struct stored_bitmap stored; >> + size_t map_pos; >> +}; >> + > > Hmm. I was expecting you to add a new member to the stored_bitmap > structure, not a new structure entirely. Let's read on... > >> /* >> * The active bitmap index for a repository. By design, repositories only have >> * a single bitmap index available (the index for the biggest packfile in >> @@ -148,6 +153,7 @@ static int existing_bitmaps_hits_nr; >> static int existing_bitmaps_misses_nr; >> static int roots_with_bitmaps_nr; >> static int roots_without_bitmaps_nr; >> +static int tag_pos_on_bitmap; > > Why are we only sometimes tagging bitmaps with their position? > >> >> static struct ewah_bitmap *lookup_stored_bitmap(struct stored_bitmap *st) >> { >> @@ -314,13 +320,18 @@ static struct stored_bitmap *store_bitmap(struct bitmap_index *index, >> struct ewah_bitmap *root, >> const struct object_id *oid, >> struct stored_bitmap *xor_with, >> - int flags) >> + int flags, size_t map_pos) >> { >> struct stored_bitmap *stored; >> + struct stored_bitmap_tag_pos *tagged; > > OK. > >> khiter_t hash_pos; >> int ret; >> >> - stored = xmalloc(sizeof(struct stored_bitmap)); >> + tagged = xmalloc(tag_pos_on_bitmap ? sizeof(struct stored_bitmap_tag_pos) : >> + sizeof(struct stored_bitmap)); >> + stored = &tagged->stored; >> + if (tag_pos_on_bitmap) >> + tagged->map_pos = map_pos; > > I am quite worried about this portion of the diff. > > Here you allocate memory for "tagged" which is a stored_bitmap_tag_pos. > But the amount of bytes you allocate depends on whether the global > variable tag_pos_on_bitmap is set or not. If it isn't, then you don't > allocate enough memory here to hold an entire stored_bitmap_tag_pos > structure. > > I think within this function you're OK, since you only write into that > field when tag_pos_on_bitmap is set. But this seems like a recipe for > disaster if you ever try to read or write into the tagged->map_pos field > when tag_pos_on_bitmap *isn't* set. > > This happens to work because of where the pointer to the stored_bitmap > structure lives within the stored_bitmap_tag_pos structure. But this > seems *extremely* fragile to only save 4 bytes of allocated memory per > bitmap. Even on a repository with ~1,000 bitmaps (which is rare from my > experience), you're only saving ~3.91 KiB. Interesting calculation indeed — I hadn't thought about the actual memory savings that way. I agree it’s not worth the added fragility just to save ~4 KiB in such rare cases. I’ll go ahead and remove the struct stored_bitmap_tagged_pos. > > I would expect something more like the following (based on top of your > patch here): > > --- 8< --- > diff --git a/pack-bitmap.c b/pack-bitmap.c > index 39c1c1bc4c..4c3829dba9 100644 > --- a/pack-bitmap.c > +++ b/pack-bitmap.c > @@ -31,12 +31,8 @@ struct stored_bitmap { > struct object_id oid; > struct ewah_bitmap *root; > struct stored_bitmap *xor; > - int flags; > -}; > - > -struct stored_bitmap_tag_pos { > - struct stored_bitmap stored; > size_t map_pos; > + int flags; > }; > > /* > @@ -153,7 +149,6 @@ static int existing_bitmaps_hits_nr; > static int existing_bitmaps_misses_nr; > static int roots_with_bitmaps_nr; > static int roots_without_bitmaps_nr; > -static int tag_pos_on_bitmap; > > static struct ewah_bitmap *lookup_stored_bitmap(struct stored_bitmap *st) > { > @@ -323,17 +318,13 @@ static struct stored_bitmap *store_bitmap(struct bitmap_index *index, > int flags, size_t map_pos) > { > struct stored_bitmap *stored; > - struct stored_bitmap_tag_pos *tagged; > khiter_t hash_pos; > int ret; > > - tagged = xmalloc(tag_pos_on_bitmap ? sizeof(struct stored_bitmap_tag_pos) : > - sizeof(struct stored_bitmap)); > - stored = &tagged->stored; > - if (tag_pos_on_bitmap) > - tagged->map_pos = map_pos; > + stored = xmalloc(sizeof(struct stored_bitmap)); > stored->root = root; > stored->xor = xor_with; > + stored->map_pos = map_pos; > stored->flags = flags; > oidcpy(&stored->oid, oid); > > @@ -2878,12 +2869,11 @@ int test_bitmap_commits(struct repository *r) > int test_bitmap_commits_offset(struct repository *r) > { > struct object_id oid; > - struct stored_bitmap_tag_pos *tagged; > + struct stored_bitmap *bitmap; > struct bitmap_index *bitmap_git; > size_t commit_idx_pos_map_pos, xor_offset_map_pos, flag_map_pos, > ewah_bitmap_map_pos; > > - tag_pos_on_bitmap = 1; > bitmap_git = prepare_bitmap_git(r); > if (!bitmap_git) > die(_("failed to load bitmap indexes")); > @@ -2897,9 +2887,9 @@ int test_bitmap_commits_offset(struct repository *r) > die(_("failed to load bitmap indexes")); > } > > - kh_foreach (bitmap_git->bitmaps, oid, tagged, { > - commit_idx_pos_map_pos = tagged->map_pos; > - xor_offset_map_pos = tagged->map_pos + sizeof(uint32_t); > + kh_foreach (bitmap_git->bitmaps, oid, bitmap, { > + commit_idx_pos_map_pos = bitmap->map_pos; > + xor_offset_map_pos = bitmap->map_pos + sizeof(uint32_t); > flag_map_pos = xor_offset_map_pos + sizeof(uint8_t); > ewah_bitmap_map_pos = flag_map_pos + sizeof(uint8_t); > --- >8 --- > >> stored->root = root; >> stored->xor = xor_with; >> stored->flags = flags; >> @@ -376,10 +387,12 @@ static int load_bitmap_entries_v1(struct bitmap_index *index) >> struct stored_bitmap *xor_bitmap = NULL; >> uint32_t commit_idx_pos; >> struct object_id oid; >> + size_t entry_map_pos; >> >> if (index->map_size - index->map_pos < 6) >> return error(_("corrupt ewah bitmap: truncated header for entry %d"), i); >> >> + entry_map_pos = index->map_pos; > > Good. This is important since the read_be32() and read_u8() calls below > both adjust the value of index->map_pos past the beginning of the bitmap. > >> @@ -869,6 +883,7 @@ static struct stored_bitmap *lazy_bitmap_for_commit(struct bitmap_index *bitmap_ >> int xor_flags; >> khiter_t hash_pos; >> struct bitmap_lookup_table_xor_item *xor_item; >> + size_t entry_map_pos; >> >> if (is_corrupt) >> return NULL; >> @@ -928,6 +943,7 @@ static struct stored_bitmap *lazy_bitmap_for_commit(struct bitmap_index *bitmap_ >> goto corrupt; >> } >> >> + entry_map_pos = bitmap_git->map_pos; > > Same here. > >> @@ -969,6 +986,7 @@ static struct stored_bitmap *lazy_bitmap_for_commit(struct bitmap_index *bitmap_ >> * Instead, we can skip ahead and immediately read the flags and >> * ewah bitmap. >> */ >> + entry_map_pos = bitmap_git->map_pos; > > And here. > >> +int test_bitmap_commits_offset(struct repository *r) >> +{ >> + struct object_id oid; >> + struct stored_bitmap_tag_pos *tagged; >> + struct bitmap_index *bitmap_git; >> + size_t commit_idx_pos_map_pos, xor_offset_map_pos, flag_map_pos, >> + ewah_bitmap_map_pos; >> + >> + tag_pos_on_bitmap = 1; >> + bitmap_git = prepare_bitmap_git(r); >> + if (!bitmap_git) >> + die(_("failed to load bitmap indexes")); >> + > > If we either forgot to set this variable here or did so after calling > prepare_bitmap_git(), then we wouldn't allocate enough memory to store > the map_pos field in the stored_bitmap_tag_pos structure. When we then > would try and read that field below, we'd read garbage heap data outside > of our structure. > >> + /* >> + * As this function is only used to print bitmap selected >> + * commits, we don't have to read the commit table. >> + */ >> + if (bitmap_git->table_lookup) { >> + if (load_bitmap_entries_v1(bitmap_git) < 0) >> + die(_("failed to load bitmap indexes")); >> + } > > This comment suggests that we can avoid reading the commit table > altogether. Indeed, calling load_bitmap_entries_v1() here does that, > since it is not called when loading a bitmap that has a lookup table. > > So I think the behavior here is correct, but the comment is misleading. > I suspect that the confusion would be resolved by instead writing: > > /* > * Since this function needs to know the position of each individual > * bitmap, bypass the commit lookup table (if one exists) by forcing > * the bitmap to eagerly load its entries. > */ > > I think this is copy-and-paste from 28cd730680 (pack-bitmap: prepare to > read lookup table extension, 2022-08-14) via the 'test_bitmap_commits()' > function immediately above this one. I think both would benefit from > some clean-up, since this comment is equally misleading in that > function. > > For your purposes, I would either: > > - remove or (preferably) reword the comment in your new function, > leaving the one in test_bitmap_commits() as-is, or > > - reword the comment in test_bitmap_commits() to be more like the one > above, via a preparatory commit, and then introduce the new function > using the same wording. > > Between the two, I think the latter is preferable. I will add a new commit reword the comment before the last addt-test-case commit. > > As an aside, I think that for bitmaps that do have a commit lookup > table, you could go slightly faster here by walking over that portion of > the *.bitmap file, since it directly encodes the information you're > interested in here. But I would avoid doing that, since it too seems > brittle and I would like to avoid having two separate spots that each > implement reading the commit table format. > >> + kh_foreach (bitmap_git->bitmaps, oid, tagged, { >> + commit_idx_pos_map_pos = tagged->map_pos; > > OK, and here's where we pull out the actual position of the selected > commit's bitmap. > >> + xor_offset_map_pos = tagged->map_pos + sizeof(uint32_t); >> + flag_map_pos = xor_offset_map_pos + sizeof(uint8_t); >> + ewah_bitmap_map_pos = flag_map_pos + sizeof(uint8_t); >> + >> + printf_ln("%s %"PRIuMAX" %"PRIuMAX" %"PRIuMAX" %"PRIuMAX, >> + oid_to_hex(&oid), >> + (uintmax_t)commit_idx_pos_map_pos, >> + (uintmax_t)xor_offset_map_pos, >> + (uintmax_t)flag_map_pos, >> + (uintmax_t)ewah_bitmap_map_pos); > > Hmm. We print more information here than just the map_pos. This is > brittle if the on-disk format changes (e.g., to store the XOR offsets in > some other part of the bitmap). But hopefully future updates to the > bitmap format will come with updates to this function as well ;-). > > It feels somewhat unsatisfying to print output like: > > $COMMIT_OID <map_pos> <map_pos+4> <map_pos+5> <map_pos+6> > > , but I think it makes sense here for a couple of reasons: > > - If we just print the <map_pos>, then the test is responsible for > knowing the distance between that and the XOR offset, which extends > the brittleness to the test code > > - likewise, if we print out just the map_pos and the positions of the > XOR offset, it feels strange to omit the others. > >> diff --git a/pack-bitmap.h b/pack-bitmap.h >> index 382d39499af2..96880ba3d72d 100644 >> --- a/pack-bitmap.h >> +++ b/pack-bitmap.h >> @@ -81,6 +81,7 @@ void traverse_bitmap_commit_list(struct bitmap_index *, >> show_reachable_fn show_reachable); >> void test_bitmap_walk(struct rev_info *revs); >> int test_bitmap_commits(struct repository *r); >> +int test_bitmap_commits_offset(struct repository *r); >> int test_bitmap_hashes(struct repository *r); >> int test_bitmap_pseudo_merges(struct repository *r); >> int test_bitmap_pseudo_merge_commits(struct repository *r, uint32_t n); >> diff --git a/t/helper/test-bitmap.c b/t/helper/test-bitmap.c >> index 3f23f2107268..65a1ab29192b 100644 >> --- a/t/helper/test-bitmap.c >> +++ b/t/helper/test-bitmap.c >> @@ -10,6 +10,11 @@ static int bitmap_list_commits(void) >> return test_bitmap_commits(the_repository); >> } >> >> +static int bitmap_list_commits_offset(void) >> +{ >> + return test_bitmap_commits_offset(the_repository); >> +} >> + >> static int bitmap_dump_hashes(void) >> { >> return test_bitmap_hashes(the_repository); >> @@ -36,6 +41,8 @@ int cmd__bitmap(int argc, const char **argv) >> >> if (argc == 2 && !strcmp(argv[1], "list-commits")) >> return bitmap_list_commits(); >> + if (argc == 2 && !strcmp(argv[1], "list-commits-offset")) >> + return bitmap_list_commits_offset(); > > All of the scaffolding here looks good. > > This new mode reads a little awkwardly to me (but may not to others, in > which case I am happy to back away from the following suggestion). Can > we either call this 'list-commits-with-offset' or 'list-commit-offsets'? > I have a vague preference towards the former since the new mode has a > string prefix matching the existing mode. > >> + test_expect_success 'load corrupt bitmap' ' >> + rm -fr repo && >> + git init repo && >> + test_when_finished "rm -fr repo" && >> + ( >> + cd repo && >> + git config pack.writeBitmapLookupTable '"$writeLookupTable"' && >> + >> + test_commit base && >> + >> + git repack -adb && >> + bitmap="$(ls .git/objects/pack/pack-*.bitmap)" && >> + chmod +w $bitmap && >> + >> + read oid commit_off xor_off flag_off ewah_off <<-EOF && >> + $(test-tool bitmap list-commits-offset | head -n 1) > > We avoid putting 'git' or 'test-tool' on the left-hand side of a pipe, > since we want to avoid squelching any errors / segfaults from our code. > > How about (assuming the rename above): > > test-tool bitmap list-commit-offsets >offsets && > xor_off="$(head -n1 offsets | awk '{print $3}')" && > ... > > ? > >> + printf '\161' | >> + dd of=$bitmap count=1 bs=1 conv=notrunc seek=$xor_off && >> + >> + >> + git rev-list --count HEAD > expect && >> + git rev-list --use-bitmap-index --count HEAD > actual && > > Using --count can mask failures in the bitmap code since it only guards > you against getting the wrong number of objects, but doesn't guard you > against getting the right amount of objects in a permuted order. I think > we'd want just 'git rev-list --objects' here, but note that you have to > use '--no-object-names' on the non-bitmap side, since rev-list does not > print out paths when using bitmaps[^1]. > > How about: > > git rev-list --objects --no-object-names HEAD >expect.raw && > git rev-list --objects --use-bitmap-index --no-object-names HEAD \ >> actual.raw && > > sort expect.raw >expect && > sort actual.raw >actual && > > test_cmp expect actual > > (Note that you also have to sort the output here since rev-list does not > output objects in a meaningful order when using bitmaps.) > > Thanks, > Taylor > > [^1]: Not that it matters here since we don't expect to load the bitmap > anyway, but it's worth doing regardless. > Good advice, Thanks Lidong