On Mon, Aug 04, 2025 at 11:34:22AM -0700, Junio C Hamano wrote:
> Patrick Steinhardt <ps@xxxxxx> writes:
>
> > On Mon, Aug 04, 2025 at 11:13:28AM +0200, Oswald Buddenhagen wrote:
> >> On Mon, Aug 04, 2025 at 10:17:18AM +0200, Patrick Steinhardt wrote:
> >> > When writing a new commit graph we have a couple of counters that
> >> > provide statistics around what kind of bloom filters we have or have not
> >> > written. These counters naturally count from zero and are only ever
> >> > incremented, but they use a signed integer as type regardless.
> >> >
> >> > Refactor those fields to be of type `size_t` instead.
> >> >
> >> mind elaborating on that choice?
> >
> > We tend to use `size_t` when counting stuff.
>
> And I would have to say that it is wrong and we need to wean
> ourselves from such a superstition. Unless you are measuring how
> big a memory block you ask from the allocator, the platform natural
> integer is often the right type to do the counting.
>
> Each of your "stuff" may weigh N megabytes in core, and if you have
> M of them, you may have to ask (N*2**20)*M bytes of memory from the
> allocator. Your (N*2**20)*M must fit size_t _and_ you must compute
> it without overflowing or wrapping around.
>
> None of the above mean you have to express N in size_t, though.
> And more importantly, nobody gives you any extra guarantee that you
> would compute the result correctly if you used size_t. You can write
> the right code with platform natural integer, and you have to take
> the same care (e.g. by using st_mult()) to catch integer overflows
> even if you used size_t.

Agreed. I think it makes sense to use size_t to keep track of, say, the
length and allocated size of a buffer, but when it comes to "counting"
something that isn't directly related to memory allocation or pointer
arithmetic, size_t is usually not the right choice.

For instance, the MIDX code counts the number of objects and packs in a
given MIDX (and likewise for its base MIDX(s)), but those are all
uint32_t. You could make the case to say that, "well, they are encoded
in the file format as 4-byte unsigned values, so we should treat them
the same way in memory at read-time", and I think that's reasonable.

In that instance, using "int" would be the wrong choice, since I have
definitely seen repositories that have in excess of 2^31-1 objects. But
there is no reason that we shouldn't use size_t to make that count.

> > ... Regarding the data size I
> > don't really think that matters much. It's not like we have hundreds of
> > thousands of commit graphs in-memory at any point in time.
>
> Aren't you saying that a platform natural integer is a much better
> fit?
>
> As to signedness, it sometimes is better for a struct member that is
> used to record the number of "stuff" you have to be a signed integer
> that is initialized to -1 to signal "we haven't counted so we do not
> yet know how many there are". So
>
>     These counters naturally count from zero and are only ever
>     incremented.
>
> is not always a valid excuse to insist that such a variable must be
> unsigned.

I wrote these counters in 312cff5207 (bloom: split 'get_bloom_filter()'
in two, 2020-09-16) and 59f0d5073f (bloom: encode out-of-bounds filters
as non-empty, 2020-09-17), and I don't see a compelling reason that
these should be unsigned.

It's true that we don't have any need for negative values here since we
are counting from zero, but I don't think that alone justifies changing
the signed-ness here.
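To make that distinction concrete, here is a minimal, self-contained
sketch (not code from git.git; checked_mult() below merely stands in
for Git's st_mult(), and the names and sizes are made up for
illustration). The count itself stays a platform-natural int, and the
overflow-checked conversion to size_t happens only where an allocation
size is computed:

/*
 * Standalone illustration: checked_mult() is a stand-in for Git's
 * st_mult() and dies instead of silently wrapping around on overflow.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

static size_t checked_mult(size_t a, size_t b)
{
	if (b && a > SIZE_MAX / b) {
		fprintf(stderr, "fatal: size_t overflow: %zu * %zu\n", a, b);
		exit(1);
	}
	return a * b;
}

int main(void)
{
	/*
	 * The counter is a platform-natural (signed) int; keeping it
	 * signed also leaves room for a "-1 == not yet counted"
	 * sentinel if one were ever needed.
	 */
	int nr_filters = -1;
	size_t filter_size = 256;	/* hypothetical per-filter size in bytes */

	/* ... pretend we counted something ... */
	nr_filters = 1000;

	/*
	 * The conversion to size_t and the overflow check happen only
	 * where the allocation size is computed.
	 */
	size_t total = checked_mult((size_t)nr_filters, filter_size);
	char *buf = malloc(total);
	if (!buf)
		return 1;

	printf("allocated %zu bytes for %d filters\n", total, nr_filters);
	free(buf);
	return 0;
}

Whether such a counter should then be signed or unsigned is a separate
question from how wide the allocation arithmetic needs to be, which is
the distinction being drawn in this thread.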
Is there a reason beyond "these are always non-negative" that changing
the signed-ness is warranted? If so, let's discuss that and make sure
that it is documented in the commit message. If not, I think we could
drop this patch (and optionally the patch before it as well).

Thanks,
Taylor