On Mon, Aug 04, 2025 at 11:34:22AM -0700, Junio C Hamano wrote:
> Patrick Steinhardt <ps@xxxxxx> writes:
>
> > On Mon, Aug 04, 2025 at 11:13:28AM +0200, Oswald Buddenhagen wrote:
> >> On Mon, Aug 04, 2025 at 10:17:18AM +0200, Patrick Steinhardt wrote:
> >> > When writing a new commit graph we have a couple of counters that
> >> > provide statistics around what kind of bloom filters we have or have not
> >> > written. These counters naturally count from zero and are only ever
> >> > incremented, but they use a signed integer as type regardless.
> >> >
> >> > Refactor those fields to be of type `size_t` instead.
> >> >
> >> mind elaborating on that choice?
> >
> > We tend to use `size_t` when counting stuff.
>
> And I would have to say that it is wrong and we need to wean
> ourselves from such a superstition. Unless you are measuring how
> big a memory block you ask from the allocator, the platform natural
> integer is often the right type to do the counting.
>
> Each of your "stuff" may weigh N megabytes in core, and if you have
> M of them, you may have to ask (N*2**20)*M bytes of memory from the
> allocator. Your (N*2**20)*M must fit size_t _and_ you must compute
> it without overflowing or wrapping around.
>
> None of the above mean you have to express N in size_t, though.
> And more importantly, nobody gives you any extra guarantee that you
> would compute the result correctly if you used size_t. You can write
> the right code with platform natural integer, and you have to take
> the same care (e.g. by using st_mult()) to catch integer overflows
> even if you used size_t.

Agreed. I think it makes sense to use size_t to keep track of, say, the
length and allocated size of a buffer, but when it comes to "counting"
something that isn't directly related to memory allocation or pointer
arithmetic, size_t is usually not the right choice.

For instance, the MIDX code counts the number of objects and packs in a
given MIDX (and likewise for its base MIDX(s)), but those are all
uint32_t. You could make the case to say that, "well, they are encoded
in the file format as 4-byte unsigned values, so we should treat them
the same way in memory at read-time", and I think that's reasonable.

In that instance, using "int" would be the wrong choice, since I have
definitely seen repositories that have in excess of 2^31-1 objects. But
there is no reason that we shouldn't use size_t to make that count.

> > ... Regarding the data size I
> > don't really think that matters much. It's not like we have hundreds of
> > thousands of commit graphs in-memory at any point in time.
>
> Aren't you saying that a platform natural integer is a much better
> fit?
>
> As to signedness, it sometimes is better for a struct member that is
> used to record the number of "stuff" you have to be a signed integer
> that is initialized to -1 to signal "we haven't counted so we do not
> yet know how many there are". So
>
>     These counters naturally count from zero and are only ever
>     incremented.
>
> is not always a valid excuse to insist that such a variable must be
> unsigned.

I wrote these counters in 312cff5207 (bloom: split 'get_bloom_filter()'
in two, 2020-09-16) and 59f0d5073f (bloom: encode out-of-bounds filters
as non-empty, 2020-09-17), and I don't see a compelling reason that
these should be unsigned.

It's true that we don't have any need for negative values here since we
are counting from zero, but I don't think that alone justifies changing
the signed-ness here.
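To make that distinction concrete, here is a minimal, self-contained
sketch (not code from git.git; checked_mult() below merely stands in
for Git's st_mult(), and the names and sizes are made up for
illustration). The count itself stays a platform-natural int, and the
overflow-checked conversion to size_t happens only where an allocation
size is computed:

/*
 * Standalone illustration: checked_mult() is a stand-in for Git's
 * st_mult() and dies instead of silently wrapping around on overflow.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

static size_t checked_mult(size_t a, size_t b)
{
	if (b && a > SIZE_MAX / b) {
		fprintf(stderr, "fatal: size_t overflow: %zu * %zu\n", a, b);
		exit(1);
	}
	return a * b;
}

int main(void)
{
	/*
	 * The counter is a platform-natural (signed) int; keeping it
	 * signed also leaves room for a "-1 == not yet counted"
	 * sentinel if one were ever needed.
	 */
	int nr_filters = -1;
	size_t filter_size = 256;	/* hypothetical per-filter size in bytes */

	/* ... pretend we counted something ... */
	nr_filters = 1000;

	/*
	 * The conversion to size_t and the overflow check happen only
	 * where the allocation size is computed.
	 */
	size_t total = checked_mult((size_t)nr_filters, filter_size);
	char *buf = malloc(total);
	if (!buf)
		return 1;

	printf("allocated %zu bytes for %d filters\n", total, nr_filters);
	free(buf);
	return 0;
}

Whether such a counter should then be signed or unsigned is a separate
question from how wide the allocation arithmetic needs to be, which is
the distinction being drawn in this thread.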
Is there a reason beyond "these are always non-negative" that changing
the signed-ness is warranted? If so, let's discuss that and make sure
that it is documented in the commit message. If not, I think we could
drop this patch (and optionally the patch before it as well).

Thanks,
Taylor