Re: [PATCH] git-compat-util: introduce `count_t` typedef

Matthias Aßhauer <mha1993@xxxxxxx> · Thu, 7 Aug 2025 13:00:24 +0200

On Thu, 7 Aug 2025, Patrick Steinhardt wrote:

Historically, Git has been very lenient with its use of integer types
and didn't really give much thought into which type to use in what
situation. We interchangeably mix and match signed and unsigned types
and often times blindly convert them. This use has led to several
out-of-bounds reads and writes in the past, some of which could be
turned into arbitrary code execution.

As a counter measure we have eventually enabled "-Wsign-compare"
warnings. Most of our code base generates heaps of warnings, which is
why we have a macro `DISABLE_SIGN_COMPARE_WARNINGS` defined for every
such file. The expectation is that slowly but surely we'll convert our
code base to have better hygiene around signedness, and new code that is
being added handles types correctly from the start.

There are regular discussions around whether or not these warnings are
sensible to have in the first place. My (biased) opinion with having
fixed several out-of-bounds reads and writes is that they are senisble,
as they would have provided warnings around code sites that had those
issues. And arguably, we still have _lots_ of sites that are susceptible
to using the wrong type, and more likely than not some of those will be
exploitable.

Furthermore, I would claim that the question of whether or not those
warnings are helpful wouldn't have come up if we had the warnings
enabled from the inception of Git. The churn caused by the fixes for
such warnings is real, and they need to be done with a lot of care. But
since we have removed this project from our microprojects page we don't
see "random" contributions in this area anymore.

So overall, the conversions are on the painful side, but in the long
term they will help us to protect against introducing new exploits.

A discussion that regularly comes up in this context though is what
types to use for counting entities:

 - One question is whether the type should be signed or unsigned.
   Arguably, the answer should be to use unsigned types as long as we
   know that we never need a negative value, e.g. as a sentinel. This
   helps guide the reader and explicitly conveys the sense that such a
   counter is only ever going to be a non-negative number. Otherwise,
   code would need to be more careful as it may hold negative values.

 - Another question is what type to use. In lots of situations we have
   used `size_t`, but this is conflating semantics. `size_t` is used to
   count bytes, not entities.

Introduce a new typedef for `count_t` that is of type `uintptr_t` to
give clear guidance what type to use for counting entities. This type
was chosen because in the worst case, an entity may be a single byte and
we fill all of our memory with these entities. As `uintptr_t` is
guaranteed to hold at least the value of a pointer, we know that it
could be used to index into every single such entity.

Amend the coding guidelines to state when to use `size_t` and when to
use `count_t`. Convert an example file to use the new type.

Signed-off-by: Patrick Steinhardt <ps@xxxxxx>
---
Documentation/CodingGuidelines |  3 +++
builtin/rm.c                   | 25 ++++++++++++-------------
git-compat-util.h              | 15 +++++++++++++++
3 files changed, 30 insertions(+), 13 deletions(-)

diff --git a/Documentation/CodingGuidelines b/Documentation/CodingGuidelines
index 224f0978a8..2e9f3c07ff 100644
--- a/Documentation/CodingGuidelines
+++ b/Documentation/CodingGuidelines
@@ -238,6 +238,9 @@ For shell scripts specifically (not exhaustive):

For C programs:

+ - We use `size_t` to count the number of bytes and `count_t` to count the
+   number of entities of a given type.
+
 - We use tabs to indent, and interpret tabs as taking up to
   8 spaces.

diff --git a/builtin/rm.c b/builtin/rm.c
index 05d89e98c3..99b845cf34 100644
--- a/builtin/rm.c
+++ b/builtin/rm.c
@@ -33,11 +33,11 @@ static const char * const builtin_rm_usage[] = {
};

static struct {
-	int nr, alloc;
	struct {
		const char *name;
		char is_submodule;
	} *entry;
+	count_t entry_nr, entry_alloc;
} list;

static int get_ours_cache_pos(const char *path, unsigned int pos)
@@ -73,8 +73,7 @@ static void print_error_files(struct string_list *files_list,

static void submodules_absorb_gitdir_if_needed(void)
{
-	int i;
-	for (i = 0; i < list.nr; i++) {
+	for (count_t i = 0; i < list.entry_nr; i++) {
		const char *name = list.entry[i].name;
		int pos;
		const struct cache_entry *ce;
@@ -106,14 +105,14 @@ static int check_local_mod(struct object_id *head, int index_only)
	 * lazy, and who cares if removal of files is a tad
	 * slower than the theoretical maximum speed?
	 */
-	int i, no_head;
+	int no_head;
	int errs = 0;
	struct string_list files_staged = STRING_LIST_INIT_NODUP;
	struct string_list files_cached = STRING_LIST_INIT_NODUP;
	struct string_list files_local = STRING_LIST_INIT_NODUP;

	no_head = is_null_oid(head);
-	for (i = 0; i < list.nr; i++) {
+	for (count_t i = 0; i < list.entry_nr; i++) {
		struct stat st;
		int pos;
		const struct cache_entry *ce;
@@ -268,7 +267,7 @@ int cmd_rm(int argc,
	   struct repository *repo UNUSED)
{
	struct lock_file lock_file = LOCK_INIT;
-	int i, ret = 0;
+	int ret = 0;
	struct pathspec pathspec;
	char *seen;

@@ -321,10 +320,10 @@ int cmd_rm(int argc,
			continue;
		if (!ce_path_match(the_repository->index, ce, &pathspec, seen))
			continue;
-		ALLOC_GROW(list.entry, list.nr + 1, list.alloc);
-		list.entry[list.nr].name = xstrdup(ce->name);
-		list.entry[list.nr].is_submodule = S_ISGITLINK(ce->ce_mode);
-		if (list.entry[list.nr++].is_submodule &&
+		ALLOC_GROW(list.entry, list.entry_nr + 1, list.entry_alloc);
+		list.entry[list.entry_nr].name = xstrdup(ce->name);
+		list.entry[list.entry_nr].is_submodule = S_ISGITLINK(ce->ce_mode);
+		if (list.entry[list.entry_nr++].is_submodule &&
		    !is_staging_gitmodules_ok(the_repository->index))
			die(_("please stage your changes to .gitmodules or stash them to proceed"));
	}

This hunk doesn't deal with count_t at all. Should we split the renaming 
of nr and alloc into a separate patch?
@@ -335,7 +334,7 @@ int cmd_rm(int argc,
		char *skip_worktree_seen = NULL;
		struct string_list only_match_skip_worktree = STRING_LIST_INIT_NODUP;

-		for (i = 0; i < pathspec.nr; i++) {
+		for (int i = 0; i < pathspec.nr; i++) {

Is this i intentionally still an int?

			original = pathspec.items[i].original;
			if (seen[i])
				seen_any = 1;
@@ -390,7 +389,7 @@ int cmd_rm(int argc,
	 * First remove the names from the index: we won't commit
	 * the index unless all of them succeed.
	 */
-	for (i = 0; i < list.nr; i++) {
+	for (count_t i = 0; i < list.entry_nr; i++) {
		const char *path = list.entry[i].name;
		if (!quiet)
			printf("rm '%s'\n", path);
@@ -414,7 +413,7 @@ int cmd_rm(int argc,
		int removed = 0, gitmodules_modified = 0;
		struct strbuf buf = STRBUF_INIT;
		int flag = force ? REMOVE_DIR_PURGE_ORIGINAL_CWD : 0;
-		for (i = 0; i < list.nr; i++) {
+		for (count_t i = 0; i < list.entry_nr; i++) {
			const char *path = list.entry[i].name;
			if (list.entry[i].is_submodule) {
				strbuf_reset(&buf);
diff --git a/git-compat-util.h b/git-compat-util.h
index 9408f463e3..e9c30d59e8 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -610,6 +610,21 @@ static inline bool strip_suffix(const char *str, const char *suffix,
int git_open_cloexec(const char *name, int flags);
#define git_open(name) git_open_cloexec(name, O_RDONLY)

+/*
+ * The type used to count the number of entities, e.g. in an array. We have
+ * historically used `size_t` for this, but `size_t` is expected to count the
+ * maximum number of _bytes_, not entities.
+ *
+ * The counter is unsigned. If you need to store sentinel values like `-1` you
+ * should use a different type.

Do we want to make a recommendation of a "different type" here to keep 
things consistent?
+ *
+ * Note that we pick `uintptr_t` because in the theoretical worst case, every
+ * entity is a single byte and we populate the entire address space with them.
+ * As `uintptr_t` is able to point to every addressable byte it would also be
+ * able to count them all.
+ */
+typedef uintptr_t count_t;
+
static inline size_t st_add(size_t a, size_t b)
{
	if (unsigned_add_overflows(a, b))

---
base-commit: 64cbe5e2e8a7b0f92c780b210e602496bd5cad0f
change-id: 20250807-pks-introduce-count-t-0f4499f80221



Best regards

Matthias