On Sat, May 03, 2025 at 10:58:58PM +0800, shejialuo wrote:

> > PS I notice that this same function reads the whole packed-refs file
> > into a strbuf. That may be a problem, as they can grow pretty big in
> > extreme cases (e.g., GitHub's fork networks easily got into the
> > gigabytes, as it was every ref of every fork). We usually mmap it.
> > Not related to this discussion, but just something I noticed while
> > reading the function.
>
> Peff, thanks for notifying me. I want to know more background.
> Initially, the reason I didn't use `mmap` is that when checking ref
> consistency, we usually don't need to share the "packed-refs" content
> among multiple processes via `mmap`.

You're not sharing with other processes running fsck, but you'd be
sharing the memory with all of the other processes using that
packed-refs file for normal lookups. But even if it's shared with
nobody, reading it all into memory is strictly worse than just
mmap-ing it (since the data is getting copied into the new
allocation).

> I don't know how GitHub executes "git fsck" for the forked
> repositories. Are there any regular tasks for "git fsck"? And would
> the "packed-refs" file be shared across all these repositories?

I don't know offhand how often GitHub runs fsck in an automated way
these days. Or even how big packed-refs files get, for that matter.

The specific case I'm thinking of for GitHub is that each fork network
has a master "network.git" repo that stores the objects for all of the
forks (which point to it via their objects/info/alternates files). That
network.git repo doesn't technically need to have all of the refs all
the time, but in practice it wants to know about them for reachability
during repacking, etc. So it has something like
"refs/remotes/<fork_id>/heads/master", and so on, copying the whole
refs/* namespace of each fork.

If you look at, say, torvalds/linux, the refs data for a single fork is
probably ~30k or so (based on looking at what's in a clone). And there
are ~55k forks, so that's around 1.5G (30k bytes times 55k forks is
roughly 1.65e9 bytes). Not a deal-breaker to allocate (keeping in mind
they have pretty beefy systems), but enough that mmap is probably
better.

I'm also sure that's not the worst case. It has a lot of forks, but
the ref namespace is not that huge compared to some other projects
(and it's the product of the two that is the problem).

> If the above is the case, I agree that we should reuse the logic of
> "load_contents" to enhance this. But I don't know whether we need to
> do that in the first place.

I think you can skip the stat validity bits. In theory you can also
skip the mmap_strategy stuff, but I guess it might mean that "fsck"
could block other writers on Windows temporarily (though we wouldn't
plan to hold the file open for long, the way the normal reader does).

The other gotcha is that the result won't be NUL-terminated, but it
looks like the helper functions already take an "eof" pointer to avoid
looking past the end of what was read.

-Peff
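PS For concreteness, here is a rough sketch, not Git's actual code, of
what a one-shot mmap read of packed-refs might look like. The helper
fsck_packed_refs_buf() is a hypothetical stand-in for the existing
eof-aware parsing helpers, and error handling is abbreviated:

  #include <fcntl.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>

  /* hypothetical eof-bounded parser; sketched further below */
  static int fsck_packed_refs_buf(const char *p, const char *eof);

  static int fsck_packed_refs(const char *path)
  {
          struct stat st;
          char *buf;
          int fd, ret;

          fd = open(path, O_RDONLY);
          if (fd < 0)
                  return -1;
          if (fstat(fd, &st) < 0) {
                  close(fd);
                  return -1;
          }
          if (!st.st_size) {
                  close(fd);
                  return 0; /* an empty file has nothing to check */
          }

          /* No stat-validity caching here; fsck reads the file once. */
          buf = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
          close(fd); /* the mapping outlives the descriptor */
          if (buf == MAP_FAILED)
                  return -1;

          /*
           * The mapping is not NUL-terminated, so the parser is bounded
           * by an explicit "eof" pointer instead of a terminator.
           */
          ret = fsck_packed_refs_buf(buf, buf + st.st_size);

          munmap(buf, st.st_size);
          return ret;
  }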
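PPS And the eof-bounded scan itself might look roughly like this
(again just a sketch; the per-line validation is elided):

  #include <string.h>

  static int fsck_packed_refs_buf(const char *p, const char *eof)
  {
          while (p < eof) {
                  const char *nl = memchr(p, '\n', eof - p);
                  if (!nl)
                          return -1; /* truncated final line */
                  /* ... validate one packed-refs line in [p, nl) ... */
                  p = nl + 1;
          }
          return 0;
  }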