On Sat, May 03, 2025 at 11:49:28AM -0400, Jeff King wrote:
> On Sat, May 03, 2025 at 10:58:58PM +0800, shejialuo wrote:
>
> > > PS I notice that this same function reads the whole packed-refs file
> > > into a strbuf. That may be a problem, as they can grow pretty big in
> > > extreme cases (e.g., GitHub's fork networks easily got into the
> > > gigabytes, as it was every ref of every fork). We usually mmap it.
> > > Not related to this discussion, but just something I noticed while
> > > reading the function.
> >
> > Peff, thanks for notifying me. I want to know more of the background.
> > Initially, the reason why I didn't use `mmap` is that when checking the
> > ref consistency, we usually don't need to share the "packed-refs"
> > content across multiple processes via `mmap`.
>
> You're not sharing with other processes running fsck, but you'd be
> sharing the memory with all of the other processes using that
> packed-refs file for normal lookups.
>
> But even if it's shared with nobody, reading it all into memory is
> strictly worse than just mmap (since the data is getting copied into the
> new allocation).
>
> > I don't know how GitHub executes "git fsck" for the forked repositories.
> > Are there any regular tasks for "git fsck"? And would the "packed-refs"
> > file be shared for all these repositories?
>
> I don't know offhand how often GitHub runs fsck in an automated way
> these days. Or even how big packed-refs files get, for that matter.

They typically are at most a couple of megabytes, but there certainly
are outliers. For us at GitLab.com, the vast majority (>99%) of such
files are less than 50MB and typically even less than 5MB.

> The specific case I'm thinking of for GitHub is that each fork network
> has a master "network.git" repo that stores the objects for all of the
> forks (which point to it via their objects/info/alternates files). That
> network.git repo doesn't technically need to have all of the refs all
> the time, but in practice it wants to know about them for reachability
> during repacking, etc.
>
> So it has something like "refs/remotes/<fork_id>/heads/master", and so
> on, copying the whole refs/* namespace of each fork. If you look at,
> say, torvalds/linux, the refs data for a single fork is probably ~30k or
> so (based on looking at what's in a clone). And there are ~55k forks. So
> that's around 1.5G. Not a deal-breaker to allocate (keeping in mind they
> have pretty beefy systems), but enough that mmap is probably better.
>
> I'm also sure that's not the worst case. It has a lot of forks, but the
> ref namespace is not that huge compared to some other projects (and it's
> the product of the two that is the problem).

Yeah, the interesting case is always the outliers. One of the worst
offenders we have at GitLab.com is our own "gitlab-org/gitlab"
repository. This particular repository has a "packed-refs" file that is
around 2GB in size.

So I think refactoring this code to use `mmap()` would probably make
sense.
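To illustrate, an untested sketch of what such a helper could look like.
This uses plain POSIX calls for clarity; the actual code would go
through Git's own wrappers (e.g. xmmap()) and error reporting, and
map_packed_refs() is just a made-up name, not anything that exists in
the tree today:

	#include <fcntl.h>
	#include <sys/mman.h>
	#include <sys/stat.h>
	#include <unistd.h>

	/*
	 * Hypothetical helper: map the packed-refs file read-only
	 * instead of copying its contents into a strbuf.
	 */
	static int map_packed_refs(const char *path, char **out, size_t *out_len)
	{
		struct stat st;
		char *p;
		int fd = open(path, O_RDONLY);

		if (fd < 0)
			return -1;
		if (fstat(fd, &st) < 0) {
			close(fd);
			return -1;
		}
		if (!st.st_size) {
			/* mmap(2) rejects zero-length mappings */
			close(fd);
			*out = NULL;
			*out_len = 0;
			return 0;
		}
		p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
		close(fd); /* the mapping survives closing the descriptor */
		if (p == MAP_FAILED)
			return -1;
		*out = p;
		*out_len = st.st_size;
		return 0;
	}

The caller would then be responsible for eventually releasing the
mapping via munmap(*out, *out_len) once fsck is done with it, and the
read-only private mapping means the pages can be shared with other
processes that have the same file mapped.

Patrick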