On Mon, May 05, 2025 at 08:39:27AM +0200, Patrick Steinhardt wrote:
> On Sat, May 03, 2025 at 11:49:28AM -0400, Jeff King wrote:
> > On Sat, May 03, 2025 at 10:58:58PM +0800, shejialuo wrote:
> >
> > > > PS I notice that this same function reads the whole packed-refs
> > > > file into a strbuf. That may be a problem, as they can grow pretty
> > > > big in extreme cases (e.g., GitHub's fork networks easily got into
> > > > the gigabytes, as it was every ref of every fork). We usually mmap
> > > > it. Not related to this discussion, but just something I noticed
> > > > while reading the function.
> > >
> > > Peff, thanks for notifying me. I'd like to know more of the
> > > background. Initially, the reason I didn't use `mmap` is that when
> > > checking the ref consistency, we usually don't need to share the
> > > "packed-refs" content between multiple processes via `mmap`.
> >
> > You're not sharing with other processes running fsck, but you'd be
> > sharing the memory with all of the other processes using that
> > packed-refs file for normal lookups.
> >
> > But even if it's shared with nobody, reading it all into memory is
> > strictly worse than just mmap (since the data is getting copied into
> > the new allocation).
> >
> > > I don't know how GitHub executes "git fsck" for the forked
> > > repositories. Are there any regular tasks for "git fsck"? And would
> > > the "packed-refs" file be shared for all these repositories?
> >
> > I don't know offhand how often GitHub runs fsck in an automated way
> > these days. Or even how big packed-refs files get, for that matter.
>
> They typically are at most a couple of megabytes, but there certainly
> are outliers. For us at GitLab.com, the vast majority (>99%) of such
> files are less than 50MB, and typically even less than 5MB.
>
> > The specific case I'm thinking of for GitHub is that each fork network
> > has a master "network.git" repo that stores the objects for all of the
> > forks (which point to it via their objects/info/alternates files). That
> > network.git repo doesn't technically need to have all of the refs all
> > the time, but in practice it wants to know about them for reachability
> > during repacking, etc.
> >
> > So it has something like "refs/remotes/<fork_id>/heads/master", and so
> > on, copying the whole refs/* namespace of each fork. If you look at,
> > say, torvalds/linux, the refs data for a single fork is probably ~30k
> > or so (based on looking at what's in a clone). And there are ~55k
> > forks. So that's around 1.5G. Not a deal-breaker to allocate (keeping
> > in mind they have pretty beefy systems), but enough that mmap is
> > probably better.
> >
> > I'm also sure that's not the worst case. It has a lot of forks, but
> > the ref namespace is not that huge compared to some other projects
> > (and it's the product of the two that is the problem).
>
> Yeah, the interesting case is always the outliers. One of the worst
> offenders we have at GitLab.com is our own "gitlab-org/gitlab"
> repository. This particular repository has a "packed-refs" file that is
> around 2GB in size.
>
> So I think refactoring this code to use `mmap()` would probably make
> sense.
>

Thanks, Peff and Patrick, for the information. I will send a patch
later.

> Patrick
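
To sketch the direction before sending the patch: map the file
read-only and walk it in place, rather than copying it into a strbuf.
This is only a minimal sketch using plain POSIX mmap(2);
scan_packed_refs() and the per-record check are placeholder names, and
an in-tree patch would go through Git's own wrappers (e.g. xmmap())
instead of raw syscalls:

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Sketch only: map "packed-refs" read-only and scan it line by line
 * without copying the file contents into a separate allocation.
 */
static int scan_packed_refs(const char *path)
{
	struct stat st;
	const char *buf, *p, *end;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0) {
		close(fd);
		return -1;
	}
	if (!st.st_size) {
		close(fd);
		return 0; /* empty file: nothing to check */
	}

	buf = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	close(fd); /* the mapping stays valid after close(2) */
	if (buf == MAP_FAILED)
		return -1;

	end = buf + st.st_size;
	for (p = buf; p < end; ) {
		const char *eol = memchr(p, '\n', end - p);
		size_t len = eol ? (size_t)(eol - p) : (size_t)(end - p);

		/* ... consistency-check one record of 'len' bytes at 'p' ... */
		(void)len;

		p = eol ? eol + 1 : end;
	}

	munmap((void *)buf, st.st_size);
	return 0;
}

Since the mapping is PROT_READ, the pages come straight from the page
cache, so concurrent readers of the same "packed-refs" file share them
instead of each holding a private multi-gigabyte allocation, which is
exactly the concern raised above for the outlier repositories.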