On Sat, May 03, 2025 at 11:49:28AM -0400, Jeff King wrote:
> On Sat, May 03, 2025 at 10:58:58PM +0800, shejialuo wrote:
>
> > > PS I notice that this same function reads the whole packed-refs file
> > > into a strbuf. That may be a problem, as they can grow pretty big in
> > > extreme cases (e.g., GitHub's fork networks easily got into the
> > > gigabytes, as it was every ref of every fork). We usually mmap it.
> > > Not related to this discussion, but just something I noticed while
> > > reading the function.
> >
> > Peff, thanks for notifying me. I want to know more of the background.
> > Initially, the reason why I didn't use `mmap` is that when checking the
> > ref consistency, we usually don't need to share the "packed-refs"
> > content across multiple processes via `mmap`.
>
> You're not sharing with other processes running fsck, but you'd be
> sharing the memory with all of the other processes using that
> packed-refs file for normal lookups.
>
> But even if it's shared with nobody, reading it all into memory is
> strictly worse than just mmap (since the data is getting copied into the
> new allocation).
>
> > I don't know how GitHub executes "git fsck" for the forked repositories.
> > Are there any regular tasks for "git fsck"? And would the "packed-refs"
> > file be shared for all these repositories?
>
> I don't know offhand how often GitHub runs fsck in an automated way
> these days. Or even how big packed-refs files get, for that matter.

They typically are at most a couple of megabytes, but there certainly
are outliers. For us at GitLab.com, the vast majority (>99%) of such
files are less than 50MB and typically even less than 5MB.

> The specific case I'm thinking of for GitHub is that each fork network
> has a master "network.git" repo that stores the objects for all of the
> forks (which point to it via their objects/info/alternates files). That
> network.git repo doesn't technically need to have all of the refs all
> the time, but in practice it wants to know about them for reachability
> during repacking, etc.
>
> So it has something like "refs/remotes/<fork_id>/heads/master", and so
> on, copying the whole refs/* namespace of each fork. If you look at,
> say, torvalds/linux, the refs data for a single fork is probably ~30k or
> so (based on looking at what's in a clone). And there are ~55k forks. So
> that's around 1.5G. Not a deal-breaker to allocate (keeping in mind they
> have pretty beefy systems), but enough that mmap is probably better.
>
> I'm also sure that's not the worst case. It has a lot of forks, but the
> ref namespace is not that huge compared to some other projects (and it's
> the product of the two that is the problem).

Yeah, the interesting case is always the outliers. One of the worst
offenders we have at GitLab.com is our own "gitlab-org/gitlab"
repository. This particular repository has a "packed-refs" file that is
around 2GB in size.

So I think refactoring this code to use `mmap()` would probably make
sense.
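To illustrate, an untested sketch of what such a helper could look like.
This uses plain POSIX calls for clarity; the actual code would go
through Git's own wrappers (e.g. xmmap()) and error reporting, and
map_packed_refs() is just a made-up name, not anything that exists in
the tree today:

	#include <fcntl.h>
	#include <sys/mman.h>
	#include <sys/stat.h>
	#include <unistd.h>

	/*
	 * Hypothetical helper: map the packed-refs file read-only
	 * instead of copying its contents into a strbuf.
	 */
	static int map_packed_refs(const char *path, char **out, size_t *out_len)
	{
		struct stat st;
		char *p;
		int fd = open(path, O_RDONLY);

		if (fd < 0)
			return -1;
		if (fstat(fd, &st) < 0) {
			close(fd);
			return -1;
		}
		if (!st.st_size) {
			/* mmap(2) rejects zero-length mappings */
			close(fd);
			*out = NULL;
			*out_len = 0;
			return 0;
		}
		p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
		close(fd); /* the mapping survives closing the descriptor */
		if (p == MAP_FAILED)
			return -1;
		*out = p;
		*out_len = st.st_size;
		return 0;
	}

The caller would then be responsible for eventually releasing the
mapping via munmap(*out, *out_len) once fsck is done with it, and the
read-only private mapping means the pages can be shared with other
processes that have the same file mapped.

Patrick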