On Mon, May 05, 2025 at 08:39:27AM +0200, Patrick Steinhardt wrote:
> On Sat, May 03, 2025 at 11:49:28AM -0400, Jeff King wrote:
> > On Sat, May 03, 2025 at 10:58:58PM +0800, shejialuo wrote:
> >
> > > > PS I notice that this same function reads the whole packed-refs
> > > > file into a strbuf. That may be a problem, as they can grow pretty
> > > > big in extreme cases (e.g., GitHub's fork networks easily got into
> > > > the gigabytes, as it was every ref of every fork). We usually mmap
> > > > it. Not related to this discussion, but just something I noticed
> > > > while reading the function.
> > >
> > > Peff, thanks for notifying me. I'd like to know more of the
> > > background. Initially, the reason I didn't use `mmap` is that when
> > > checking the ref consistency, we usually don't need to share the
> > > "packed-refs" content between multiple processes via `mmap`.
> >
> > You're not sharing with other processes running fsck, but you'd be
> > sharing the memory with all of the other processes using that
> > packed-refs file for normal lookups.
> >
> > But even if it's shared with nobody, reading it all into memory is
> > strictly worse than just mmap (since the data is getting copied into
> > the new allocation).
> >
> > > I don't know how GitHub executes "git fsck" for the forked
> > > repositories. Are there any regular tasks for "git fsck"? And would
> > > the "packed-refs" file be shared for all these repositories?
> >
> > I don't know offhand how often GitHub runs fsck in an automated way
> > these days. Or even how big packed-refs files get, for that matter.
>
> They typically are at most a couple of megabytes, but there certainly
> are outliers. For us at GitLab.com, the vast majority (>99%) of such
> files are less than 50MB, and typically even less than 5MB.
>
> > The specific case I'm thinking of for GitHub is that each fork network
> > has a master "network.git" repo that stores the objects for all of the
> > forks (which point to it via their objects/info/alternates files). That
> > network.git repo doesn't technically need to have all of the refs all
> > the time, but in practice it wants to know about them for reachability
> > during repacking, etc.
> >
> > So it has something like "refs/remotes/<fork_id>/heads/master", and so
> > on, copying the whole refs/* namespace of each fork. If you look at,
> > say, torvalds/linux, the refs data for a single fork is probably ~30k
> > or so (based on looking at what's in a clone). And there are ~55k
> > forks. So that's around 1.5G. Not a deal-breaker to allocate (keeping
> > in mind they have pretty beefy systems), but enough that mmap is
> > probably better.
> >
> > I'm also sure that's not the worst case. It has a lot of forks, but
> > the ref namespace is not that huge compared to some other projects
> > (and it's the product of the two that is the problem).
>
> Yeah, the interesting case is always the outliers. One of the worst
> offenders we have at GitLab.com is our own "gitlab-org/gitlab"
> repository. This particular repository has a "packed-refs" file that is
> around 2GB in size.
>
> So I think refactoring this code to use `mmap()` would probably make
> sense.
>

Thanks, Peff and Patrick, for the information. I will send a patch
later.

> Patrick
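
To sketch the direction before sending the patch: map the file
read-only and walk it in place, rather than copying it into a strbuf.
This is only a minimal sketch using plain POSIX mmap(2);
scan_packed_refs() and the per-record check are placeholder names, and
an in-tree patch would go through Git's own wrappers (e.g. xmmap())
instead of raw syscalls:

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Sketch only: map "packed-refs" read-only and scan it line by line
 * without copying the file contents into a separate allocation.
 */
static int scan_packed_refs(const char *path)
{
	struct stat st;
	const char *buf, *p, *end;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0) {
		close(fd);
		return -1;
	}
	if (!st.st_size) {
		close(fd);
		return 0; /* empty file: nothing to check */
	}

	buf = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	close(fd); /* the mapping stays valid after close(2) */
	if (buf == MAP_FAILED)
		return -1;

	end = buf + st.st_size;
	for (p = buf; p < end; ) {
		const char *eol = memchr(p, '\n', end - p);
		size_t len = eol ? (size_t)(eol - p) : (size_t)(end - p);

		/* ... consistency-check one record of 'len' bytes at 'p' ... */
		(void)len;

		p = eol ? eol + 1 : end;
	}

	munmap((void *)buf, st.st_size);
	return 0;
}

Since the mapping is PROT_READ, the pages come straight from the page
cache, so concurrent readers of the same "packed-refs" file share them
instead of each holding a private multi-gigabyte allocation, which is
exactly the concern raised above for the outlier repositories.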