On Sat, May 03, 2025 at 10:58:58PM +0800, shejialuo wrote:

> > PS I notice that this same function reads the whole packed-refs file
> > into a strbuf. That may be a problem, as they can grow pretty big in
> > extreme cases (e.g., GitHub's fork networks easily got into the
> > gigabytes, as it was every ref of every fork). We usually mmap it.
> > Not related to this discussion, but just something I noticed while
> > reading the function.
>
> Peff, thanks for notifying me. I want to know more background.
> Initially, the reason I didn't use `mmap` is that when checking ref
> consistency, we usually don't need to share the "packed-refs" content
> among multiple processes via `mmap`.

You're not sharing with other processes running fsck, but you'd be
sharing the memory with all of the other processes using that
packed-refs file for normal lookups. But even if it's shared with
nobody, reading it all into memory is strictly worse than just
mmap-ing it (since the data is getting copied into the new
allocation).

> I don't know how GitHub executes "git fsck" for the forked
> repositories. Are there any regular tasks for "git fsck"? And would
> the "packed-refs" file be shared across all these repositories?

I don't know offhand how often GitHub runs fsck in an automated way
these days. Or even how big packed-refs files get, for that matter.

The specific case I'm thinking of for GitHub is that each fork network
has a master "network.git" repo that stores the objects for all of the
forks (which point to it via their objects/info/alternates files). That
network.git repo doesn't technically need to have all of the refs all
the time, but in practice it wants to know about them for reachability
during repacking, etc. So it has something like
"refs/remotes/<fork_id>/heads/master", and so on, copying the whole
refs/* namespace of each fork.

If you look at, say, torvalds/linux, the refs data for a single fork is
probably ~30k or so (based on looking at what's in a clone). And there
are ~55k forks, so that's around 1.5G (30k bytes times 55k forks is
roughly 1.65e9 bytes). Not a deal-breaker to allocate (keeping in mind
they have pretty beefy systems), but enough that mmap is probably
better.

I'm also sure that's not the worst case. It has a lot of forks, but
the ref namespace is not that huge compared to some other projects
(and it's the product of the two that is the problem).

> If the above is the case, I agree that we should reuse the logic of
> "load_contents" to enhance this. But I don't know whether we need to
> do that in the first place.

I think you can skip the stat validity bits. In theory you can also
skip the mmap_strategy stuff, but I guess it might mean that "fsck"
could block other writers on Windows temporarily (though we wouldn't
plan to hold the file open for long, the way the normal reader does).

The other gotcha is that the result won't be NUL-terminated, but it
looks like the helper functions already take an "eof" pointer to avoid
looking past the end of what was read.

-Peff
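PS For concreteness, here is a rough sketch, not Git's actual code, of
what a one-shot mmap read of packed-refs might look like. The helper
fsck_packed_refs_buf() is a hypothetical stand-in for the existing
eof-aware parsing helpers, and error handling is abbreviated:

  #include <fcntl.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>

  /* hypothetical eof-bounded parser; sketched further below */
  static int fsck_packed_refs_buf(const char *p, const char *eof);

  static int fsck_packed_refs(const char *path)
  {
          struct stat st;
          char *buf;
          int fd, ret;

          fd = open(path, O_RDONLY);
          if (fd < 0)
                  return -1;
          if (fstat(fd, &st) < 0) {
                  close(fd);
                  return -1;
          }
          if (!st.st_size) {
                  close(fd);
                  return 0; /* an empty file has nothing to check */
          }

          /* No stat-validity caching here; fsck reads the file once. */
          buf = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
          close(fd); /* the mapping outlives the descriptor */
          if (buf == MAP_FAILED)
                  return -1;

          /*
           * The mapping is not NUL-terminated, so the parser is bounded
           * by an explicit "eof" pointer instead of a terminator.
           */
          ret = fsck_packed_refs_buf(buf, buf + st.st_size);

          munmap(buf, st.st_size);
          return ret;
  }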
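PPS And the eof-bounded scan itself might look roughly like this
(again just a sketch; the per-line validation is elided):

  #include <string.h>

  static int fsck_packed_refs_buf(const char *p, const char *eof)
  {
          while (p < eof) {
                  const char *nl = memchr(p, '\n', eof - p);
                  if (!nl)
                          return -1; /* truncated final line */
                  /* ... validate one packed-refs line in [p, nl) ... */
                  p = nl + 1;
          }
          return 0;
  }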