Hello, At GitLab, we noticed that bundle creation doesn't seem to scale linearly the number of references in a repository. The following benchmark demostrates the issue: Benchmark 1: bundle (refcount = 100) Time (mean ± σ): 4.4 ms ± 0.5 ms [User: 1.8 ms, System: 2.4 ms] Range (min … max): 3.4 ms … 7.7 ms 434 runs Benchmark 2: bundle (refcount = 1000) Time (mean ± σ): 16.5 ms ± 1.7 ms [User: 9.6 ms, System: 7.2 ms] Range (min … max): 14.1 ms … 21.7 ms 176 runs Benchmark 3: bundle (refcount = 10000) Time (mean ± σ): 220.6 ms ± 3.2 ms [User: 171.6 ms, System: 55.7 ms] Range (min … max): 215.8 ms … 224.9 ms 13 runs Benchmark 4: bundle (refcount = 100000) Time (mean ± σ): 9.622 s ± 0.063 s [User: 9.143 s, System: 0.546 s] Range (min … max): 9.563 s … 9.738 s 10 runs Summary bundle (refcount = 100) ran 3.79 ± 0.61 times faster than bundle (refcount = 1000) 50.63 ± 6.39 times faster than bundle (refcount = 10000) 2207.95 ± 277.35 times faster than bundle (refcount = 100000) Digging into this, the reason for this is because we check for duplicate refnames added by the user. But this check uses an O(N^2) algorithm, which would not scale linearly with the number of refs. The first commit in this small series adds a bunch of tests for this behavior, while also discovering a missed edge case. The second commit introduces an alternative approach which uses an 'strset' to check for duplicates. The new approach fixes the performance problems noticed while also fixing the earlier missed edge case. Overall we see a 6x performance improvement with this series. I found that there is a conflict with 'ps/object-wo-the-repository' in seen, the resolution seems simple enough. Happy to support as needed. --- Changes in v2: - Use STRSET_INIT macro instead of strset_init(). - Link to v1: https://lore.kernel.org/r/20250401-488-generating-bundles-with-many-references-has-non-linear-performance-v1-0-6d23b2d96557@xxxxxxxxx --- bundle.c | 8 +++++++- object.c | 33 ------------------------------- object.h | 6 ------ t/t6020-bundle-misc.sh | 53 ++++++++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 60 insertions(+), 40 deletions(-) Karthik Nayak (2): t6020: test for duplicate refnames in bundle creation bundle: fix non-linear performance scaling with refs --- Range-diff versus v1: 1: 1fd141cd34 = 1: 2b608cb908 t6020: test for duplicate refnames in bundle creation 2: 25b86d1a6c ! 2: 054389419f bundle: fix non-linear performance scaling with refs @@ bundle.c: static int write_bundle_refs(int bundle_fd, struct rev_info *revs) { int i; int ref_count = 0; -+ struct strset objects; -+ -+ strset_init(&objects); ++ struct strset objects = STRSET_INIT; for (i = 0; i < revs->pending.nr; i++) { struct object_array_entry *e = revs->pending.objects + i; --- base-commit: 683c54c999c301c2cd6f715c411407c413b1d84e change-id: 20250322-488-generating-bundles-with-many-references-has-non-linear-performance-64aec8e0cf1d Thanks - Karthik