Re: Continuous Benchmarking

On Wed, Feb 05, 2025 at 03:14:21PM -0800, Emily Shaffer wrote:
> On Mon, Feb 3, 2025 at 1:55 AM Patrick Steinhardt <ps@xxxxxx> wrote:
> >
> > Hi,
> >
> > due to a couple of performance regressions that we have hit over the
> > last couple of Git releases at GitLab, we have started to set up an effort to
> > implement continuous benchmarking for the Git project. The intent is to
> > have regular (daily) benchmarking runs against Git's `master` and `next`
> > branches to be able to spot any performance regressions before they make
> > it into the next release.
> >
> > I have started with a relatively simple setup:
> >
> >   - I have started collecting benchmarks that I myself run regularly [1].
> >     These benchmarks are built on hyperfine and are thus not part of the
> >     Git repository itself.
> >
> >   - GitLab CI runs on a nightly basis, executing a subset of these
> >     benchmarks [2].
> >
> >   - Results are uploaded with a hyperfine adaptor to Bencher and are
> >     summarized in dashboards.
> >
> > This at least gives us some visibility into severe performance outliers,
> > whether these are improvements or regressions. Some statistics are
> > applied to this data to automatically generate alerts when things change
> > significantly.
> >
> > The setup is of course not perfect. It's built on top of CI jobs, which
> > by their very nature do not perform consistently. The scripts are hosted
> > outside of Git. And I'm the only one running this.
> 
> For the CI "noisy neighbors" problem at least, it could be an option
> to try to host in GCE (or some other compute that isn't shared). I
> asked around a little inside Google and it seems like it's possible,
> I'll keep pushing on it and see just how hard it would be. I'd even be
> happy to trade on-push runs with noisy neighbors for nightly runs with
> no neighbors, which makes it not really a CI thing - guess I will find
> out if that's easier or harder for us to implement. :)

That would be awesome.
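
In case it helps to judge how much work hosting this elsewhere would be:
the nightly job itself is conceptually simple. A rough sketch of a single
benchmark run (from memory, not the exact scripts; the Bencher project
slug and testbed name are placeholders, and the upload additionally needs
an API token via BENCHER_API_TOKEN):

    # Run one benchmark with hyperfine and export the raw timings.
    hyperfine \
        --warmup 3 \
        --export-json status.json \
        'git -C linux.git status'

    # Convert hyperfine's JSON into Bencher Metric Format. Bencher's
    # built-in "latency" measure expects nanoseconds, while hyperfine
    # reports seconds, hence the scaling.
    jq '[.results[] | {
            (.command): {
                latency: {
                    value:       (.mean * 1e9),
                    lower_value: (.min  * 1e9),
                    upper_value: (.max  * 1e9)
                }
            }
        }] | add' status.json >bmf.json

    # Upload the converted results.
    bencher run \
        --project git-benchmarks \
        --branch master \
        --testbed gitlab-ci \
        --adapter json \
        --file bmf.json

So the only moving parts are hyperfine, jq and the Bencher CLI; the rest
is deciding where the job runs.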

> > So I wonder whether there is a wider interest in the Git community to
> > have this infrastructure part of the Git project itself. This may
> > include steps like the following:
> >
> >   - Extending the performance tests we have in "t/perf" to cover more
> >     benchmarks.
> 
> Folks may be aware that our biggest (in terms of scale) internal
> customer at Google is the Android project. They are the ones who complain
> to me and my team the most about performance; they are also open to
> setting up a nightly performance regression test. Would it be appealing
> to get reports from such a test upstream? I think it's more compelling
> to our customer team if we run it against the closed-source Android
> repo, which means the Git project doesn't get to see as much about the
> shape and content of the repos the performance tests are running
> against, but we might be able to publish info about the shape without
> the contents. Would that be useful? What would it help to know (# of
> commits, size of largest object, distribution of object size, # of
> branches, size of worktree...?) If not having the specifics of the
> repo-under-test is a dealbreaker we could explore running performance
> tests in public with Android Open Source Project as the
> repo-under-test instead, but it's much more manageable than full
> Android.

The biggest question is whether such regression reports would be
actionable by the Git community. I have often found performance issues to
be very specific to the repository at hand, and reconstructing the exact
situation tends to be extremely tedious or completely infeasible. Way too
often customers come knocking at my door with a performance issue but
don't want to provide the underlying data. More often than not I end up
being unable to reproduce the issue, so I have to push back on such
reports.

Ideally, any report should be accompanied by a trivial reproducer that
any developer can execute on their local machine.
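
To me the most convenient shape for such a reproducer is a "t/perf"
script, because everybody already knows how to run those and they would
plug directly into whatever harness we end up with. A minimal sketch (the
benchmarked command is of course just an example):

    #!/bin/sh

    test_description='benchmark rev-list with all objects'

    . ./perf-lib.sh

    # Runs against GIT_PERF_LARGE_REPO if set, falling back to git.git.
    test_perf_large_repo

    test_perf 'rev-list --all --objects' '
        git rev-list --all --objects >/dev/null
    '

    test_done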

> Maybe in the long term it would be even better to have some toy
> repo-under-test, like "sample repo with massive object store", "sample
> repo with massive history", etc. to help us pinpoint which ways we're
> scaling well and which ways we aren't. But having a ready-made
> repo-under-test, and a team who's got a very large stake in Git
> performing well with it (so they can invest their time in setting up
> tests), might be a good enough place to start.

That would be great. I guess this wouldn't be a single repository, but a
set of repositories with different characteristics.
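
Fully synthetic repositories would also sidestep the data-sharing problem
entirely. For the "massive history" case, something as simple as a
fast-import stream already gets us quite far (a sketch; the commit count
is picked arbitrarily):

    # Generate a bare repository with a deep, linear history.
    git init --bare massive-history.git

    awk 'BEGIN {
        for (i = 1; i <= 1000000; i++) {
            print "commit refs/heads/master"
            print "committer C O Mitter <committer@example.com> " (1234567890 + i) " +0000"
            print "data <<EOF"
            print "commit " i
            print "EOF"
            print "M 644 inline file-" (i % 100)
            print "data <<EOF"
            print "content " i
            print "EOF"
        }
    }' | git -C massive-history.git fast-import

Similar generators could cover the other axes, like object count or
number of refs, so that each benchmark can state precisely which shape of
repository it exercises.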

> >   - Writing an adaptor that is able to upload the data generated from
> >     our perf scripts to Bencher.
> >
> >   - Setting up proper infrastructure to do the benchmarking. We may for
> >     now also continue to use GitLab CI, but as mentioned its runners are
> >     quite noisy overall. Dedicated servers would help here.
> >
> >   - Sending alerts to the Git mailing list.
> 
> Yeah, I'd love to see reports coming to the Git mailing list, or at least
> bad news reports (maybe we don't need "everything ran great!" every
> night, but would appreciate "last night the performance suite ran 50%
> slower than last-6-months average"). That seems the easiest to
> integrate with the way the project runs now, and I think we are used
> to list noise :)

Oh, totally, I certainly don't think there's any benefit in reporting
anything when there is no information. Right now there still are
semi-frequent outliers where an alert is generated only because of a
flake, not a real performance regression. But my hope would be that we
can address this issue once we address the noisy neighbour problem.
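
Part of it can probably also be handled on the alerting side, independent
of better hardware: only alert when a slowdown clears both a relative
margin and the measured noise of the run. A sketch of such a check on top
of hyperfine's JSON output, with a baseline kept from earlier runs (file
names and thresholds are placeholders):

    # Fail the job, and thereby trigger the alert, only when the slowdown
    # is both large and well outside the measured noise.
    current=$(jq '.results[0].mean' tonight.json)
    stddev=$(jq '.results[0].stddev' tonight.json)
    baseline=$(jq '.results[0].mean' baseline.json)

    awk -v cur="$current" -v base="$baseline" -v sd="$stddev" 'BEGIN {
        # Require a >10% relative slowdown whose delta also exceeds three
        # standard deviations before calling it a regression.
        exit !(cur > 1.10 * base && cur - base > 3 * sd)
    }' && {
        echo "regression: mean went from ${baseline}s to ${current}s"
        exit 1
    }

Bencher's own thresholds already do something along these lines, but
having the check also exist as a dumb shell script makes it trivial to
reproduce locally.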

> > I'm happy to hear your thoughts on this. Any ideas are welcome,
> > including "we're not interested at all". In that case, we'd simply
> > continue to maintain the setup ourselves at GitLab.
> 
> In general, though, yes! I am very interested! Google had trouble with
> performance regressions over the last three months or so, and I'd love
> to see the community notice them more. I think we generally have a
> sense during code review that performance matters, but we aren't always
> sure where it matters most, and a regular performance test that anybody
> can see the results of would help a lot.

Thanks for your input!

Patrick



