Re: what role should github checks play in Ceph qa?

Neal Gompa <ngompa13@xxxxxxxxx> · Wed, 18 Jun 2025 06:30:46 +0200

On Mon, Jun 16, 2025 at 6:09 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>
> On Wed, Jun 4, 2025 at 6:47 AM Casey Bodley <cbodley@xxxxxxxxxx> wrote:
> >
> > Ceph is a large and complex distributed system, so requires extensive
> > test coverage against many different configurations. teuthology, "The
> > Ceph integration test framework", was built for this and remains an
> > integral part of our qa process: every major/minor release goes
> > through several teuthology suites for validation, and most pull
> > requests go through at least one suite before merging
> >
> > given this use of teuthology, i'd like to question the role of our
> > github checks (make check, ceph api tests, etc) in the qa process.
> > over the years, we've continued to increase the amount of test
> > coverage in these checks. and while more test coverage is a good
> > thing, i don't think this approach is sustainable. checks often take
> > several hours before completing, so no longer give timely feedback on
> > changes. and because of failure rates, that feedback is often not
> > accurate and may require several reruns. this is a waste of machine
> > and developer time
> >
> > almost all of our github checks are "Required", which prevents pull
> > requests from being merged unless everything is green. required checks
> > are meant to prevent the merging of disruptive changes - for example,
> > breaking the build is extremely disruptive to the whole project.
> > however, the checks themselves can also be disruptive when they fail
> > randomly. because of this, i think we should have a high bar for
> > adding new required checks or extending existing ones
> >
> > it's my opinion that we're just doing way too much in github checks
> > that could be moved to the relevant teuthology suites instead. as long
> > as pull requests end up going through teuthology, we don't need github
> > checks to provide extensive test coverage. i would far prefer we
> > optimize these checks to minimize disruption
> >
> > if we're going to make improvements here, i think we as a group need
> > to agree on:
> > * what role should github checks play in the qa process?
> > * what criteria should we use when deciding what belongs in github
> > checks vs teuthology?
> >
> > please share your opinions here! what follows is my own:
> >
> > i would propose that we move all integration tests (especially
> > anything that needs a ceph cluster) and "long-running" unit tests to
> > teuthology instead. if the tests run in teuthology, we don't need to
> > run/rerun them every time the pull request is updated. this policy
> > could get our checks down to 30-45 minutes and significantly reduce
> > the load on our jenkins machines - all without sacrificing our ability
> > to ensure the quality of our releases
>
> I mostly agree with this. I will note we have some stuff that technically invokes vstart which I think is appropriate to leave in "make check" — namely, our API/command tests that validate commands run at all.
>
>
> >
> > and while this would take a concerted effort, it's something we could
> > make incremental progress on. we could start by focusing on the
> > outliers by runtime, which John Mulligan recently captured over a
> > handful of 'make check' results:
> >
> > select name,avg(duration) as dur from stats group by name order by dur
> > DESC limit 15;
> > unittest_transaction_manager|4133.85
> > mgr_dashboard_frontend_unittests|2550.8
> > unittest_omap_manager|2471.05
> > run_rbd_unit_tests_61.sh|2306.3
> > unittest_object_data_handler|2213.95
> > readable.sh|2081.1
> > run_rbd_unit_tests_1.sh|2059.55
> > unittest_seastore|1679.0
> > unittest_bluefs|1217.15
> > run_tox_mgr|1008.7
> > unittest_bufferlist|992.9
> > smoke.sh|922.65
> > run_rbd_unit_tests_127.sh|906.4
> > run_rbd_unit_tests_0.sh|817.25
> > run_rbd_unit_tests_N.sh|699.9
>
> Those are...wow, really long-running.
> I think any given test we include in "make check" should be measured in low single-digit seconds to be placed there.
> Github checks should be restricted to things that run quickly or are otherwise saving time against full teuthology suites, or in some way superior to using teuthology. We run "make check" because we want to know a PR isn't broken before including it in an integration test. We build the docs because we aren't doing that in teuthology, and it doesn't take long (?I think?) and is convenient. We check s-o-b statements because it's convenient and fast and there's not a different place in the pipeline where that makes sense.
>
> So there are 3 items that I presume are driving most of the concern:
> 1) long-running "make check" jobs. People can fight for individual tests if they want, but I think we start by generating a list of everything which takes >60 seconds and move it to a teuthology suite. We can pare it further if needed.
> 2) Windows builds, probably? We don't have any infrastructure for that in teuthology, so as long as we're not tossing them out of the project, this is probably the place they need to stay. It's also fairly appropriate as a build test and we want to know if we are breaking that build.
> 3) ceph-api tests. I think these were our first GitHub check and we put them there because these covered the various “ceph tell” commands of every component, the tests didn’t take long, and getting every component to evaluate those results on every PR was an impractical challenge.
>
> We need to figure out something for the api tests so we don’t start gratuitously breaking the API. I don’t have strong opinions on what that “something” is, though.

Something to consider is that GitHub checks will run on forks, whereas
the full teuthology suite will not. It's valuable to have stuff run in
GitHub checks as a sanity check before people submit PRs too.
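
For reference, the ">60 seconds" list suggested in the quoted text could be generated with a small variation on the quoted query. This is only a sketch under assumptions: the thread shows just the table name `stats` and the columns `name` and `duration`, so the schema below is a guess, the durations are assumed to be in seconds (no unit is stated), and the quoted averages are loaded as one row per test in place of the raw samples. The helper name `slow_tests` is hypothetical.

```python
import sqlite3

# Average runtimes copied from the 'make check' listing quoted above.
# Assumption: values are seconds; the thread does not state the unit.
AVERAGES = [
    ("unittest_transaction_manager", 4133.85),
    ("mgr_dashboard_frontend_unittests", 2550.8),
    ("unittest_omap_manager", 2471.05),
    ("run_rbd_unit_tests_61.sh", 2306.3),
    ("unittest_object_data_handler", 2213.95),
    ("readable.sh", 2081.1),
    ("run_rbd_unit_tests_1.sh", 2059.55),
    ("unittest_seastore", 1679.0),
    ("unittest_bluefs", 1217.15),
    ("run_tox_mgr", 1008.7),
    ("unittest_bufferlist", 992.9),
    ("smoke.sh", 922.65),
    ("run_rbd_unit_tests_127.sh", 906.4),
    ("run_rbd_unit_tests_0.sh", 817.25),
    ("run_rbd_unit_tests_N.sh", 699.9),
]


def slow_tests(rows, threshold=60.0):
    """Return (name, average duration) pairs above threshold, slowest first."""
    conn = sqlite3.connect(":memory:")
    # Assumed schema: only the names 'stats', 'name' and 'duration'
    # appear in the quoted query.
    conn.execute("CREATE TABLE stats (name TEXT, duration REAL)")
    conn.executemany("INSERT INTO stats VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT name, avg(duration) AS dur FROM stats "
        "GROUP BY name HAVING avg(duration) > ? ORDER BY dur DESC",
        (threshold,),
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for name, dur in slow_tests(AVERAGES):
        print(f"{name}|{dur}")
```

Every entry in the quoted top 15 is above that cutoff, so against real data the output would be the candidate list of tests to move to a teuthology suite.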

-- 
真実はいつも一つ！/ Always, there's only one truth!