what role should github checks play in Ceph qa?

Casey Bodley <cbodley@xxxxxxxxxx> · Wed, 4 Jun 2025 09:46:54 -0400

Ceph is a large and complex distributed system, so it requires extensive
test coverage against many different configurations. teuthology, "The
Ceph integration test framework", was built for this and remains an
integral part of our qa process: every major/minor release goes
through several teuthology suites for validation, and most pull
requests go through at least one suite before merging

given this use of teuthology, i'd like to question the role of our
github checks (make check, ceph api tests, etc) in the qa process.
over the years, we've continued to increase the amount of test
coverage in these checks. and while more test coverage is a good
thing, i don't think this approach is sustainable. checks often take
several hours to complete, so they no longer give timely feedback on
changes. and because of their failure rates, that feedback is often not
accurate and may require several reruns. this is a waste of machine
and developer time

almost all of our github checks are "Required", which prevents pull
requests from being merged unless everything is green. required checks
are meant to prevent the merging of disruptive changes - for example,
breaking the build is extremely disruptive to the whole project.
however, the checks themselves can also be disruptive when they fail
randomly. because of this, i think we should have a high bar for
adding new required checks or extending existing ones

it's my opinion that we're just doing way too much in github checks
that could be moved to the relevant teuthology suites instead. as long
as pull requests end up going through teuthology, we don't need github
checks to provide extensive test coverage. i would far prefer we
optimize these checks to minimize disruption

if we're going to make improvements here, i think we as a group need
to agree on:
* what role should github checks play in the qa process?
* what criteria should we use when deciding what belongs in github
  checks vs teuthology?

please share your opinions here! what follows is my own:

i would propose that we move all integration tests (especially
anything that needs a ceph cluster) and "long-running" unit tests to
teuthology instead. if the tests run in teuthology, we don't need to
run/rerun them every time the pull request is updated. this policy
could get our checks down to 30-45 minutes and significantly reduce
the load on our jenkins machines - all without sacrificing our ability
to ensure the quality of our releases

and while this would take a concerted effort, it's something we could
make incremental progress on. we could start by focusing on the
outliers by runtime, which John Mulligan recently captured over a
handful of 'make check' results:

select name,avg(duration) as dur from stats group by name order by dur DESC limit 15;
unittest_transaction_manager|4133.85
mgr_dashboard_frontend_unittests|2550.8
unittest_omap_manager|2471.05
run_rbd_unit_tests_61.sh|2306.3
unittest_object_data_handler|2213.95
readable.sh|2081.1
run_rbd_unit_tests_1.sh|2059.55
unittest_seastore|1679.0
unittest_bluefs|1217.15
run_tox_mgr|1008.7
unittest_bufferlist|992.9
smoke.sh|922.65
run_rbd_unit_tests_127.sh|906.4
run_rbd_unit_tests_0.sh|817.25
run_rbd_unit_tests_N.sh|699.9
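
for anyone who wants to run the same kind of ranking on their own results, here is a minimal sketch in python. it reuses the query above against an in-memory sqlite database. the table layout (a `stats` table with `name` and `duration` columns) is inferred from the query, the duration unit is assumed to be seconds, and the sample rows, test names, and the `over_budget` helper are hypothetical and not taken from real 'make check' runs.

```python
import sqlite3

# Hypothetical sample rows (test name, duration). The unit is assumed to be
# seconds; these are made-up values, not real 'make check' measurements.
SAMPLE_RUNS = [
    ("unittest_fast", 12.0),
    ("unittest_fast", 18.0),
    ("unittest_slow", 4000.0),
    ("unittest_slow", 4200.0),
    ("integration_medium", 900.0),
]


def slowest_tests(rows, limit=15):
    """Return (name, average duration) pairs, slowest first."""
    conn = sqlite3.connect(":memory:")
    try:
        # Assumed schema, inferred from the query in the post.
        conn.execute("CREATE TABLE stats (name TEXT, duration REAL)")
        conn.executemany("INSERT INTO stats VALUES (?, ?)", rows)
        cur = conn.execute(
            "SELECT name, avg(duration) AS dur FROM stats "
            "GROUP BY name ORDER BY dur DESC LIMIT ?",
            (limit,),
        )
        return cur.fetchall()
    finally:
        conn.close()


def over_budget(averages, budget_seconds):
    """Hypothetical helper: names whose average exceeds a runtime budget."""
    return [name for name, dur in averages if dur > budget_seconds]


if __name__ == "__main__":
    ranked = slowest_tests(SAMPLE_RUNS)
    for name, dur in ranked:
        print(f"{name}|{dur}")
    # Candidates to consider moving out of the required checks.
    print(over_budget(ranked, budget_seconds=600))
```

the budget passed to `over_budget` is only a placeholder. picking a real threshold is part of the criteria question raised above.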