Ceph is a large and complex distributed system, so requires extensive test coverage against many different configurations. teuthology, "The Ceph integration test framework", was built for this and remains an integral part of our qa process: every major/minor release goes through several teuthology suites for validation, and most pull requests go through at least one suite before merging.

given this use of teuthology, i'd like to question the role of our github checks (make check, ceph api tests, etc.) in the qa process. over the years, we've continued to increase the amount of test coverage in these checks. and while more test coverage is a good thing, i don't think this approach is sustainable. checks often take several hours before completing, so no longer give timely feedback on changes. and because of failure rates, that feedback is often not accurate and may require several reruns. this is a waste of machine and developer time.

almost all of our github checks are "Required", which prevents pull requests from being merged unless everything is green. required checks are meant to prevent the merging of disruptive changes - for example, breaking the build is extremely disruptive to the whole project. however, the checks themselves can also be disruptive when they fail randomly. because of this, i think we should have a high bar for adding new required checks or extending existing ones.

it's my opinion that we're just doing way too much in github checks that could be moved to the relevant teuthology suites instead. as long as pull requests end up going through teuthology, we don't need github checks to provide extensive test coverage. i would far prefer we optimize these checks to minimize disruption.

if we're going to make improvements here, i think we as a group need to agree on:

* what role should github checks play in the qa process?
* what criteria should we use when deciding what belongs in github checks vs teuthology?

please share your opinions here!
what follows is my own:

i would propose that we move all integration tests (especially anything that needs a ceph cluster) and "long-running" unit tests to teuthology instead. if the tests run in teuthology, we don't need to run/rerun them every time the pull request is updated. this policy could get our checks down to 30-45 minutes and significantly reduce the load on our jenkins machines - all without sacrificing our ability to ensure the quality of our releases.

and while this would take a concerted effort, it's something we could make incremental progress on. we could start by focusing on the outliers by runtime, which John Mulligan recently captured over a handful of 'make check' results:

    select name,avg(duration) as dur from stats group by name order by dur DESC limit 15;

    unittest_transaction_manager|4133.85
    mgr_dashboard_frontend_unittests|2550.8
    unittest_omap_manager|2471.05
    run_rbd_unit_tests_61.sh|2306.3
    unittest_object_data_handler|2213.95
    readable.sh|2081.1
    run_rbd_unit_tests_1.sh|2059.55
    unittest_seastore|1679.0
    unittest_bluefs|1217.15
    run_tox_mgr|1008.7
    unittest_bufferlist|992.9
    smoke.sh|922.65
    run_rbd_unit_tests_127.sh|906.4
    run_rbd_unit_tests_0.sh|817.25
    run_rbd_unit_tests_N.sh|699.9
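
for anyone who wants to reproduce this kind of ranking on their own 'make check' results, here is a minimal sketch of the same query run through python's sqlite3 module. it assumes a table named `stats` with one row per test per run and only the two columns the query implies (`name`, `duration`); the real schema used to collect the numbers above isn't shown in this mail, and the rows below are placeholders, not measurements.

```python
import sqlite3

# Hypothetical schema: only `name` and `duration` are implied by the query
# in the post; the actual table layout is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stats (name TEXT, duration REAL)")

# Placeholder rows (NOT real measurements): one row per test per run.
sample_rows = [
    ("test_slow", 100.0),
    ("test_slow", 140.0),
    ("test_medium", 60.0),
    ("test_medium", 40.0),
    ("test_fast", 5.0),
]
conn.executemany("INSERT INTO stats VALUES (?, ?)", sample_rows)

# Same query as in the post: average duration per test, slowest first.
query = """
    SELECT name, avg(duration) AS dur
    FROM stats
    GROUP BY name
    ORDER BY dur DESC
    LIMIT 15
"""
slowest = conn.execute(query).fetchall()

# Print in the same pipe-separated form as the results in the post.
for name, dur in slowest:
    print(f"{name}|{dur}")
```

the tests at the top of such a list would be the first candidates to look at when deciding what to move out of the checks.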