Continuous Integration

When you push a commit to mozilla-central or a related repository, it initiates a large chain of builds and tests across multiple types of infrastructure. This document will help you understand all the pieces that comprise Mozilla's continuous integration systems.

Buildbot, TaskCluster and Treeherder

NOTE: We're transitioning from the Buildbot continuous integration system to TaskCluster.

Buildbot, Mozilla's primary continuous integration tool, and TaskCluster pick up changes pushed to Hg. Buildbot/TaskCluster generate binary builds for Firefox, Firefox for Android, and Firefox OS across a variety of operating systems. After the builds are completed, they are used to run a series of correctness and performance tests.

The results of Buildbot/TaskCluster jobs (both builds and tests) are displayed in Treeherder. A group of individuals known as sheriffs constantly monitors Treeherder, looking for broken builds and tests. The sheriffs' role is to "keep the tree green", in other words, to keep the code in our repositories in a good state, to the extent that the state is reflected in the output shown on Treeherder. When sheriffs see that a build or test has been broken, they are empowered to take one of several actions, including backing out the patch that caused the problem and closing the tree (i.e., preventing any additional commits).

Results in Treeherder are ordered by Mercurial pushes. Each Buildbot/TaskCluster job is represented by a colored label; green means a job has succeeded, while other colors represent different kinds of problems. The label text indicates the job type. For a full list of job types, see the Help menu in Treeherder's upper-right corner. Below is a list of the most common.

Builds

B - Normal build jobs; these jobs perform compilation and some compiled-code tests (e.g., 'make check').
Be - B2G build jobs for engineering builds; user builds are denoted with B, the same as for desktop and Android builds.
N and Ne - Nightly build jobs; these jobs are similar to B and Be jobs, but are triggered on a periodic basis instead of being triggered by a push to hg.
Hf - Static rooting hazard analysis
S - Static analysis
V - Valgrind build and test jobs; these jobs create valgrind-compatible builds and run a small set of valgrind tests on them.

Functional Tests

These jobs are scheduled after a build job has successfully produced a build and uploaded it to ftp.mozilla.org. They can sometimes run even when the build job is reported as failed, provided the failure occurred during 'make check'.

See the full list of tests at the Mozilla Automated Testing page.

Talos Performance Tests

All performance tests that run in Buildbot and are displayed in Treeherder use the Talos framework and are denoted by the letter T. These jobs are scheduled at the same time as the correctness jobs. Talos is used to execute several suites for desktop Firefox and Firefox for Android; these suites are denoted using lower-case letters, e.g., T(c d g1 o s tp).

For a list of tests, see the Mozilla Automated Testing page.

The Talos indicators in Treeherder appear green if the job successfully completed; to see the performance data generated by the jobs, click on the performance tab of the job details panel that pops up when you click on a job in Treeherder.

Each Talos suite contains a set of tests or pages, some of which in turn have sub-tests. Each test is executed multiple times to produce a number of data replicates. The Talos harness produces a single number per test (typically the median of all the replicates excluding the first 1-5); these numbers are stored in Treeherder's database and are accessible via the Perfherder interface.
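The summarization step described above can be illustrated with a short Python sketch. This is not the Talos harness code; the function name summarize_replicates, the skip_first parameter, and the sample timings are assumptions made up for illustration. It only shows the idea of discarding the first few (warm-up) replicates and taking the median of the rest.

```python
from statistics import median


def summarize_replicates(replicates, skip_first=1):
    """Return the median of the replicates after dropping the first ones.

    skip_first is a hypothetical parameter; the text above says the
    number of dropped replicates is typically between 1 and 5.
    """
    if skip_first < 0:
        raise ValueError("skip_first must be non-negative")
    kept = replicates[skip_first:]
    if not kept:
        raise ValueError("no replicates left after dropping the first ones")
    return median(kept)


# Made-up timings in milliseconds; the first run is slower (cold start).
replicates = [520.0, 410.0, 402.0, 398.0, 405.0, 400.0]
print(summarize_replicates(replicates, skip_first=1))  # prints 402.0
```

Using the median rather than the mean keeps a single unusually slow or fast replicate from dominating the reported number.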

Other Performance Systems

Most of the performance tests run at Mozilla happen outside of Buildbot. Below is a list of these.

Autophone (Android)

Autophone is a test harness which runs a set of performance tests on a variety of real Android phones. It reports to a custom dashboard known as phonedash. Tests currently run are primarily startup tests.

Games Benchmarking (Firefox)

The games benchmarking harness (aka mozbench) is under development; it will allow a number of games-related benchmarks to be run against Firefox and Chrome. Eventually, the system will likely be expanded with support for Android and Firefox OS.

Other Functional Test Systems

In contrast to performance tests, most functional tests run within the Buildbot system. However, a few run outside of it, and are listed below.

Gaia-ui-tests (Firefox OS)

Gaia-ui-tests are on-device UI tests of Gaia, running in Jenkins, across a range of device types and branches. These tests are Python Marionette-based. Test results are currently available only in Jenkins, and are monitored by QA. Eventually, these tests will report status to Treeherder as well. Gaia-ui-tests are also run in Buildbot against Firefox OS desktop builds, and the status of those is visible in Treeherder.

JSMarionette tests (Firefox OS)

The JSMarionette tests (aka gaia-integration tests) are based on the JavaScript Marionette client. They're run on the same Jenkins instance as the gaia-ui-tests. Like gaia-ui-tests, they are also run in Buildbot against B2G desktop builds. Gaia developers are the primary maintainers of this set of tests.

Post-Job Analysis and Alerts

There is some analysis of test data that occurs out-of-band after jobs complete.

Perfherder Alerts

We track changes in the results of Talos and other performance frameworks inside Perfherder, and try to automatically alert when there is a sustained change exceeding a certain magnitude (specified per test). Performance sheriffs review the list of alerts on a regular basis and file bugs if appropriate. You can view the current set of alerts on the Perfherder Alerts dashboard.
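To make the idea of "a sustained change exceeding a certain magnitude" concrete, here is a minimal Python sketch. It is not Perfherder's actual detection algorithm, which is not described here; the function names, the comparison of window means, and the 5% threshold are all assumptions chosen for illustration. The sketch compares the mean of several results before a push with the mean of several results after it, so that one noisy data point has less influence than it would in a point-to-point comparison.

```python
from statistics import mean


def percent_change(before, after):
    """Percent change from the mean of `before` to the mean of `after`."""
    base = mean(before)
    if base == 0:
        raise ValueError("baseline mean is zero; percent change is undefined")
    return (mean(after) - base) / base * 100.0


def should_alert(before, after, threshold_pct):
    """Hypothetical rule: alert when the change exceeds the test's threshold."""
    return abs(percent_change(before, after)) > threshold_pct


# Made-up results for one test, before and after a push.
before = [100.0, 101.0, 99.0]
after = [110.0, 111.0, 109.0]
print(should_alert(before, after, threshold_pct=5.0))  # prints True
```

In this sketch the threshold is passed in per call, mirroring the per-test magnitude mentioned above.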

OrangeFactor

After functional tests complete, a separate system collects test log data and combines it with Treeherder's failure classification data. The result is plotted on the OrangeFactor dashboard. The "Orange Factor" is the average number of intermittent test failures that occur per push, and the dashboard can be used to view the most frequent intermittent test failures, as well as to inspect historical trends.
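The definition of the Orange Factor given above can be expressed as a short Python sketch. This is not the OrangeFactor system's code; the record layout (push ID, failure ID), the function names, and the sample data are assumptions made up for illustration.

```python
from collections import Counter


def orange_factor(failure_records, push_count):
    """Average number of intermittent failures per push.

    failure_records is a list of (push_id, failure_id) pairs, one per
    intermittent failure; push_count is the total number of pushes in
    the period, including pushes with no failures.
    """
    if push_count <= 0:
        raise ValueError("push_count must be positive")
    return len(failure_records) / push_count


def most_frequent_failures(failure_records, limit=2):
    """Return the `limit` most frequent failure IDs with their counts."""
    counts = Counter(failure_id for _push_id, failure_id in failure_records)
    return counts.most_common(limit)


# Hypothetical data: four pushes, one of which (p3) had no failures.
records = [
    ("p1", "bug-A"), ("p1", "bug-B"),
    ("p2", "bug-A"),
    ("p4", "bug-A"), ("p4", "bug-C"), ("p4", "bug-B"),
]
print(orange_factor(records, push_count=4))  # prints 1.5
print(most_frequent_failures(records))  # prints [('bug-A', 3), ('bug-B', 2)]
```

Note that pushes without failures still count toward the denominator, which is why push_count is passed separately instead of being derived from the records.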