Raptor: Performance Tools for Gaia

This article presents Raptor: a CLI tool for measuring performance specifically on Firefox OS. It looks at the strategy behind the tool's functionality, shows you how to get started with the tool, and moves to some advanced topics such as writing your own tests, visualization, and automation.

While the tooling presented in this article is still functional for performance testing Gaia-based devices, any associated automation and related external web sites are no longer maintained or have been decommissioned; they remain documented here for historical purposes. The automation documentation also remains in case anyone wants to stand it up on a self-hosted basis.

Raptor aims to overcome many of the pitfalls faced when testing performance with the previous tool, make test-perf:

The test-perf tool relies on Marionette.js to listen for events that each application would emit at key points in their loading lifecycles. This requires an atom script to be injected into every application to bind event listeners for these events. Every time new pseudo-standard events need to be captured, the script has to be modified. This means a lot of maintenance time, on top of the overhead of using Marionette.js itself.
The API for creating performance events is not consistent. In order to make capturing our standard performance events simpler, this is done in test-perf by dispatching a custom event, e.g. window.dispatchEvent(new CustomEvent('moz-app-visually-complete')). Unfortunately if an application wants to emit its own performance event, it has to use a different API from a performance testing helper script.
Every application has to include a performance testing helper script. While this script is necessary for providing access to the API for emitting performance events, it also has its own associated overhead and maintenance.
The test-perf tool is suited to gathering performance metrics for core Gaia applications, but is difficult to extend to handling much outside of that. Applications like Homescreen, System, or other types of interactions outside of application launch are very difficult to test within the confines of the framework.

Raptor is designed to solve these problems, providing a more efficient and extensible performance testing framework that adds far less overhead of its own.

Strategies

This section discusses the strategy undertaken in implementing Raptor's functionality.

User Timing

The User Timing API provides web documents with a mechanism for indicating custom performance marks and measures. Using a standardized API lets applications avoid the need to include a helper script for emitting performance events. In fact, User Timing does not rely on events at all.

// Legacy performance events
window.dispatchEvent(new CustomEvent('moz-app-visually-complete'));
PerformanceTestingHelper.dispatch('settings-load-start');

// User Timing API
performance.mark('visuallyLoaded');

performance.mark('settingsStart');
performance.mark('settingsEnd');

performance.measure('settingsLoad', 'settingsStart', 'settingsEnd');

Logging

In order to capture performance entries in a manner that is decoupled from the application to avoid affecting performance, we opted to output performance metadata in a device's log stream, i.e. adb logcat. Raptor consumes this stream and parses performance entries from the log to gather metrics.

Phases and extensibility

Raptor introduces a concept called "phases", which lays a framework for testing interactions in a generic manner. Currently Raptor supports phases of cold launch, reboot, and B2G restart, with additional phases planned. A phase places the device in a known state before performance capturing starts, which makes writing the actual performance test logic simpler.

Device interaction

Raptor uses the Marionette.js client for familiar device interactions using a high-level API. The same Marionette.js client used for writing integration tests can be used to trigger device actions which contain performance measurements.
Raptor also interacts with devices using the FXOS Device Service. This service exposes a number of interactions via a RESTful interface, such as touch input, reading and writing logs, device restarting, and reading and writing files.

Getting Started

NOTE: While Raptor can be run on emulators, the results should not be relied on for performance comparisons. Desktop computers are far more powerful than devices, so emulator results are not comparable to the performance experienced by end users on real hardware and should not be used for time-based decision making.

Prerequisites

You must have a copy of Gaia v2.2+ available on your system, as well as Node.js v4.2+ installed.

Installing the Raptor CLI tool

Raptor has a CLI tool installable from npm. You can install it via:

$ npm install -g @mozilla/raptor

Once the installation is finished, you can invoke it from the command line via the raptor command:

$ raptor help

Alternate installations

If you aren't comfortable with the way npm installs global packages to your /usr or /usr/local directories, you have a couple of different options:

Change npm's default directory to another directory. Following the steps from npm, you can change where npm installs global packages, possibly by placing them into a special directory in your home folder.
Install Raptor into a local directory and reference it relatively. Example:

$ cd ~
$ mkdir raptor-cli && cd raptor-cli
$ npm install @mozilla/raptor

# Elsewhere
$ ~/raptor-cli/node_modules/@mozilla/raptor/raptor help

# Symlink or add to aliases to save on verbosity
$ cd ~
$ ln -s ~/raptor-cli/node_modules/@mozilla/raptor/raptor raptor

# Now you can use it elsewhere
$ raptor help

Installing the profile

In order to interact with the device in a predictable way, Raptor needs a few profile options and custom settings. The default make command for Raptor optimizes Gaia, disables FTU, enables User Timing to write to logcat, and resets Gaia.

make raptor

If you already have a profile on your device, at a bare minimum you need the following profile options/settings set in order to use Raptor for performance testing:

PERF_LOGGING=1, this sets dom.performance.enable_user_timing_logging in the profile to true.
NOFTU=1, this disables the First-time experience, which is only needed if you are dealing with a freshly-reset Gaia.
SCREEN_TIMEOUT=0, prevents the device from going to sleep and shutting off the screen.
NO_LOCKSCREEN=1, removes the lock screen for easy application launching from the homescreen.

Command-line interface

Raptor provides a bit of helpful information right through the command line:

$ raptor help

Commands:

    help [command]                 Provides help for a given command.
    version                        Outputs the raptor cli tool version
    test [options] <nameOrPath>    Run a performance test by name or path location
    query [options] <measurement>  Run a query against an InfluxDB data source; measurements: measure, memory, mtbf, power
    regression                     Pipe in an InfluxDB query result to search for performance regressions
    submit [options]               Submit piped in performance metrics to an InfluxDB database
    track [options]                Pipe in regression search results to track in an InfluxDB database
    bug [options]                  Pipe in a tracked regression result to automatically file bugs

The core command to execute is the test command, which also has some helpful information:

$ raptor help test

Usage: test [options] <nameOrPath>


  Run a performance test by name or path location

  Options:

    --help                              output usage information
    --runs [number]                     Number of times to run a test and aggregate results
    --app [origin]                      Specify the origin or gaiamobile.org prefix of an application to test
    --entryPoint [entrance]             Specify an application entrance point other than the default
    --homescreen [origin]               Specify the origin or gaiamobile.org prefix of an application that is the device homescreen
    --system [origin]                   Specify the origin or gaiamobile.org prefix or an application that is the system application
    --serial [serial]                   Target a specific device for testing
    --adbHost [host]                    Connect to a device on a remote host. Tip: use with --adbPort
    --adbPort [port]                    Port for connecting to a device on a remote host. Use with --adbHost
    --marionetteHost [host]             Connect to Marionette on a remote host. Tip: use with --marionettePort
    --marionettePort [port]             Port for connecting to Marionette on a remote host. Use with --marionetteHost
    --forwardPort [port]                Forward an adb port to the --marionettePort
    --metrics [filepath]                File location to store historical test metrics
    --output [mode]                     stdout output mode: console, json, quiet
    --timeout [milliseconds]            Time to wait between runs for success to occur
    --retries [number]                  Number of times to retry a test or run if a failure or timeout occurs
    --time [epochMilliseconds]          Override the start time and UID of the test
    --logcat [path]                     Write the output from logcat to a file
    --launchDelay [milliseconds]        Time to wait between subsequent application launches
    --memoryDelay [milliseconds]        Time to wait before capturing memory after application fully loaded
    --scriptTimeout [milliseconds]      Time to wait when running scripts via Marionette
    --connectionTimeout [milliseconds]  Marionette driver TCP connection timeout threshold

This should give us enough information to run our first performance test.

Running a performance test

Running a performance test consists of a few parts:

The raptor CLI command
A test to run, whether a named test or a path to a test
Any relevant test settings

For the most basic test, we can do a cold launch test against an application with a command like this:

$ raptor test coldlaunch --app clock

[Cold Launch: clock.gaiamobile.org] Preparing to start testing...
[Cold Launch: clock.gaiamobile.org] Priming application
[Cold Launch: clock.gaiamobile.org] Starting run 1
[Cold Launch: clock.gaiamobile.org] Run 1 complete
[Cold Launch: clock.gaiamobile.org] Results from clock.gaiamobile.org

| Metric                | Mean   | Median | Min    | Max    | StdDev | 95% Bound |
| --------------------- | ------ | ------ | ------ | ------ | ------ | --------- |
| navigationLoaded      | 355    | 355    | 355    | 355    | 0      | 355       |
| navigationInteractive | 425    | 425    | 425    | 425    | 0      | 425       |
| visuallyLoaded        | 496    | 496    | 496    | 496    | 0      | 496       |
| contentInteractive    | 497    | 497    | 497    | 497    | 0      | 497       |
| fullyLoaded           | 497    | 497    | 497    | 497    | 0      | 497       |
| uss                   | 16.195 | 16.195 | 16.195 | 16.195 | 0      | 16.195    |
| rss                   | 35.926 | 35.926 | 35.926 | 35.926 | 0      | 35.926    |
| pss                   | 20.688 | 20.688 | 20.688 | 20.688 | 0      | 20.688    |

[Cold Launch: clock.gaiamobile.org] Testing complete

During the cold launch test, you'll see B2G restart; the stated application will then launch once to prime it, and a second time to measure its performance. Looking at the log output above, you can see when each application run starts and stops. When a particular application has completed its testing, you will be given a table of metrics and testing will continue, if applicable. In the metrics table you'll see statistics for each performance entry captured during the lifespan of the test: mean (average), median, minimum value, maximum value, standard deviation, and 95% Upper Bound.

Note: One fun fact is that the table produced by Raptor is compatible with GitHub-flavored Markdown.

Note: Standard deviation and 95% Upper Bound need a collection of runs before they output statistically-useful data.
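To make the columns of the metrics table concrete, here is a minimal sketch of how such statistics could be derived from the values a single metric produced across several runs. It is not taken from Raptor's source: the summarize() helper is hypothetical, and it assumes a population standard deviation and a normal-approximation upper bound for the mean (mean + 1.96 * stdDev / sqrt(n)). Raptor's own formulas may differ.

```javascript
// Summarize the values one metric produced across several runs.
// Assumptions (not taken from Raptor's source): population standard
// deviation, and a normal-approximation upper bound for the mean.
function summarize(values) {
  if (values.length === 0) {
    throw new Error('summarize() needs at least one value');
  }

  const sorted = values.slice().sort(function(a, b) { return a - b; });
  const count = sorted.length;
  const mean = sorted.reduce(function(sum, v) { return sum + v; }, 0) / count;

  // Median: middle value, or the average of the two middle values.
  const middle = Math.floor(count / 2);
  const median = count % 2 === 0
    ? (sorted[middle - 1] + sorted[middle]) / 2
    : sorted[middle];

  const variance = sorted.reduce(function(sum, v) {
    return sum + Math.pow(v - mean, 2);
  }, 0) / count;
  const stdDev = Math.sqrt(variance);

  return {
    mean: mean,
    median: median,
    min: sorted[0],
    max: sorted[count - 1],
    stdDev: stdDev,
    upperBound95: mean + 1.96 * stdDev / Math.sqrt(count)
  };
}

// A single run has no spread, so every column reports the same value.
console.log(summarize([496]));

// Five hypothetical visuallyLoaded values, in milliseconds.
console.log(summarize([480, 496, 502, 510, 512]));
```

With only one run the standard deviation is 0 and the bound collapses onto the mean, which is why these two columns only become informative once several runs have been collected.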

All metrics relate to the name of the performance entry. The numbers gathered here are not just aggregations of the values produced by User Timing entries, so it's important to understand how these numbers are derived.

Metrics aggregation

While Raptor relies on the User Timing API to gather its metrics, it also makes some assumptions about measurements that are different to what's expected in the context of normal web pages. In a typical web page, a performance marker represents the High-Resolution time from the moment of navigationStart. The User Timing API still captures this data, but Raptor's calculations also include additional time depending on the type of test running. Let's compare the creation of a performance marker in the context of a typical web page versus a Firefox OS application being cold launched.

Typical web page

In any web page, Firefox OS application or not, creating a performance marker with the User Timing API is simple:

performance.mark('hello');

Now let's get the value back and inspect its contents:

performance.getEntriesByType('mark')[0];

// returns the following object
PerformanceMark { name: "hello", entryType: "mark", startTime: 5159.366323, duration: 0 }

Note the mark's startTime and duration. The startTime is nothing more than the high-resolution time elapsed since the time of performance.timing.navigationStart; in this case a little over 5,000 milliseconds. The duration is 0 because this represents a single point in time, which has no duration. The startTime simply states at what moment the marker was created. Inspecting the output of a performance marker is no different in Firefox OS.

A performance measure on the other hand does include a duration, because it is the delta between two performance markers:

performance.mark('hello');
performance.mark('goodbye');

performance.measure('greeting', 'hello', 'goodbye');

Again, let's inspect the performance entry:

performance.getEntriesByType('measure')[0];

// returns the following object
PerformanceMeasure { name: "greeting", entryType: "measure", startTime: 3528.523661, duration: 4183.291375999805 }

The duration is populated for performance measures, and in this example it took approximately 4.2 seconds to perform a greeting; going from hello to goodbye.

Raptor context

The difference comes in the calculations that Raptor will report. Raptor assumes that every marker generated is effectively a performance measure, whose duration is the time between the application being instructed to launch and the marker being generated. For cold launch, the homescreen application (gaia_grid specifically) creates a special performance marker when an application is launching:

performance.mark('appLaunch@' + appOrigin);

In Raptor, performance markers can be given an @-directive that overrides the context of the marker. If the homescreen had instead invoked performance.mark('appLaunch'), normally we'd assume it is in the application's context. With an @-directive however we can key the performance marker to be against a different application, in essence creating a performance marker for one application inside another. This would evaluate to something like:

performance.mark('appLaunch@clock.gaiamobile.org');

In this case the homescreen is generating a performance marker for the clock application denoting the time of appLaunch. Raptor will then calculate a delta between appLaunch and all performance markers to achieve a more accurate user-perceived time for a marker to be hit. By moving the moment of capture to earlier in the loading process, specifically as close to icon touch as possible, it makes the data between Raptor and camera-based measurements much more comparable.
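The calculation described above can be sketched as follows. This is an illustration only, not Raptor's implementation: the resolveContext() and launchDeltas() helpers, the entry shape (origin, name, epoch), the homescreen origin, and the timestamps are all hypothetical. The sketch assumes each entry carries an absolute timestamp in milliseconds, so that markers created in different applications can be compared.

```javascript
// Split a marker name into its name and the context it applies to.
// Without an @-directive, the marker belongs to the app that created it.
function resolveContext(markName, creatorOrigin) {
  const at = markName.indexOf('@');
  if (at === -1) {
    return { name: markName, context: creatorOrigin };
  }
  return { name: markName.slice(0, at), context: markName.slice(at + 1) };
}

// Hypothetical entry shape: `epoch` is an absolute timestamp in
// milliseconds, so marks from different applications are comparable.
function launchDeltas(entries, appOrigin) {
  const resolved = entries.map(function(entry) {
    const target = resolveContext(entry.name, entry.origin);
    return { name: target.name, context: target.context, epoch: entry.epoch };
  }).filter(function(entry) {
    return entry.context === appOrigin;
  });

  const launch = resolved.find(function(entry) {
    return entry.name === 'appLaunch';
  });
  if (!launch) {
    throw new Error('No appLaunch marker found for ' + appOrigin);
  }

  // Report every other marker as the time elapsed since appLaunch.
  const deltas = {};
  resolved.forEach(function(entry) {
    if (entry.name !== 'appLaunch') {
      deltas[entry.name] = entry.epoch - launch.epoch;
    }
  });
  return deltas;
}

const entries = [
  { origin: 'homescreen.example', name: 'appLaunch@clock.gaiamobile.org', epoch: 1000 },
  { origin: 'clock.gaiamobile.org', name: 'navigationLoaded', epoch: 1355 },
  { origin: 'clock.gaiamobile.org', name: 'visuallyLoaded', epoch: 1496 }
];

console.log(launchDeltas(entries, 'clock.gaiamobile.org'));
// → { navigationLoaded: 355, visuallyLoaded: 496 }
```

The appLaunch marker is created by the homescreen, but the @-directive assigns it to the clock application, so it becomes the reference point for every marker the clock application creates afterwards.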

Choosing a test

Tests are selected by changing the name or file that Raptor executes. For example, to run the device reboot performance test instead of a cold launch test you'd do the following:

$ raptor test reboot

More examples:

# Test Dialer cold launch
$ raptor test coldlaunch --app communications --entryPoint dialer

# Change the number of runs
$ raptor test coldlaunch --app clock --runs 10

# Introduce a 1-second delay before capturing memory
$ raptor test reboot --memoryDelay 1000

# Target a particular device
$ raptor test reboot --serial f30eccef
$ ANDROID_SERIAL=f30eccef raptor test reboot

# Turn on Raptor debug output, useful for bugs or problems
$ DEBUG=raptor:* raptor test reboot

# JSON mode, useful for post-processing of aggregate values
$ raptor test coldlaunch --app clock --output json

# Quiet mode, useful if you only care about the results
$ raptor test coldlaunch --app clock --output quiet

Writing tests

While Raptor currently contains a few tests for running cold launch tests, rebooting, and restarting B2G, it is possible to write tests that run custom logic.

We can inspect the contents of the current launch test to glean how we can write new tests.

// mozilla-raptor/test
// tests/coldlaunch.js

setup(function(options) {
  options.test = 'cold-launch';
  options.phase = 'cold-launch';
});

afterEach(function(phase) {
  return phase.closeApp();
});

First comes setting up the test. In setup, pass a function to be executed, which will configure the test. This function will be passed all the current configuration settings. At a minimum, you will need to set the phase of the test, which determines the state the device is in when the test begins. Depending on which phase you select when setting options, you may need to pass additional information. For the launch test example, using the cold-launch phase requires an application to be specified. This can either be set on the command line, or you can hard-code it via the app option to force the test to be specific to a certain app.

Note: If you hard-code the application to be launched, make sure you specify the origin host completely, e.g. "clock.gaiamobile.org". For entry-point-based apps, specify both the app option and the entryPoint option.

Important: Any test harness functions doing asynchronous work should return a Promise so Raptor can properly wait.
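To illustrate why returning a Promise matters, here is a minimal sketch of a runner that waits for each hook before starting the next. The runHooks(), asyncHook(), and nextHook() functions are hypothetical and exist only for this example; they are not part of Raptor and do not reflect how Raptor schedules its hooks internally.

```javascript
// Hypothetical mini-runner, not Raptor's implementation: it waits for
// whatever each hook returns before moving on to the next hook.
function runHooks(hooks) {
  return hooks.reduce(function(chain, hook) {
    return chain.then(function() {
      return hook();
    });
  }, Promise.resolve());
}

const order = [];

function asyncHook() {
  // Returning the Promise lets the runner wait for the delayed work.
  return new Promise(function(resolve) {
    setTimeout(function() {
      order.push('async work finished');
      resolve();
    }, 20);
  });
}

function nextHook() {
  order.push('next hook started');
}

const finished = runHooks([asyncHook, nextHook]).then(function() {
  console.log(order);
  return order;
});
```

If asyncHook() created the Promise without returning it, the runner would have nothing to wait on and nextHook() would start before the delayed work had finished.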

The afterEach() function will be called once for each run after the phase has been started. For cold launch, it is after an application in context has been primed, exited, and re-opened, and the application denotes it is ready — i.e. performance.mark('fullyLoaded'). For reboot and B2G restart, the phase will be designated as ready when the System application and the Homescreen application are marked as fully loaded.

The phase argument passed to afterEach() represents the current context instance of the phase test runner; in other words, it is specific to the current test being run. It contains methods and functionality that help you trigger device actions which will have profiled performance code. For example, you can start a Marionette.js session and trigger commands:

setup(function(options) {
  options.phase = 'cold-launch';
});

afterEach(function(phase) {
  // Returning a Promise denotes that we are done running the test when it has resolved
  return phase.marionette
    .startSession()
    .then(function(client) {
      client.executeScript(function() {
        // trigger code that captures the performance.measures created
        // by the application being tested
      });
      client.deleteSession();
    });
});

The runner can also run a teardown() function when all tests are complete.

teardown(function(phase) {
  return new Promise(function(resolve) {
    // teardown the test, then resolve
    resolve();
  });
});

The Raptor Phase API has not yet been documented, so currently you'll need to read the source to see all the functionality available to you. It may be faster to ask a contributor for help getting started with writing a particular test.

Pre-defining parameters

Constantly specifying the parameters for commands which change infrequently can be cumbersome. Fortunately Raptor supports defining command-line parameters through directory-specific .raptorrc files. Raptor will search for .raptorrc files in the current working directory from which a test is being run, and will walk upward until it reaches your home/user directory. This means you can have preset command parameters which differ based on the directory where the test is run from, i.e. different .raptorrc files per directory.

A .raptorrc file can be YAML or JSON, and each top-level key corresponds to a command for which to preset parameters:

{
  "test": {
    "runs": 30,
    "app": "communications",
    "entryPoint": "contacts",
    "logcat": "logcat.txt"
  },
  "submit": {
    "host": "localhost",
    "database": "raptor",
    ...
  },
  "query": {
    ...
  }
}
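The upward search for .raptorrc files described above can be pictured with the following sketch. The candidateRcPaths() helper is hypothetical and is not Raptor's code; it only lists the locations that such a search would visit, nearest directory first. It assumes the walk stops at the home directory when the working directory is inside it, and at the filesystem root otherwise. How Raptor combines values when several files are found is not covered by this article, so the sketch does not attempt it.

```javascript
const path = require('path');

// List the .raptorrc locations to check, nearest directory first.
// Assumption: the walk stops at `home` when `cwd` is inside it, and at
// the filesystem root otherwise.
function candidateRcPaths(cwd, home) {
  const candidates = [];
  const stopAt = path.resolve(home);
  let dir = path.resolve(cwd);

  while (true) {
    candidates.push(path.join(dir, '.raptorrc'));
    const parent = path.dirname(dir);
    // Stop at the home directory, or at the root (which is its own parent).
    if (dir === stopAt || parent === dir) {
      break;
    }
    dir = parent;
  }
  return candidates;
}

// Hypothetical directories, for illustration only.
console.log(candidateRcPaths('/home/user/gaia/apps', '/home/user'));
```

Running a test from /home/user/gaia/apps would therefore consider a .raptorrc in that directory, then in /home/user/gaia, and finally in /home/user.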

Visualization and automation

Raptor has improved tooling available for automation and visualization. The test-perf tool used Datazilla for graphing and visualizing results, to give insight into possible regressions and the performance pulse of applications. For maintenance and usability reasons, Raptor has moved away from Datazilla for its visualization capabilities and instead has its own UI at https://raptor.mozilla.org. The Raptor dashboards currently categorize performance metrics in a few key categories per device instance (measures and memory), with more metrics planned in the future.

Raptor's front-end uses the Grafana visualization tool, and its backing store is InfluxDB, a time series database. Grafana provides Raptor UI users with the ability to carry out custom drill-downs into charts, slice time as desired, view data point revisions, and build custom charts and data queries. The default view of several charts displays the 95% Upper Bound of many metrics, but charts can be user-edited to graph other mathematical functions.

This guide is not meant to be a tutorial on the usage of Grafana and InfluxDB, so to learn more about taking full advantage of the Raptor dashboards, read through the Grafana and InfluxDB documentation.

Automation in Production

Raptor's production automation runs in the QA Jenkins environment. When new TaskCluster builds arrive for the devices we test against, Raptor performs a number of automated tests and reports the metrics to the Raptor dashboards. Each job goes through a number of setup procedures before actually running the performance tests:

Full flash the related TaskCluster build. These are currently based on b2g-inbound.
Install the Raptor CLI tool
Create a Python virtualenv, used for installing reference workloads
Install the Raptor profile with make raptor
Install the light reference workload
Tag the device with testing metadata, e.g. device flame-kk, memory 512, branch master
Inform Treeherder a performance test is in progress
Execute a test, report the metrics to InfluxDB
Report test to Treeherder
Repeat Treeherder reports and test executions if there are more applications to test
Archive test assets and mark the build as a success if all tests completed without error

Detecting Regressions

Raptor's automation uses the same CLI commands to detect regressions that can be used locally. The workflow for detection is as follows:

For a given context (e.g. Contacts, Video, etc.), query InfluxDB for the previous 14 days using the query command.
Pipe the output from querying to the regression command. This will return a JSON array of regressions.
Pipe the output from regression to the track command. This will create InfluxDB entries for regressions which were previously unknown and return a JSON array of new regressions.
Pipe the output from track to the bug command. This will file bugs in the proper components in Bugzilla and CC relevant stakeholders. Returns a JSON array of bug numbers created.

The code handling the automation of this flow is at https://github.com/mozilla-raptor/prey/blob/master/prey.sh.

Sheriffing Regressions

When a regression is detected and Raptor files a bug, the goal is for its resolution to follow a sheriffing flow similar to the one the Desktop performance sheriffs use. This means that upon filing a bug, there should be a resolution in place within 48-72 hours:

Back out the offending patch
Have a follow-up patch in review
Determine acceptance of the regression

Bugs that do not receive attention within the resolution window are subject to immediate backout.

Private visualization

The Raptor dashboard visualization discussed in the previous section can also be installed and used privately. The installation is a Heroku-deployable environment for easy setup. It is also possible to run the Heroku application locally if you use Linux.

To get started with private visualization, or to learn more about its innards, see the repository: https://github.com/mozilla-raptor/dashboards.

You will also need an installation of InfluxDB 0.9.3+. You can learn more about installing it at: https://influxdb.com/docs/v0.9/introduction/installation.html. Those who are familiar with Docker can also install InfluxDB from Docker Hub: https://hub.docker.com/r/tutum/influxdb/.

Raptor needs CLI options or environment variables for creating a connection to an InfluxDB database. It would be tedious to specify these continually on the command line, so to simplify this, you can add these values to a .raptorrc file.

{
  "submit": {
    "host": "localhost",
    "port": 8086,
    "database": "raptor",
    "username": "root",
    "password": "root",
    "protocol": "https"
  }
}

In addition, Raptor's database schema requires results to be tagged properly for them to be displayed in the correct categories in the dashboard UI. If these properties are not set when running performance tests, the data will not be displayed. By default, you need to persist the memory configuration of the device, the device type, and the branch the performance test is based on. For example, if you are performance testing a KitKat-based Flame set to 512MB of memory and your patch is based off of Gaia's master branch, you will set the following properties via ADB:

$ adb shell setprop persist.raptor.device flame-kk
$ adb shell setprop persist.raptor.memory 512
$ adb shell setprop persist.raptor.branch master

Note: If you are having trouble with the values being persisted or not saving at all, restart ADB as root with adb root.

If you were working on a branch that was based off of v2.5 on an Aries with 2 Gigabytes of memory, you would use the following properties:

$ adb shell setprop persist.raptor.device aries
$ adb shell setprop persist.raptor.memory 2048
$ adb shell setprop persist.raptor.branch v2.5

Important: Currently visualization is highly dependent on the existence of these persisted properties. They are only necessary when using the local visualization tooling. If you flash your device or otherwise unset these properties, you will need to set them again in order to visualize performance metrics.

Other than setting up the environment and device tags, Raptor can be run as normal locally. Upon each successful run, Raptor will report its metrics to the database. Once the test is complete, you can open a browser to your private visualization instance and view your own custom performance data.

Adding performance marks dynamically when needed

One issue with Raptor is that since the tests require us to add performance marks into code, the Gaia codebase could quickly become littered with performance.mark() calls without any meaningful relationship between them, making the code cluttered and harder to understand. The best way to deal with this is to collect all the marks into some kind of patching files, and apply them dynamically as required when we want to run specific Raptor tests.

To this end, Greg Weng has created a code transformer tool that will do just what is described above. The tool is currently a work in progress, but you can find more about it (including how to get it running) at this newsgroup entry: Raptor: code transformer + marionette workflow now is almost ready. See also bug 1181069 for implementation specifics.

We will publish more formal instructions once the tool has stabilised.

Support

If you have questions about Raptor, visualization, or performance tooling in general, feel free to ping :Eli or :rwood in the #raptor channel on Mozilla IRC.