Firefox Data Documentation
This document will teach you how to use Firefox data to answer questions about how users interact with our products. The source for this documentation can be found in this repo.
Using this document
This documentation is divided into four main sections:
Getting Started
This section provides a quick introduction to analyzing telemetry data. After reading these articles, you will be able to confidently perform analysis over telemetry data.
Tools
Describes the tools we maintain to access and analyze product data.
Cookbooks & Tutorials
This section contains tutorials presented in a simple problem/solution format.
Data Collection and Datasets
Describes all available data we have from our products. For each dataset, we include a description of the dataset's purpose, what data is included, how the data is collected, and how you can change or augment the dataset. You do not need to read this section end-to-end.
Missing Documentation
We're writing documentation as fast as we can, but there's always going to be confusing or missing documentation. If you can't find what you need, please file a bug.
Reporting a problem
If you have a problem with data tools, datasets, or other pieces of infrastructure, please help us out by reporting it.
Most of our work is tracked in Bugzilla in the Data Platform and Tools product.
Bugs should be filed in the closest-matching component in the Data Platform and Tools product, but if there is no component for the item in question, please file an issue in the General component.
Components are triaged at least weekly by the component owner(s). For issues needing
urgent attention, it is recommended that you use the needinfo
flag to attract attention
from a specific person. If an issue doesn't receive the appropriate attention within a
week, you can send email to the fx-data-dev
mailing list or reach out on IRC
in #datapipeline
.
When a bug is triaged, it will be assigned a priority and points. Priorities have the following meanings:
P1
: in active development in the current sprintP2
: planned to be worked on in the current quarterP3
: planned to be worked on next quarterP4
and beyond: nice to have, we would accept a patch, but not actively being worked on.
Points reflect the amount of effort required for a bug and are assigned as follows:
- 1 point: one day or less of effort
- 2 points: two days of effort
- 3 points: three days to a week of effort
- 5 points or more: SO MUCH EFFORT, major project.
Problems with the data
There are bugzilla components for several of core datasets. described in this documentation, so if possible, please use a specific component.
If there is a problem with a dataset that does not have its own component, please file an issue in the Datasets: General component.
Problems with tools
There are bugzilla components for several of the tools that comprise the Data Platform, so please file a bug in the specific component that most closely matches the tool in question.
Operational bugs, such as services being unavailable, should be filed either in the component for the service itself or in the Operations component.
Other problems
When in doubt, please file issues in the General component.
Terminology
- Analyst: Someone performing analysis. This is more general than data scientist.
- Ping: A message sent from the Firefox browser to our telemetry servers containing information on browser state, user actions, etc... (more details)
- Dataset: A set of data, includes ping data, derived datasets, etc...
- Derived Dataset: A processed dataset, such as
main_summary
or thelongitudinal
dataset - Session: The time from when a Firefox browser starts until it shuts down
- Subsession:
Sessions
are split intosubsessions
when a 24 hour threshold is crossed or an environment change occurs (more details) - ...
Getting Started
This document is meant to be a complete guide to using Firefox Data, so it can look overwhelming at first. These readings will get you up and running quickly After these readings you should be able to produce simple analyses but you should definitely get your analyses reviewed.
This section is meant to introduce new analysts to our data. I consider a "new analyst" to be an employee who is interested in working with our data but doesn't have previous experience with our tools/data. They could be technical or non-technical: engineer, PM, or data scientist.
Getting Started with Firefox Data
Firefox clients out in the wild send us data as pings. Main pings contain some combination of environment data (e.g. operating system, hardware, Firefox version), measurements (e.g. max number of open tabs, time spent running in JavaScript garbage collection), and events. We have quite a few different pings, but most of our data for Firefox Desktop comes in from main pings.
Measurement Types
When we need to measure specific things about clients, we use probes. A single ping will send in many different probes. There are two types of probes that we are interested in here: Histograms and Scalars.
Histograms are bucketed counts. The Histograms.json
file has the definitions for all histograms, which includes the minimum, maximum, and number of buckets. Any recorded value instead just increments its associated bucket. We have four main types of histograms:
- Boolean - Only two buckets, associated with true and false.
- Enumerated - Integer buckets, where usually each bucket has a label.
- Linear - Buckets are divided evenly between the minimum and maximum; e.g. [1-2] is a bucket, and so is [100-101].
- Exponential - Larger valued buckets cover a larger range; e.g. [1-2] is a bucket, and so is [100-200].
To see some of these in action, take a look at the Histogram Simulator.
Scalars are simply a single value. The Scalars.yaml
file has the definitions for all scalars. These values can be integers, strings, or booleans.
TMO
The simplest way to start looking at probe data is to head over to telemetry.mozilla.org
or TMO for short.
From there, you will likely want either the Measurement Dashboard or the Evolution Dashboard. Using these dashboards you can compare a probe's value between populations, e.g. GC_MS
for 64 bit vs. 32 bit, and even track it across builds.
The Measurement Dashboard is a snapshot, aggregating all the data from all chosen dimensions. The Y axis is % of samples, and the X axis is the bucket. You can compare between dimensions, but it does not give you the ability to see how data is changing over time. To investigate that you must use the Evolution Dashboard.
The Evolution Dashboard shows how the data changes over time. Choose which statistics you'd like to plot over time, e.g. Median or 95th percentile. The X axis is time, and the Y axis is the value for whichever statistic you've chosen. This dashboard, for example, shows how GC_MS
is improving from nightly 53 to nightly 56! While the median is not changing much, the 95th percentile is trending down, indicating that long garbage collections are being shortened.
The X axis on the Evolution Dashboard shows either Build ID (a date), or Submission Date. The difference is that on any single date we might receive submissions from lots of builds, but aggregating by Build ID means we can be sure a change was because of a new build.
The second plot on the Evolution View is the number of pings we saw containing that probe (Metric Count).
TMO Caveats
- Data is aggregated on a per-ping basis, meaning these dashboards cannot be used to say something definitive about users. Please repeat that to yourself. A trend in the evolution view may be caused not by a change affecting lots of users, but a change affecting one single user who is now sending 50% of the pings we see. And yes, that does happen.
- Sometimes it looks like no data is there, but you think there should be. Check under advanced settings and check "Don't Sanitize" and "Don't Trim Buckets". If it's still not there, let us know in IRC on #telemetry.
- TMO Measurement Dashboards do not currently show release-channel data. Release-channel data ceased being aggregated as of Firefox 58. We're looking into ways of doing this correctly in the near future.
Where to go next
- Analysis using STMO
- Advanced analysis with ATMO
- Experimental data
- Adding probes, collecting more data
- Augmenting the derived datasets
Choosing a Dataset
This document will help you find the best data source for a given analysis.
This guide focuses on descriptive datasets and does not cover experimentation. For example, this guide will help if you need to answer questions like:
- How many users do we have in Germany, how many crashes we see per day?
- How many users have a given addon installed?
If you're interested in figuring out whether there's a causal link between two events take a look at our tools for experimentation.
Table of Contents
- Raw Pings
- Main Ping Derived Datasets
- Crash Ping Derived Datasets
- New-Profile Derived Datasets
- Update Derived Dataset
- Other Datasets
- Obsolete Datasets
- Appendix
Raw Pings
We receive data from our users via pings. There are several types of pings, each containing different measurements and sent for different purposes. To review a complete list of ping types and their schemata, see this section of the Mozilla Source Tree Docs.
Many pings are also described by a JSONSchema specification which can be found in this repository.
Background and Caveats
The large majority of analyses can be completed using only the main ping. This ping includes histograms, scalars, events, and other performance and diagnostic data.
Few analyses actually rely directly on the raw ping data. Instead, we provide derived datasets which are processed versions of these data, made to be:
- Easier and faster to query
- Organized to make the data easier to analyze
- Cleaned of erroneous or misleading data
Before analyzing raw ping data, check to make sure there isn't already a derived dataset made for your purpose. If you do need to work with raw ping data, be aware that loading the data can take a while. Try to limit the size of your data by controlling the date range, etc.
Accessing the Data
You can access raw ping data from an ATMO cluster using the Dataset API. Raw ping data are not available in re:dash.
Further Reading
You can find the complete ping documentation. To augment our data collection, see Collecting New Data and the Data Collection Policy.
Main Ping Derived Datasets
The main ping contains most of the measurements used to track performance and health of Firefox in the wild. This ping includes histograms, scalars, and events.
This section describes the derived datasets we provide to make analyzing this data easier.
longitudinal
The longitudinal
dataset is a 1% sample of main ping data
organized so that each row corresponds to a client_id
.
If you're not sure which dataset to use for your analysis,
this is probably what you want.
Contents
Each row in the longitudinal
dataset represents one client_id
,
which is approximately a user.
Each column represents a field from the main ping.
Most fields contain arrays of values, with one value for each ping associated with a client_id
.
Using arrays give you access to the raw data from each ping,
but can be difficult to work with from SQL.
Here's a query showing some sample data
to help illustrate.
Take a look at the longitudinal examples if you get stuck.
Background and Caveats
Think of the longitudinal table as wide and short.
The dataset contains more columns than main_summary
and down-samples to 1% of all clients to reduce query computation time and save resources.
In summary, the longitudinal table differs from main_summary
in two important ways:
- The longitudinal dataset groups all data so that one row represents a
client_id
- The longitudinal dataset samples to 1% of all
client_id
s
Please note that this dataset only contains release (or opt-out) histograms and scalars.
Accessing the Data
The longitudinal
is available in re:dash,
though it can be difficult to work with the array values in SQL.
Take a look at this example query.
The data is stored as a parquet table in S3 at the following address. See this cookbook to get started working with the data in Spark.
s3://telemetry-parquet/longitudinal/
main_summary
The main_summary
table is the most direct representation of a main ping
but can be difficult to work with due to its size.
Prefer the longitudinal
dataset unless using the sampled data is prohibitive.
Contents
The main_summary
table contains one row for each ping.
Each column represents one field from the main ping payload,
though only a subset of all main ping fields are included.
This dataset does not include histograms.
Background and Caveats
This table is massive, and due to its size, it can be difficult to work with.
You should avoid querying main_summary
from re:dash.
Your queries will be slow to complete and can impact performance for other users,
since re:dash on a shared cluster.
Instead, we recommend using the longitudinal
or clients_daily
dataset where possible.
If these datasets do not suffice, consider using Spark on an
ATMO cluster.
In the odd case where these queries are necessary,
make use of the sample_id
field and limit to a short submission date range.
Accessing the Data
The data is stored as a parquet table in S3 at the following address. See this cookbook to get started working with the data in Spark.
s3://telemetry-parquet/main_summary/v4/
Though not recommended main_summary
is accessible through re:dash.
Here's an example query.
Your queries will be slow to complete and can impact performance for other users,
since re:dash is on a shared cluster.
Further Reading
The technical documentation for main_summary
is located in the
telemetry-batch-view documentation.
The code responsible for generating this dataset is here
first_shutdown_summary
The first_shutdown_summary
table is a summary of the first-shutdown
ping.
Contents
The first shutdown ping contains first session usage data. The
dataset has rows similar to the
telemetry_new_profile_parquet
,
but in the shape of
main_summary
.
Background and Caveats
Ping latency was reduced through the shutdown ping-sender mechanism in Firefox 55. To maintain consistent historical behavior, the first main ping is not sent until the second start up. In Firefox 57, a separate first-shutdown ping was created to evaluate first-shutdown behavior while maintaining backwards compatibility.
In many cases, the first-shutdown ping is a duplicate of the main ping. The first-shutdown summary can be used in conjunction with the main summary by taking the union and deduplicating on the document_id
.
Accessing the Data
The data can be accessed as first_shutdown_summary
. It is currently stored in the following path.
s3://telemetry-parquet/first_shutdown_summary/v4/
The data is backfilled to 2017-09-22, the date of its first nightly appearance. This data should be available to all releases on and after Firefox 57.
client_count_daily
The client_count_daily
dataset is useful for estimating user counts over a few
pre-defined dimensions.
The client_count_daily
dataset is similar to the deprecated
client_count
dataset
except that is aggregated by submission date and not activity date.
Content
This dataset includes columns for a dozen factors and an HLL variable.
The hll
column contains a
HyperLogLog
variable, which is an approximation to the exact count.
The factor columns include submission date and the dimensions listed
here.
Each row represents one combinations of the factor columns.
Background and Caveats
It's important to understand that the hll
column is not a standard count.
The hll
variable avoids double-counting users when aggregating over multiple days.
The HyperLogLog variable is a far more efficient way to count distinct elements of a set,
but comes with some complexity.
To find the cardinality of an HLL use cardinality(cast(hll AS HLL))
.
To find the union of two HLL's over different dates, use merge(cast(hll AS HLL))
.
The Firefox ER Reporting Query
is a good example to review.
Finally, Roberto has a relevant write-up
here.
Accessing the Data
The data is available in Re:dash. Take a look at this example query.
I don't recommend accessing this data from ATMO.
Further Reading
churn
The churn dataset tracks the 7-day churn rate of telemetry profiles. This dataset is generally used for analyzing cohort churn across segments and time.
Content
Churn is the rate of attrition defined by (clients seen in week N)/(clients seen in week 0)
for groups of clients with some shared attributes. A group of clients with
shared attributes is called a cohort. The cohorts in this dataset are created
every week and can be tracked over time using the acquisition_date
and the
weeks since acquisition or current_week
.
The following example demonstrates the current logic for generating this dataset. Each column represents the days since some arbitrary starting date.
client | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 | 12 | 13 | 14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A | X | X | |||||||||||||
B | X | X | X | X | X | X | |||||||||
C | X | X |
All three clients are part of the same cohort. Client A is retained for weeks 0 and 1 since there is activity in both periods. A client only needs to show up once in the period to be counted as retained. Client B is acquired in week 0 and is active frequently but does not appear in following weeks. Client B is considered churned on week 1. However, a client that is churned can become retained again. Client C is considered churned on week 1 but retained on week 3.
The following table summarizes the above daily activity into the following view where every column represents the current week since acquisition date..
client | 0 | 1 | 2 |
---|---|---|---|
A | X | X | |
B | X | ||
C | X | X |
The clients are then grouped into cohorts by attributes. An attribute describes a property about the cohort such as the country of origin or the binary distribution channel. Each group also contains descriptive aggregates of engagement. Each metric describes the activity of a cohort such as size and overall usage at a given time instance.
Background and Caveats
- Each row in this dataset describes a unique segment of users
- The number of rows is exponential with the number of dimensions
- New fields should be added sparing to account for data-set size
- The dataset lags by 10 days in order account for submission latency
- This value was determined to be time for 99% of main pings to arrive at the
server. With the shutdown-ping sender, this has been reduced to 4 days.
However,
churn_v3
still tracks releases older than Firefox 55.
- This value was determined to be time for 99% of main pings to arrive at the
server. With the shutdown-ping sender, this has been reduced to 4 days.
However,
- The start of the period is fixed to Sundays. Once it has been aggregated, the
period cannot be shifted due to the way clients are counted.
- A supplementary 1-day
retention
dataset using HyperLogLog for client counts is available for counting over arbitrary retention periods and date offsets. Additionally, calculating churn or retention over specific cohorts is tractable in STMO withmain_summary
orclients_daily
datasets.
- A supplementary 1-day
Accessing the Data
churn
is available in Re:dash under Athena and Presto. The data is also
available in parquet for consumption by columnar data engines at
s3://telemetry-parquet/churn/v3
.
retention
The retention
table provides client counts relevant to client retention at a
1-day granularity. The project is tracked in Bug 1381840
Contents
The retention
table contains a set of attribute columns used to specify a
cohort of users and a set of metric columns to describe cohort activity. Each
row contains a permutation of attributes, an approximate set of clients in a
cohort, and the aggregate engagement metrics.
This table uses the HyperLogLog (HLL) sketch to create an approximate set of
clients in a cohort. HLL allows counting across overlapping cohorts in a single
pass while avoiding the problem of double counting. This data-structure has the
benefit of being compact and performant in the context of retention analysis,
at the expense of precision. For example, calculating a 7-day retention period
can be obtained by aggregating over a week of retention data using the union
operation. With SQL primitive, this requires a recalculation of COUNT DISTINCT
over client_id
's in the 7-day window.
Background and Caveats
- The data starts at 2017-03-06, the merge date where Nightly started to
track Firefox 55 in Mozilla-Central. However, there was
not a consistent view into the behavior of first session profiles until the
new_profile
ping. This means much of the data is inaccurate before 2017-06-26. - This dataset uses 4 day reporting latency to aggregate at least 99% of the data in a given submission date. This figure is derived from the telemetry-health measurements on submission latency, with the discussion in Bug 1407410. This latency metric was reduced Firefox 55 with the introduction of the shutdown ping-sender mechanism.
- Caution should be taken before adding new columns. Additional attribute columns will grow the number of rows exponentially.
- The number of HLL bits chosen for this dataset is 13. This means the default size of the HLL object is 2^13 bits or 1KiB. This maintains about a 1% error on average. See this table from Algebird's HLL implementation for more details.
Accessing the Data
The data is primarily available through Re:dash on STMO via the Presto source. This service has been configured to use predefined HLL functions.
The column should first be cast to the HLL type. The scalar
cardinality(<hll_column>)
function will approximate the number of unique
items per HLL object. The aggregate merge(<hll_column>)
function will perform
the set union between all objects in a column.
Example: Cast the count column into the appropriate type.
SELECT cast(hll as HLL) as n_profiles_hll FROM retention
Count the number of clients seen over all attribute combinations.
SELECT cardinality(cast(hll as HLL)) FROM retention
Group-by and aggregate client counts over different release channels.
SELECT channel, cardinality(merge(cast(hll AS HLL))
FROM retention
GROUP BY channel
The HyperLogLog library wrappers are available for use outside of the
configured STMO environment, spark-hyperloglog
and
presto-hyperloglog
.
Also see the client_count_daily
dataset.
clients_daily
The clients_daily
table is intended as the first stop for asking questions
about how people use Firefox. It should be easy to answer simple questions.
Each row in the table is a (client_id
, submission_date
) and contains a
number of aggregates about that day's activity.
Contents
Many questions about Firefox take the form "What did clients with
characteristics X, Y, and Z do during the period S to E?" The
clients_daily
table is aimed at answer those questions.
Accessing the Data
The data is stored as a parquet table in S3 at the following address.
s3://telemetry-parquet/clients_daily/v6/
The clients_daily
table is accessible through re:dash using the Athena
data source. It is also available via the Presto
data source, though
Athena
should be preferred for performance and stability reasons.
Here's an example query.
Crash Ping Derived Datasets
The crash ping
is captured after the main Firefox process crashes or after a content process crashes,
whether or not the crash report is submitted to crash-stats.mozilla.org
.
It includes non-identifying metadata about the crash.
This section describes the derived datasets we provide to make analyzing this data easier.
crash_aggregates
The crash_aggregates
dataset compiles crash statistics over various dimensions for each day.
Rows and Columns
There's one column for each of the stratifying dimensions and the crash statistics. Each row is a distinct set of dimensions, along with their associated crash stats. Example stratifying dimensions include channel and country, example statistics include usage hours and plugin crashes. See the complete documentation for all available dimensions and statistics.
Accessing the Data
This dataset is accessible via re:dash.
The data is stored as a parquet table in S3 at the following address. See this cookbook to get started working with the data in Spark.
s3://telemetry-parquet/crash_aggregates/v1/
Further Reading
The technical documentation for this dataset can be found in the telemetry-batch-view documentation
crash_summary
The crash_summary
table is the most direct representation of a crash ping.
Contents
The crash_summary
table contains one row for each crash ping.
Each column represents one field from the crash ping payload,
though only a subset of all crash ping fields are included.
Accessing the Data
The data is stored as a parquet table in S3 at the following address. See this cookbook to get started working with the data in Spark.
s3://telemetry-parquet/crash_summary/v1/
crash_summary
is accessible through re:dash.
Here's an example query.
Further Reading
The technical documentation for crash_summary
is located in the
telemetry-batch-view documentation.
The code responsible for generating this dataset is here
New-Profile Derived Datasets
The new-profile ping is sent from Firefox Desktop on the first session of a newly created profile and contains the initial information about the user environment.
This data is available in the telemetry_new_profile_parquet
dataset.
The telemetry_new_profile_parquet
table is the most direct representation of a new-profile ping.
Contents
The table contains one row for each ping. Each column represents one field from the new-profile ping payload, though only a subset of all fields are included.
Accessing the Data
The data is stored as a parquet table in S3 at the following address. See this cookbook to get started working with the data in Spark.
s3://net-mozaws-prod-us-west-2-pipeline-data/telemetry-new-profile-parquet/v2/
The telemetry_new_profile_parquet
is accessible through re:dash.
Here's an example query.
Further Reading
This dataset is generated automatically using direct to parquet. The configuration responsible for generating this dataset was introduced in bug 1360256.
Update Derived Dataset
The update ping
is sent from Firefox Desktop when a browser update is ready to be applied and after it was correctly applied.
It contains the build information and the update blob information, in addition to some information about the
user environment.
The telemetry_update_parquet
table is the most direct representation of an update ping.
Contents
The table contains one row for each ping. Each column represents one field from the update ping payload, though only a subset of all fields are included.
Accessing the Data
The data is stored as a parquet table in S3 at the following address. See this cookbook to get started working with the data in Spark.
s3://net-mozaws-prod-us-west-2-pipeline-data/telemetry-update-parquet/v1/
The telemetry_update_parquet
is accessible through re:dash.
Here's an example query.
Further Reading
This dataset is generated automatically using direct to parquet. The configuration responsible for generating this dataset was introduced in bug 1384861.
Other Datasets
Public crash statistics for Firefox are available through the Data Platform in a socorro_crash
dataset.
The crash data in Socorro is sanitized and made available to ATMO and STMO.
A nightly import job converts batches of JSON documents into a columnar format using the associated JSON Schema.
Contents
Accessing the Data
The dataset is available in parquet at s3://telemetry-parquet/socorro_crash/v2
.
It is also indexed with Athena and Presto with the table name socorro_crash
.
Obsolete Datasets
heavy_users
The heavy_users
table provides information about whether a given client_id
is
considered a "heavy user" on each day (using submission date).
Contents
The heavy_users
table contains one row per client-day, where day is
submission_date
. A client has a row for a specific submission_date
if
they were active at all in the 28 day window ending on that submission_date
.
A user is a "heavy user" as of day N if, for the 28 day period ending
on day N, the sum of their active_ticks
is in the 90th percentile (or
above) of all clients during that period. For more analysis on this,
and a discussion of new profiles, see
this link.
Background and Caveats
- Data starts at 20170801. There is technically data in the table before
this, but the
heavy_user
column isNULL
for those dates because it needed to bootstrap the first 28 day window. - Because it is top the 10% of clients for each 28 day period, more
than 10% of clients active on a given
submission_date
will be considered heavy users. If you join with another data source (main_summary
, for example), you may see a larger proportion of heavy users than expected. - Each day has a separate, but related, set of heavy users. Initial investigations show that approximately 97.5% of heavy users as of a certain day are still considered heavy users as of the next day.
- There is no "fixing" or weighting of new profiles - days before the
profile was created are counted as zero
active_ticks
. Analyses may need to use the includedprofile_creation_date
field to take this into account.
Accessing the Data
The data is available both via sql.t.m.o
and Spark.
In Spark:
spark.read.parquet("s3://telemetry-parquet/heavy_users/v1")
In SQL:
SELECT * FROM heavy_users LIMIT 3
Further Reading
The code responsible for generating this dataset is here
Appendix
Mobile Metrics
There are several tables owned by the mobile team documented here:
android_events
android_clients
android_addons
mobile_clients
Introduction
STMO is shorthand for
sql.telemetry.mozilla.org
, an installation
of the excellent Re:dash data analysis and dashboarding
tool that has been customized and configured for use with a number of the
Firefox organization's data sets. As the name and URL imply, effective use of
this tool requires familiarity with the
SQL query language, with which all of the
tool's data extraction and analysis are performed.
Concepts
There are three building block from which analyses in STMO are constructed: queries, visualizations, and dashboards.
Queries
STMO's basic unit of analysis is the query. A query is a block of SQL code that
extracts and (optionally) transforms data from a single data source. Queries
can vary widely in complexity. Some queries are trivial one liners
(e.g. SELECT * FROM tablename LIMIT 10
), while others are many pages long,
small programs in their own right.
The raw output from a query is tabular data, where each row is one set of return values for the query's output columns. A query can be run manually or it can be specified to have a refresh schedule, where it will execute automatically after a specified interval of time.
Visualizations
Tabular data is great, but rarely is a grid of values the best way to make sense of your data. Each query can be associated with multiple visualizations, each visualization rendering the extracted data in some more useful format. There are many visualization types, including charts (line, bar, area, pie, etc.), counters, maps, pivot tables, and more. Each visualization type provides a set of configuration parameters that allow you to specify how to map from the raw query output to the desired visualization. Some visualization types make demands of the query output; a map visualization requires each row to contain a longitude value and a latitude value, for instance.
Dashboards
A dashboard is a collection of visualizations, combined into a single visual presentation for convenient perusal. A dashboard is decoupled from any particular queries. While it is possible to include multiple visualizations from a single query in one dashboard, it is not required; users can add any visualizations they can access to the dashboards they create.
Data Sources
SQL provides the ability to extract and manipulate the data, but you won't get very far without having some familiarity with what data is actually available, and how that data is structured. Each query in STMO is associated with exactly one data source, and you have to know ahead of time which data source contains the information that you need. One of the most commonly used data sources is called Athena (referring to Amazon's Athena query service, on which it is built), which contains most of the data that is obtained from telemetry pings received from Firefox clients. The Athena source is slowly replacing the Presto data source. Presto contains all of the data that's exposed via Athena and more, but returns query results much more slowly.
Other available data sources include Crash DB, Tiles, Sync Stats, Push,
Test Pilot, ATMO, and even a Re:dash metadata which connects to STMO's
own Re:dash database. You can learn more about the available data sets and how
to find the one that's right for you on the Choosing a
dataset page. If you have data set
questions, or would like to know if specific data is or can be made available
in STMO, please inquire in the #datapipeline
or #datatools
channels on
irc.mozilla.org
.
Creating an Example Dashboard
The rest of this document will take you through the process of creating a simple dashboard using STMO.
Creating A Query
We start by creating a query. Our first query will count the number of client ids that we have coming from each country, for the top N countries. Clicking on the 'New Query' button near the top left of the site will bring you to the query editing page:
For this (and most queries where we're counting distinct client IDs) we'll want
to use the client_count_daily
data
set that is generated from
Firefox telemetry pings.
-
Check if the data set is in Athena
As mentioned above, Athena is faster than Presto, but not all data sets are yet available in Athena. We can check to see if the one we want is available in the by typing
client_count_daily
into the "Search schema..." search box above the schema browser interface to the left of the main query edit box. As of this writing, alas, there are no matches forclient_count_daily
, which means this data set is not available in Athena. -
Verify the data set exists in Presto
It's not in Athena, so now we should check to see if it's in Presto. If you click on the 'Data Source' drop-down and change the selection from 'Athena' to 'Presto' (with
client_count_daily
still populating the filter input), you should see that there is, in fact, aclient_count_daily
data set (showing up asdefault.client_count_daily
), as well as versionedclient_count_daily
data sets (showing up asdefault.client_count_daily_v<VERSION>
). -
Introspect the available columns
Clicking on the
default.client_count_daily
in the schema browser exposes the columns that are available in the data set. Two of the columns are of interest to us for this query:country
(for obvious reasons) andhll
.
The hll
column bears some explanation. hll
stands for
HyperLogLog, a sophisticated
algorithm that allows for the counting of extremely high numbers of items,
sacrificing a small amount of accuracy in exchange for using much less memory
than a simple counting structure. The client_count_daily
data set uses the hll
column for all of its counting functionality. Converting the hll
value back
to a regular numeric value requires the use of the following magic SQL
incantation:
cardinality(merge(cast(hll as HLL)))
So a query that extracts all of the unique country values and the count for each one, sorted from highest count to lowest count looks like this:
SELECT country,
cardinality(merge(cast(hll AS HLL))) AS client_count
FROM client_count_daily
GROUP BY 1
ORDER BY 2 DESC
If you type that into the main query edit box and then click on the "Execute" button, you should see a blue bar appear below the edit box containing the text "Executing query..." followed by a timer indicating how long the query has been running. After a reasonable (for some definition of "reasonable", usually about one to two minutes) amount of time the query should complete, resulting in a table showing a client count value for each country. Congratulations, you've just created and run your first STMO query!
Now would be a good time to click on the large "New Query" text near the top of
the page; it should turn into an edit box, allowing you to rename the
query. For this exercise, you should use a unique prefix (such as your name)
for your query name, so it will be easy to find your query later; I used
rmiller:Top Countries
.
Creating A Visualization
Now that the query is created, we'll want to provide a simple
visualization. The table with results from the first query execution should be
showing up underneath the query edit box. Next to the TABLE
heading should be
another heading entitled +NEW VISUALIZATION
.
Clicking on the +NEW VISUALIZATION
link should bring you to the
"Visualization Editor" screen, where you can specify a visualization name ("Top
Countries bar chart"), a chart type ("Bar"), an x-axis column (country
), and
a y-axis column (client_count
).:
After the GENERAL
settings have been specified, we'll want to tweak a few
more settings on the X AXIS
tab. You'll want to click on this tab and then
change the 'Scale' setting to 'Category', and un-check the 'Sort Values'
check-box to allow the query's sort order to take precedence:
A Note About Limits
Once you save the visualization settings and return to the query source page,
you should have a nice bar graph near the bottom of the page. You may notice,
however, that the graph has quite a long tail. Rather than seeing all of
the countries, it might be nicer to only see the top 20. We can do this by adding
a LIMIT
clause to the end of our query:
SELECT country, cardinality(merge(cast(hll AS HLL))) AS client_count FROM client_count_daily GROUP BY 1 ORDER BY 2 DESC LIMIT 20
If you edit the query to add a limit clause and again hit the 'Execute' button, you should get a new bar graph that only shows the 20 countries with the highest number of unique clients. In this case, the full data set has approximately 250 return values, and so limiting the result count improves readability. In other cases, however, an unlimited query might return thousands or even millions of rows. When those queries are run, readability is not the only problem; queries that return millions of rows can tax the system, failing to return the desired results, and negatively impacting the performance of all of the other users of the system. Thus, a very important warning:
ALL QUERIES SHOULD INCLUDE A "LIMIT" STATEMENT BY DEFAULT!
You should be in the habit of adding a "LIMIT 100" clause to the end of all new queries, to prevent your query from returning a gigantic data set that causes UI and performance problems. You may learn that the total result set is small enough that the limit is unnecessary, but unless you're certain that is the case specifying an explicit LIMIT helps prevent unnecessary issues.
Query Parameters
We got our chart under control by adding a "LIMIT 20" clause at the end. But what if we only want the top 10? Or maybe sometimes we want to see the top 30? We don't always know how many results our users will want. Is it possible to allow users to specify how many results they want to see?
As you've probably guessed, I wouldn't be asking that question if the answer wasn't "yes". STMO allows queries to accept user arguments by the use of double curly-braces around a variable name. So our query now becomes the following:
SELECT country, cardinality(merge(cast(hll AS HLL))) AS client_count FROM client_count_daily GROUP BY 1 ORDER BY 2 DESC LIMIT {{country_count}}
Once you replace the hard coded limit value with {{country_count}}
you should
see an input field show up directly above the bar chart. If you enter a numeric
value into this input box and click on 'Execute' the query will run with the
specified limit. Clicking on the 'Save' button will then save the query, using
the entered parameter value as the default.
Creating A Dashboard
Now we can create a dashboard to display our visualization. Do this by clicking on the 'Dashboards' drop-down near the top left of the page, and then clicking the 'New Dashboard' option. Choose a name for your dashboard, and you will be brought to a mostly empty page. Clicking on the '...' button near the top right of the page will give you the option to 'Add Widget'. This displays the following dialog:
The "Search a query by name" field is where you can enter in the unique prefix used in your query name to find the query you created. This will not yet work, however, because your query isn't published. Publishing a query makes it show up in searches and on summary pages. Since this is only an exercise, we won't want to leave our query published, but it must be published briefly in order to add it to our dashboard. You can publish your query by returning to the query source page and clicking the "Publish" button near the top right of the screen.
Once your query is published, it should show up in the search results when you
type your unique prefix into the "Search a query by name" field on the "Add
Widget" dialog. When you select your query, a new "Choose Visualization"
drop-down will appear, allowing you to choose which of the query's
visualizations to use. Choose the bar chart you created and then click on the
"Add to Dashboard" button. Voila! Now your dashboard should have a bar chart,
and you should be able to edit the country_count
value and click the refresh
button to change the number of countries that show up on the chart.
Completing the Dashboard
A dashboard with just one graph is a bit sad, so let's flesh it out by adding a handful of additional widgets. We're going to create three more queries, each with a very similar bar chart. The text of the queries will be provided here, but creating the queries and the visualizations and wiring them up to the dashboard will be left as an exercise to the user. The queries are as follows:
-
Top OSes (recommended
os_count
value == 6)SELECT os, cardinality(merge(cast(hll AS HLL))) AS client_count FROM client_count_daily GROUP BY 1 ORDER BY 2 DESC LIMIT {{os_count}}
-
Channel Counts
SELECT normalized_channel AS channel, cardinality(merge(cast(hll AS HLL))) AS client_count FROM client_count_daily GROUP BY 1 ORDER BY 2 DESC
-
App Version Counts (recommended
app_version_count value
== 20)SELECT app_name, app_version, cardinality(merge(cast(hll AS HLL))) AS client_count FROM client_count_daily GROUP BY 1, 2 ORDER BY 3 DESC LIMIT {{app_version_count}}
Creating bar charts for these queries and adding them to the original dashboard can result in a dashboard resembling this:
Some final notes to help you create your dashboards:
-
Don't forget that you'll need to publish each of your queries before you can add its visualizations to your dashboard.
-
Similarly, it's a good idea to un-publish any test queries after you've used them in a dashboard so as not to permanently pollute everyone's search results with your tests and experiments. Queries that are the result of actual work-related analysis should usually remain published, so others can see and learn from them.
-
The 'Firefox' label on the 'App Version counts' graph is related to the use of the 'Group by' visualization setting. I encourage you to experiment with the use of 'Group by' in your graphs to learn more about how this can be used.
-
This tutorial has only scratched the surface of the wide variety of very sophisticated visualizations supported by STMO. You can see a great many much more sophisticated queries and dashboards by browsing around and exploring the work that has been published by others.
-
The Re:dash help center is useful for further deep diving into Re:dash and all of its capabilities.
Prototyping Queries
Sometimes you want to start working on your query before the data is available. You can do this with most of the data sources by selecting a static test data set, then working with it as usual. You can also use this method to explore how a given SQL backend behaves.
Note that UNION ALL
will retain duplicate rows while UNION
will discard them.
Here are a couple of examples:
Simple three-column test dataset
WITH test AS (
SELECT 1 AS client_id, 'foo' AS v1, 'bar' AS v2 UNION ALL
SELECT 2 AS client_id, 'bla' AS v1, 'baz' AS v2 UNION ALL
SELECT 3 AS client_id, 'bla' AS v1, 'bar' AS v2 UNION ALL
SELECT 2 AS client_id, 'bla' AS v1, 'baz' AS v2 UNION ALL
SELECT 3 AS client_id, 'bla' AS v1, 'bar' AS v2
)
SELECT * FROM test
Convert a semantic version string to a sortable array field
WITH foo AS (
SELECT '1.0.1' AS v UNION
SELECT '1.10.3' AS v UNION
SELECT '1.0.2' AS v UNION
SELECT '1.1' AS v UNION
-- Doesn't work with these type of strings due to casting
-- SELECT '1.3a1' AS v UNION
SELECT '1.2.1' AS v
)
SELECT cast(split(v, '.') AS array<bigint>) FROM foo ORDER BY 1
How do boolean fields get parsed from strings?
WITH bar AS (
SELECT '1' AS b UNION
SELECT '0' UNION
SELECT 't' UNION
SELECT 'f' UNION
SELECT 'true' UNION
SELECT 'false' UNION
SELECT 'turkey'
)
SELECT b, try(cast(b AS boolean)) from bar
Analysis Gotchas
When performing analysis on any data there are some mistakes that are easy to make and details that are easy to overlook. Do you know exactly what question you hope to answer? Is your sample representative of your population? Is your result "real"? How precisely can you state your conclusion?
This document is not about those traps. Instead, it is about quirks and pitfalls specific to Telemetry.
Pseudo-replication
Telemetry data is a collection of pings. A single main-ping represents a single subsession. Some clients have more subsessions than others.
So when you say "63% of beta 53 has Firefox set as its default browser", make sure you specify it is 63% of pings, since it is only around 46% of clients. (Apparently users with Firefox Beta 53 set as their default browser submit more main-pings than users who don't).
Profiles vs Users
In the section above you'll notice I said "clients" not "users." That is because of all the things we're able to count, users isn't one of them.
Users can have multiple Firefox profiles running on the same computer at the same time (like developers).
Users can have the same Firefox profile running on several computers on different days of the week (also developers).
The only things we can count are pings and clients. Clients we can count because we send a client_id
with each ping that uniquely identifies the profile from which it came. This is generally close enough to our idea of "user" that we can get away with counting profiles and calling them users, but you may run into some instances where the distinction matters.
When in doubt, be precise. You're counting clients.
Opt-in vs Opt-out
We don't collect the same information from everyone.
Every profile that doesn't have Telemetry disabled sends us "opt-out" Telemetry. This includes:
- Nearly everything in the Environment
- Some very specific Histograms, Scalars, and Events that are marked
"releaseChannelCollection": "opt-out"
Most probes are "opt-in": we do not get information from them unless the user opts into sending us this information. Users can opt-in in two ways:
- Using Firefox's Options UI to tick the box that gives us permission
- Installing any pre-release version of Firefox
The nature of selection bias is such that the population in #1 is useless for analysis. If you want to encourage users to collect good information for us, ask them to install Beta: it's only slightly harder than finding and checking the opt-in check-box in Firefox.
Trusting Dates
Don't trust client times.
Any timestamp recorded by the user is subject to "clock skew." The user's clock can be set (purposefully or accidentally) to any time at all. The nature of SSL certificates tends to keep this within a certain relatively-accurate window, because a user who's clock is too far in the past or too far in the future might confuse certain expiration-date-checking code.
Examples of client times: crashDate
, crashTime
, meta/Date
, sessionStartDate
, subsessionStartDate
, profile/creationDate
Examples of server times you can trust: Timestamp
, submission_date
Note that submissionDate
does not appear in the
ping documentation
because it is added in post-processing.
It can be found in the meta
field of the ping as in the
Databricks Example.
Date Formats
Not all dates and times are created equal. Most of the dates and times in Telemetry pings are ISO 8601. Most are full timestamps, though their resolution may differ from being per-second to being per-day.
Then there's profile/creationDate
which is just a number of days since epoch. Like 17177
for the date 2017-01-11.
Tip: To convert profile/creationDate
to a usable date in SQL: DATE_ADD('day', profile_created, DATE '1970-01-01')
In derived datasets ISO dates are sometimes converted to strings in one of two formats: %Y-%m-%d
or %Y%m%d
.
The date formats for different rows in main_summary
are described on the main_summary
reference page.
Build ids look like dates but aren't. If you take the first eight characters you can use that as a proxy for the day the build was released.
metadata/Date
is an HTTP Date header in a RFC 7231-compatible format.
Tip: To parse metadata/Date
to a usable date in SQL: DATE_PARSE(SUBSTR(client_submission_date, 1, 25), '%a, %d %b %Y %H:%i:%s')
Delays
Telemetry data takes a while to get into our hands. The largest data mule in Telemetry is the main-ping. It is (pending bug 1336360) sent at the beginning of a client's next Firefox session. If the user shuts down their Firefox for the weekend, we won't get their Friday data until Monday morning.
A rule of thumb is data from two days ago is usually fairly representative.
If you'd like to read more about this subject and look at pretty graphs, there are a series of blog posts here, here and here.
Pingsender
Pingsender greatly reduces delay in sending pings to Mozilla, but only some types of pings are sent by Pingsender. Bug 1310703 introduced Pingsender for crash pings and was merged in Firefox 54, which hit release on June 13, 2017. Bug 1336360 moved shutdown pings to Pingsender and was merged in Firefox 55, which hit release on August 8, 2017. Bug 1374270 added sending health pings on shutdown via Pingsender and was merged in Firefox 56, which hit release on Sept 28, 2017. Other types of pings are not sent with Pingsender. This is usually okay because Firefox is expected to continue running long enough to send those pings.
Mobile clients do not have Pingsender, so they suffer delay as given in this query.
Submission Date
submission_date
is the server time at which a ping is received from the client. We use it to
partition many of our data sets.
In bug 1422892 we decided to standardize
on submission_date
.
TL;DR
- not subject to client clock skew
- doesn't require normalization
- good for backfill
- good for daily processing
- and usually good enough
Optimizing SQL Queries
After you write a query in STMO, you can make big steps to improve performance by understanding how data is stored, what databases are doing under the covers, and what you can change about your query to take advantage of those two pieces.
Note that this advice is most relevant for the Presto
, Athena
, and Presto-Search
data sources, as well as Spark SQL
and Spark notebooks in general.
TL;DR: What to do for quick improvements
- Switch to Athena
- Filter on a partitioned column† (even if you have a
LIMIT
) - Select the columns you want explicitly (Don't use
SELECT *
) - Use approximate algorithms: e.g.
approx_distinct(...)
instead ofCOUNT(DISTINCT ...)
† Partitioned columns can be identified in the Schema Explorer in re:dash.
They are the first few columns under a table name, and their name is preceded by a [P]
.
Some Explanations
There are a few key things to understand about our data storage and these databases to learn how to properly optimize queries.
What are these databases?
The databases we use are not traditional relational databases like PostgreSQL or MySQL. They are distributed SQL engines, where the data is stored separately from the cluster itself. They include multiple machines all working together in a coordinated fashion. This is why the clusters can get slow when there are lots of competing queries - because the queries are sharing resources.
Note that Athena is serverless, which is why we recommend people use that when they can.
How does this impact my queries?
What that means is that multiple machines will be working together to get the result of your query. Because there is more than one machine, we worry a lot about Data Shuffles: when all of the machines have to send data around to all of the other machines.
For example, consider the following query, which gives the number of rows present for each
client_id
:
SELECT client_id, COUNT(*)
FROM main_summary
GROUP BY client_id
The steps that would happen are this:
- Each machine reads a different piece of the data, and parses out the
client_id
for each row. Internally, it then computes the number of rows seen for eachclient_id
, but only for the data that it read. - Each machine is then given a set of
client_id
s to aggregate. For example, the first machine may be told to get the count ofclient1
. It will then have to ask every other machine for the total seen forclient1
. It can then aggregate the total. - Given that every
client_id
has now been aggregated, each machine reports to the coordinator theclient_id
s that it was responsible for, as well as the count of rows seen for each. The coordinator is responsible for returning the result of the query to the client, which in our example is STMO.
A similar process happens on data joins, where different machines are told to join on different keys. In that case, data from both tables needs to be shuffled to every machine.
Why do we have multiple databases? Why not use Athena for everything?
Great question! Presto is something we control, and can upgrade it at-will. Athena is currently a serverless version of Presto, and as such doesn't have all of the bells and whistles. For example, it doesn't support lambda expressions or UDFs, the latter of which we use in the Client Count Daily dataset.
Key Takeaways
- Use Athena, since it doesn't have the resource constraints that
Presto
orSpark
do. - Use
LIMIT
. At the end of a query all the data needs to be sent to a single machine, usingLIMIT
will reduce that amount and possible prevent an out of memory situation. - Use approximate algorithms. These mean less data needs to be shuffled, since we can use probabilistic data structures instead of the raw data itself.
- Specify large tables first in a
JOIN
operation. In this case, small tables can be sent to every machine, eliminating one data shuffle operation. Note that Spark supports abroadcast
command explicitly.
How is the data stored?
The data is stored in columnar format. Let's try and understand that with an example.
Traditional Row Stores
Consider a completely normal CSV file, which is actually an example of a row store.
name,age,height
"Ted",27,6.0
"Emmanuel",45,5.9
"Cadence",5,3.5
When this data is stored to disk, you could read an entire record in a consecutive order. For example if
the first "
was stored at block 1 on disk, then a sequential scan from 1 will give the first row of
data: "ted",27,6.0
. Keep scanning and you'll get \n"Emm
... and so on.
So for the above, the following query will be fast:
SELECT *
FROM people
WHERE name == 'Ted'
Since the database can just scan the first row of data. However, the following is more difficult:
SELECT name
FROM people
Now the database has to read all of the rows, and then pick out the name
column. This is a lot
more overhead!
Columnar Stores
Columnar turns the data sideways. For example, we can make a columnar version of the above data, and still store it in CSV:
name,"Ted","Emmanuel","Cadence"
age,27,45,5
height,6.0,5.9,3.5
Pretty easy! Now let's consider how we can query the data when it's stored this way.
SELECT *
FROM people
WHERE name == "ted"
This query is pretty hard! We have to read all of the data now, because the
(name, age, height)
isn't stored together.
Now let's consider our other query:
SELECT name
FROM people
Suddenly, this is easy! We don't have to check in as many places for data, we can just read the first few blocks of disks sequentially.
Data Partitions
We can improve performance even further by taking advantage of partitions. These are entire files of data
that share a value for a column. So for example, if everyone in the people
table lived in DE
, then we
could add that to the filename: /country=DE/people.csv
.
From there, our query engine would have to know how to read that path, and understand that it's telling us that all of those people share a country. So if we were to query for this:
SELECT *
FROM people
WHERE country == 'US'
The database wouldn't have to even read the file! It could just look at the path and realize there was nothing of interest.
Our tables are often partitioned by date, e.g. submission_date_s3
.
Key Takeaways
- Limit queries to a specific few columns you need, to reduce the amount of data that has to be read
- Filter on partitions to prune down the data you need
Getting Analysis Review
The current process for getting analysis review has not been optimized.
If you run into problems, feel free to ask in the #datapipeline
channel on IRC.
Table of Contents
Finding a Reviewer
There are two major types of review for most analyses: Data review and Methodology review.
A data reviewer has expert understanding of the dataset your analysis is based upon. During data review, your reviewer will try to identify any issues with your analysis that come from misapplying the data. For example, sampling issues or data collection oddities.
Methodology review primarily covers statistical concepts. For example, if you're training regressions, forecasting data, or doing complex statistical tests, you should probably get Methodology review. Many (most?) analyses at Mozilla use simple aggregations and little to no statistical inference. These analyses only require data review.
Accordingly, we suggest you first find a data reviewer for your analysis. Your data reviewer will tell you if you should get methodology review and will help you find a reviewer.
Data Reviewers
To get review for your analysis, contact one of the data reviewers listed for the dataset your analysis is based upon:
Dataset | Reviewers |
---|---|
Longitudinal | |
Cross Sectional | |
Main Summary | |
Crash Summary | |
Crash Aggregate | |
Events | |
Sync Summary | |
Addons | |
Client Count Daily |
Jupyter Notebooks
It's difficult to review Jupyter notebooks on Github. The notebook is stored as a JSON object, with python code stored in strings. This makes commenting on particular lines very difficult. Because the output is stored next to the code, it's also very difficult to review the diff for a given notebook.
RTMO
For simple notebooks, the best way to get review is through the knowledge repo. The knowledge repo renders your Jupyter notebooks to markdown while keeping the original source code next to the document. This gives your reviewer an easy-to-review markdown file. Your analysis will also be available at RTMO. This will aid discoverability and help others find your analysis. Note that RTMO is public so your analysis will also be public.
Start a Repository
Notebooks are great for presenting information, but are not a good medium for storing code. If your notebook contains complicated logic or a significant amount of custom code, you should consider moving most of the logic to a python package. Your reviewer may ask you for tests, which also requires moving the code out of a notebook.
Where to store the package
The data platform team maintains a python repository of analyses and ETL called
python_mozetl
.
Feel free to file a pull request against that repository.
Your reviewer will provide analysis review during the code review.
You'll need review for each commit to python_mozetl
's master branch.
If you are still prototyping your job but want to move out of a Jupyter notebook,
take a look at
cookiecutter-python-etl
.
This tool will help you configure a new python repository
so you can hack quickly without getting review.
nbconvert
The easiest way to get your code out of a Jupyter notebook
is to use nbconvert
command from Jupyter.
If you have a notebook called analysis.ipynb
you can dump your analysis to a python script using the following command:
jupyter nbconvert --to python analysis.ipynb
You'll need to clean up some formatting in the resulting analysis.py
file,
but most of the work has been done for you.
Collecting New Data
Guidelines
For information about what sorts of data may be collected, and for information on getting a data collection request reviewed, please read the Data Collection Guidelines.
Mechanics
The mechanics of how to instrument new data collection in Firefox are covered in Adding a new Telemetry probe.
For non-Telemetry data collection, we have a mechanism for streamlining ingestion of structured (JSON) data that utilizes the same underlying infrastructure. See this cookbook for details on using it.
Tools
This section describes tools we recommend using to analyze Firefox data.
Projects
Below are a number of trailheads that lead into the projects and code that comprise the Firefox Data Platform.
Telemetry APIs
Name and repo | Description |
---|---|
python_moztelemetry | Python APIs for Mozilla Telemetry |
moztelemetry | Scala APIs for Mozilla Telemetry |
spark-hyperloglog | Algebird's HyperLogLog support for Apache Spark |
ETL code and Datasets
Name and repo | Description |
---|---|
telemetry-batch-view | Scala ETL code for derived datasets |
python_mozetl | Python ETL code for derived datasets |
telemetry-airflow | Airflow configuration and DAGs for scheduled jobs |
python_mozaggregator | Aggregation job for telemetry.mozilla.org aggregates |
telemetry-streaming | Spark Streaming ETL jobs for Mozilla Telemetry |
See also firefox-data-docs
for documentation on datasets.
Infrastructure
Name and repo | Description |
---|---|
mozilla-pipeline-schemas | JSON and Parquet Schemas for Mozilla Telemetry and other structured data |
hindsight | Real-time data processing |
lua_sandbox | Generic sandbox for safe data analysis |
lua_sandbox_extensions | Modules and packages that extend the Lua sandbox |
nginx_moz_ingest | Nginx module for Telemetry data ingestion |
puppet-config | Cloud services puppet config for deploying infrastructure |
parquet2hive | Hive import statement generator for Parquet datasets |
EMR Bootstrap scripts
Name and repo | Description |
---|---|
emr-bootstrap-spark | AWS bootstrap scripts for Spark. |
emr-bootstrap-presto | AWS bootstrap scripts for Presto. |
Data applications
Name and repo | Description |
---|---|
telemetry.mozilla.org | Main entry point for viewing aggregate Telemetry data |
Cerberus & Medusa | Automatic alert system for telemetry aggregates |
analysis.t.m.o | Self serve data analysis platform |
Mission Control | Low latency dashboard for stability and health metrics |
Experiments Viewer | Visualization for Shield experiment results |
Re:dash | Mozilla's fork of the data query / visualization system |
TAAR | Telemetry-aware addon recommender |
Ensemble | A minimalist platform for publishing data |
Hardware Report | Firefox Hardware Report, available here |
python-zeppelin | Convert Zeppelin notebooks to Markdown |
St. Mocli | A command-line interface to STMO |
probe-scraper | Scrape and publish Telemetry probe data from Firefox |
test-tube | Compare data across branches in experiments |
experimenter | A web application for managing experiments |
St. Moab | Automatically generate Re:dash dashboard for A/B experiments |
Reference materials
Public
Name and repo | Description |
---|---|
firefox-data-docs | All the info you need to answer questions about Firefox users with data |
Firefox source docs | Mozilla Source Tree Docs - Telemetry section |
reports.t.m.o | Knowledge repository for public reports |
Non-public
Name and repo | Description |
---|---|
Fx-Data-Planning | Quarterly goals and internal documentation |
An overview of Mozilla’s Data Pipeline
This post describes the architecture of Mozilla’s data pipeline, which is used to collect Telemetry data from our users and logs from various services. One of the cool perks of working at Mozilla is that most of what we do is out in the open and because of that I can do more than just show you some diagram with arrows of our architecture; I can point you to the code, script & configuration that underlies it!
To make the examples concrete, the following description is centered around the collection of Telemetry data. The same tool-chain is used to collect, store and analyze data coming from disparate sources though, such as service logs.
style firefox fill:#f61 style elb fill:#777 style nginx fill:green style landfill fill:tomato style datalake fill:tomato style kafka fill:#aaa style cep fill:palegoldenrod style dwl fill:palegoldenrod style hsui fill:palegoldenrod style prestodb fill:cornflowerblue style redash fill:salmon style spark fill:darkorange style airflow fill:lawngreen style rdbms fill:cornflowerblue style tmo fill:lightgrey style cerberus fill:royalblue
Firefox
There are different APIs and formats to collect data in Firefox, all suiting different use cases:
- histograms – for recording multiple data points;
- scalars – for recording single values;
- timings – for measuring how long operations take;
- events – for recording time-stamped events.
These are commonly referred to as probes. Each probe must declare the collection policy it conforms to: either release or prerelease. When adding a new measurement data-reviewers carefully inspect the probe and eventually approve the requested collection policy:
- Release data is collected from all Firefox users.
- Prerelease data is collected from users on Firefox Nightly and Beta channels.
Users may choose to turn the data collection off in preferences.
A session begins when Firefox starts up and ends when it shuts down. As a session could be long-running and last weeks, it gets sliced into smaller logical units called subsessions. Each subsession generates a batch of data containing the current state of all probes collected so far, i.e. a main ping, which is sent to our servers. The main ping is just one of the many ping types we support. Developers can create their own ping types if needed.
Pings are submitted via an API that performs a HTTP POST request to our edge servers. If a ping fails to successfully submit (e.g. because of missing internet connection), Firefox will store the ping on disk and retry to send it until the maximum ping age is exceeded.
Kafka
HTTP submissions coming in from the wild hit a load balancer and then an NGINX module. The module accepts data via a HTTP request which it wraps in a Hindsight protobuf message and forwards to two places: a Kafka cluster and a short-lived S3 bucket (landfill) which acts as a fail-safe in case there is a processing error and/or data loss within the rest of the pipeline. The deployment scripts and configuration files of NGINX and Kafka live in a private repository.
The data from Kafka is read from the Complex Event Processors (CEP) and the Data Warehouse Loader (DWL), both of which use Hindsight.
Hindsight
Hindsight, an open source stream processing software system developed by Mozilla as Heka’s successor, is useful for a wide variety of different tasks, such as:
- converting data from one format to another;
- shipping data from one location to another;
- performing real time analysis, graphing, and anomaly detection.
Hindsight’s core is a lightweight data processing kernel written in C that controls a set of Lua plugins executed inside a sandbox.
The CEP are custom plugins that are created, configured and deployed from an UI which produce real-time plots like the number of pings matching a certain criteria. Mozilla employees can access the UI and create/deploy their own custom plugin in real-time without interfering with other plugins running.
The DWL is composed of a set of plugins that transform, convert & finally shovel pings into S3 for long term storage. In the specific case of Telemetry data, an input plugin reads pings from Kafka, pre-processes them and sends batches to S3, our data lake, for long term storage. The data is compressed and partitioned by a set of dimensions, like date and application.
The data has traditionally been serialized to Protobuf sequence files which contain some nasty “free-form” JSON fields. Hindsight gained recently the ability to dump data directly in Parquet form though.
The deployment scripts and configuration files of the CEP & DWL live in a private repository.
Spark
Once the data reaches our data lake on S3 it can be processed with Spark. We provide a portal (ATMO) that allows Mozilla employees to create their own Spark cluster pre-loaded with a set of libraries & tools, like Jupyter, NumPy, SciPy, Pandas etc., and an API to conveniently read data stored in Protobuf form on S3 in a Spark RDD using a ORM-like interface. Behind the scenes we use EMR to create Spark clusters, which are then monitored by ATMO.
ATMO is mainly used to write custom analyses; since our users aren’t necessary data engineers/scientists we chose Python as the main supported language to interface with Spark. From ATMO it’s also possible to schedule periodic notebook runs and inspect the results from a web UI.
As mentioned earlier, most of our data lake contains data serialized to Protobuf with free-form JSON fields. Needless to say, parsing JSON is terribly slow when ingesting Terabytes of data per day. A set of ETL jobs, written in Scala by Data Engineers and scheduled with Airflow, create Parquet views of our raw data. We have a Github repository telemetry-batch-view that showcases this.
Aggregates Dataset
subgraph mozaggregator service(service) aggregator rdbms(fa:fa-database PostgreSQL) end
pipeline --> aggregator pipeline --> spark{fa:fa-star Spark} pipeline --> redash[fa:fa-line-chart Re:dash]
subgraph telemetry.mozilla.org telemetry.js(telemetry.js) --> dist telemetry.js --> evo orphan[Update Orphaning] crashdash[tmo/crashes] end
redash --> crashdash service --> telemetry.js spark --> orphan
telemetry.js --> telemetry-next-node(telemetry-next-node) subgraph alerts.tmo cerberus[fa:fa-search-plus Cerberus] -->medusa medusa --> html medusa --> email end
telemetry-next-node --> cerberus
style redash fill:salmon style spark fill:darkorange style rdbms fill:cornflowerblue style cerberus fill:royalblue style firefox fill:#f61 style fennec fill:#f61 style telemetry.js fill:lightgrey style dist fill:lightgrey style evo fill:lightgrey
A dedicated Spark job feeds daily aggregates to a PostgreSQL database which powers a HTTP service to easily retrieve faceted roll-ups. The service is mainly used by TMO, a dashboard that visualizes distributions and time-series, and Cerberus, an anomaly detection tool that detects and alerts developers of changes in the distributions. Originally the sole purpose of the Telemetry pipeline was to feed data into this dashboard but in time its scope and flexibility grew to support more general use-cases.
Presto & re:dash
We maintain a couple of Presto clusters and a centralized Hive metastore to query Parquet data with SQL. The Hive metastore provides an universal view of our Parquet dataset to both Spark and Presto clusters.
Presto, and other databases, are behind a re:dash service (STMO) which provides a convenient & powerful interface to query SQL engines and build dashboards that can be shared within the company. Mozilla maintains its own fork of re:dash to iterate quickly on new features, but as good open source citizen we push our changes upstream.
Is that it?
No, not really. If you want to read more, check out this article. For example, the DWL pushes some of the Telemetry data to Redshift and other tools that satisfy more niche needs. The pipeline ingests logs from services as well and there are many specialized dashboards out there I haven’t mentioned. We also use Zeppelin as a means to create interactive data analysis notebooks that supports Spark, SQL, Scala and more.
There is a vast ecosystem of tools for processing data at scale, each with their pros & cons. The pipeline grew organically and we added new tools as new use-cases came up that we couldn’t solve with our existing stack. There are still scars left from that growth though which require some effort to get rid of, like ingesting data from schema-less format.
A Detailed Look at the Data Platform
For a more gentle introduction to the data platform, please read the Pipeline Overview article.
This article goes into more depth about the architecture and flow of data in the platform.
The Entire Platform
The full detail of the platform can get quite complex, but at a high level the structure is fairly simple.
Each of these high-level parts of the platform are described in more detail below.
Data Producers
By far most data handled by the Data Platform is produced by Firefox. There are other producers, though, and the eventual aim is to generalize data production using a client SDK or set of standard tools.
Most data is submitted via HTTP POST, but data is also produced in the form of service logs and statsd
messages.
If you would like to locally test a new data producer, the gzipServer
project provides a simplified server that makes it easy to inspect submitted messages.
Ingestion
lb --> tee tee --> mozingest mozingest --> kafka_unvalidated mozingest --> Landfill kafka_unvalidated --> dwl[Data Store Loader] kafka_validated --> cep[Hindsight CEP] kafka_validated --> sparkstreaming[Spark Streaming] Schemas -.->|validation| dwl dwl --> kafka_validated dwl --> s3_heka dwl --> s3_parquet sparkstreaming --> s3_parquet
Data arrives as an HTTP POST of an optionally gzipped payload of JSON. See the common Edge Server specification for details.
Submissions hit a load balancer which handles the SSL connection, then forwards to a "tee" server, which may direct some or all submissions to alternate backends. In the past, the tee was used to manage the cutover between different versions of the backend infrastructure. It is implemented as an OpenResty
plugin.
From there, the mozingest
HTTP Server receives submissions from the tee and batches and stores data durably on Amazon S3 as a fail-safe (we call this "Landfill"). Data is then passed along via Kafka for validation and further processing. If there is a problem with decoding, validation, or any of the code described in the rest of this section, data can be re-processed from this fail-safe store. The mozingest
server is implemented as an nginx
module.
Validation, at a minimum, ensures that a payload is valid JSON (possibly compressed). Many document types also have a JSONSchema specification, and are further validated against that.
Invalid messages are redirected to a separate "errors" stream for debugging and inspection.
Valid messages proceed for further decoding and processing. This involves things like doing GeoIP lookup and discarding the IP address, and attaching some HTTP header info as annotated metadata.
Validated and annotated messages become available for stream processing.
They are also batched and stored durably for later batch processing and ad-hoc querying.
See also the "generic ingestion" proposal which aims to make ingestion, validation, storage, and querying available as self-serve for platform users.
Data flow for valid submissions
Fx->>lb: HTTPS POST
lb->>mi: forward
mi-->>lf: failsafe store
mi->>k: enqueue
k->>dwl: validate, decode
dwl->>k: enqueue validated
dwl->>dl: store durably
Other ingestion methods
Hindsight is used for ingestion of logs from applications and services, it supports parsing of log lines and appending similar metadata as the HTTP ingestion above (timestamp, source, and so on).
Statsd
messages are ingested in the usual way.
Storage
Ingestion --> s3_heka Ingestion --> s3_parquet Ingestion --> landfill Ingestion -.-> stream[Stream Processing] stream --> s3_parquet batch[Batch Processing] --> s3_parquet batch --> PostgreSQL batch --> DynamoDB batch --> s3_public selfserve[Self Serve] --> s3_analysis s3_analysis --> selfserve Hive -->|Presto| redash[Re:dash] PostgreSQL --> redash Redshift --> redash MySQL --> redash BigQuery --> redash
s3_parquet -.- Hive
Amazon S3 forms the backbone of the platform storage layer. The primary format used in the Data Lake is parquet, which is a strongly typed columnar storage format that can easily be read and written by Spark, as well as being compatible with SQL interfaces such as Hive and Presto. Some data is also stored in Heka-framed protobuf format. This custom format is usually reserved for data where we do not have a complete JSONSchema specification.
Using S3 for storage avoids the need for an always-on cluster, which means that data at rest is inexpensive. S3 also makes it very easy to automatically expire (delete) objects after a certain period of time, which is helpful for implementing data retention policies.
Once written to S3, the data is typically treated as immutable - data is not appended to existing files, nor is data normally updated in place. The exception here is when data is back-filled, in which case previous data may be overwritten.
There are a number of other types of storage used for more specialized applications, including relational databases (such as PostgreSQL for the Telemetry Aggregates) and NoSQL databases (DynamoDB is used for a backing store for the TAAR project). Reading data from a variety of RDBMS sources is also supported via Re:dash.
The data stored in Heka format is readable from Spark using libraries in Scala or Python.
Parquet data can be read and written natively from Spark, and many datasets are indexed in a Hive Metastore, making them available through a SQL interface on Re:dash and in notebooks via Spark SQL. Many other SQL data sources are also made available via Re:dash, see this article for more information on accessing data using SQL.
There is a separate data store for self-serve Analysis Outputs, intended to keep ad-hoc, temporary data out of the Data Lake. This is implemented as a separate S3 location, with personal output locations prefixed with each person's user id, similar to the layout of the /home
directory on a Unix system. See the Working with Parquet data cookbook for more details.
Analysis outputs can also be made public using the Public Outputs bucket. This is a web-accessible S3 location for powering public dashboards. This public data is available at https://analysis-output.telemetry.mozilla.org/<job name>/data/<files>
.
Stream Processing
Stream processing is done using Hindsight and Spark Streaming.
Hindsight allows you to run plugins written in Lua inside a sandbox. This gives a safe, performant way to do self-serve streaming analysis. See this article for an introduction. Hindsight plugins do the initial data validation and decoding, as well as writing out to long-term storage in both Heka-framed protobuf and parquet forms.
Spark Streaming is used to read from Kafka and perform low-latency ETL and aggregation tasks. These aggregates are currently used by Mission Control and are also available for querying via Re:dash.
Batch Processing
Batch processing is done using Spark. Production ETL code is written in both Python and Scala.
There are Python and Scala libraries for reading data from the Data Lake in Heka-framed protobuf form, though it is much easier and more performant to make use of a derived dataset whenever possible.
Datasets in parquet format can be read natively by Spark, either using Spark SQL or by reading data directly from S3.
Data produced by production jobs go into the Data Lake, while output from ad-hoc jobs go into Analysis Outputs.
Job scheduling and dependency management is done using Airflow. Most jobs run once a day, processing data from "yesterday" on each run. A typical job launches a cluster, which fetches the specified ETL code as part of its bootstrap on startup, runs the ETL code, then shuts down upon completion. If something goes wrong, a job may time out or fail, and in this case it is retried automatically.
Self Serve Data Analysis
Most of the data analysis tooling has been developed with the goal of being "self-serve". This means that people should be able to access and analyze data on their own, without involving data engineers or operations. Thus can data access scale beyond a small set of people with specialized knowledge of the entire pipeline.
The use of these self-serve tools is described in the Getting Started article. This section focuses on how these tools integrate with the platform infrastructure.
ATMO: Spark Analysis
ATMO is a service for managing Spark clusters for data analysis. Clusters can be launched on demand, or can be scheduled to run a job on an ongoing basis. These clusters can read from the Data Lake described in the Storage section above, and can write results to either public (web-accessible) or private output locations.
Jupyter or Zeppelin notebooks are the usual interface to getting work done using a cluster, though you get full SSH access to the cluster.
Clusters launched via ATMO are automatically killed after a user-defined period of time (by default, 8 hours), though their lifetime can be extended as needed. Each cluster has a user-specific EFS
volume mounted on the /home/hadoop
directory, which means that data stored locally on the cluster persists from one cluster to the next. This volume is shared by all clusters launched by a given ATMO user.
STMO: SQL Analysis
STMO is a customized Re:dash installation that provides self-serve access to a a variety of different datasets. From here, you can query data in the Parquet Data Lake as well as various RDBMS data sources.
STMO interfaces with the data lake using both Presto and Amazon Athena. Each has its own data source in Re:dash. Since Athena does not support user-defined functions, datasets with HyperLogLog columns, such as client_count_daily
, are only available via Presto..
Different Data Sources in STMO connect to different backends, and each backend might use a slightly different flavor of SQL. You should find a link to the documentation for the expected SQL variant next to the Data Sources list.
Queries can be run just once, or scheduled to run periodically to keep data up-to-date.
There is a command-line interface to STMO called St. Mocli, if you prefer writing SQL using your own editor and tools.
Databricks: Managed Spark Analysis
Our Databricks instance (see Databricks docs) offers another notebook interface for doing analysis in Scala, SQL, Python and R.
Databricks provides an always-on shared server which is nice for quick data investigations. ATMO clusters take some time to start up, usually on the order of tens of minutes. The shared server allows you to avoid this start-up cost. Prefer ATMO for heavy analyses since you will have dedicated resources.
TMO: Aggregate Graphs
TMO provides easy visualizations of histogram and scalar measures over time. Time can be in terms of either builds or submission dates. This is the most convenient interface to the Telemetry data, as it does not require any custom code.
Visualization
There are a number of visualization libraries and tools being used to display data.
TMO Dashboards
The landing page at telemetry.mozilla.org
is a good place to look for existing graphs, notably the measurement dashboard which gives a lot of information about histogram and scalar measures collected on pre-release channels.
Notebooks
Use of interactive notebooks has become a standard in the industry, and Mozilla makes heavy use of this approach. ATMO makes it easy to run, share, and schedule Jupyter and Zeppelin notebooks.
Databricks notebooks are also an option.
Others
Re:dash lets you query the data using SQL, but it also supports a number of useful visualizations.
Hindsight's web interface has the ability to visualize time-series data.
Mission Control gives a low-latency view into release health.
Many bespoke visualizations are built using the Metrics Graphics library as a display layer.
Monitoring and Alerting
There are multiple layers of monitoring and alerting.
At a low level, the system is monitored to ensure that it is functioning as expected. This includes things like machine-level resources (network capacity, disk space, available RAM, CPU load) which are typically monitored using DataDog.
Next, we monitor the "transport" functionality of the system. This includes monitoring incoming submission rates, payload sizes, traffic patterns, schema validation failure rates, and alerting if anomalies are detected. This type of anomaly detection and alerting is handled by Hindsight.
Once data has been safely ingested and stored, we run some automatic regression detection on all Telemetry histogram measures using Cerberus. This code looks for changes in the distribution of a measure, and emails probe owners if a significant change is observed.
Production ETL jobs are run via Airflow, which monitors batch job progress and alerts if there are failures in any job. Self-serve batch jobs run via ATMO also generate alerts upon failure.
Scheduled Re:dash queries may also be configured to generate alerts, which is used to monitor the last-mile user facing status of derived datasets. Re:dash may also be used to monitor and alert on high-level characteristics of the data, or really anything you can think of.
Data Exports
Data is exported from the pipeline to a few other tools and systems. Examples include integration with Amplitude for mobile and product analytics, publishing reports and visualizations to the Mozilla Data Collective, and shipping data to other parts of the Mozilla organization.
There are also a few data sets which are made publicly available, such as the Firefox Hardware Report.
Bringing it all together
Finally, here is a more detailed view of the entire platform. Some connections are omitted for clarity.
HTTP Edge Server Specification
This document specifies the behavior of the server that accepts submissions from any HTTP client e.g. Firefox telemetry.
The original implementation of the HTTP Edge Server was tracked in Bug 1129222.
General Data Flow
HTTP submissions come in from the wild, hit a load balancer, then optionally an Nginx proxy, then the HTTP Edge Server described in this document. Data is accepted via a POST/PUT request from clients, which the server will wrap in a Heka message and forward to two places: the Services Data Pipeline, where any further processing, analysis, and storage will be handled; as well as to a short-lived S3 bucket which will act as a fail-safe in case there is a processing error and/or data loss within the main Data Pipeline.
Namespaces
Namespaces are used to control the processing of data from different types of clients, from the metadata that is collected to the destinations where the data is written, processed and accessible. Namespaces are configured in Nginx using a location directive, to request a new namespace file a bug against the Data Platform Team with a short description of what the namespace will be used for and the desired configuration options. Data sent to a namespace that is not specifically configured is assumed to be in the non-Telemetry JSON format described here.
Forwarding to the pipeline
The constructed Heka protobuf message to is written to disk and the pub/sub pipeline (currently Kafka). The messages written to disk serve as a fail-safe, they are batched and written to S3 (landfill) when they reach a certain size or timeout.
Edge Server Heka Message Schema
- required binary
Uuid
; // Internal identifier randomly generated - required int64
Timestamp
; // Submission time (server clock) - required string
Hostname
; // Hostname of the edge server e.g.ip-172-31-2-68
- required string
Type
; // Kafka topic name e.g.telemetry-raw
- required group
Fields
- required string
uri
; // Submission URI e.g./submit/telemetry/6c49ec73-4350-45a0-9c8a-6c8f5aded0cf/main/Firefox/58.0.2/release/20180206200532
- required binary
content
; // POST Body - required string
protocol
; // e.g.HTTP/1.1
- optional string
args
; // Query parameters e.g.v=4
- optional string
remote_addr
; // In our setup it is usually a load balancer e.g.172.31.32.229
- // HTTP Headers specified in the production edge server configuration
- optional string
Content-Length
; // e.g.4722
- optional string
Date
; // e.g.Mon, 12 Mar 2018 00:02:18 GMT
- optional string
DNT
; // e.g.1
- optional string
Host
; // e.g.incoming.telemetry.mozilla.org
- optional string
User-Agent
; // e.g.pingsender/1.0
- optional string
X-Forwarded-For
; // Last entry is treated as the client IP for geoIP lookup e.g.10.98.132.74, 103.3.237.12
- optional string
X-PingSender-Version
;// e.g.1.0
- required string
Server Request/Response
GET Request
Accept GET on /status
, returning OK
if all is well. This can be used to
check the health of web servers.
GET Response codes
- 200 - OK.
/status
and all’s well - 404 - Any GET other than
/status
- 500 - All is not well
POST/PUT Request
Treat POST and PUT the same. Accept POST or PUT to URLs of the form
^/submit/namespace/[id[/dimensions]]$
Example Telemetry format:
/submit/telemetry/docId/docType/appName/appVersion/appUpdateChannel/appBuildID
Specific Telemetry example:
/submit/telemetry/ce39b608-f595-4c69-b6a6-f7a436604648/main/Firefox/61.0a1/nightly/20180328030202
Example non-Telemetry format:
/submit/namespace/doctype/docversion/docid
Specific non-Telemetry example:
/submit/eng-workflow/hgpush/1/2c3a0767-d84a-4d02-8a92-fa54a3376049
Note that id
above is a unique document ID, which is used for de-duping
submissions. This is not intended to be the clientId
field from Telemetry.
If id
is omitted, we will not be able to de-dupe based on submission URLs. It
is recommended that id
be a UUID.
POST/PUT Response codes
- 200 - OK. Request accepted into the pipeline.
- 400 - Bad request, for example an un-encoded space in the URL.
- 404 - not found - POST/PUT to an unknown namespace
- 405 - wrong request type (anything other than POST/PUT)
- 411 - missing content-length header
- 413 - request body too large (Note that if we have badly-behaved clients that retry on
4XX
, we should send back 202 on body/path too long). - 414 - request path too long (See above)
- 500 - internal error
Other Considerations
Compression
It is not desirable to do decompression on the edge node. We want to pass along messages from the HTTP Edge node without "cracking the egg" of the payload.
We may also receive badly formed payloads, and we will want to track the incidence of such things within the main pipeline.
Bad Messages
Since the actual message is not examined by the edge server the only failures that occur are defined by the response status codes above. Messages are only forwarded to the pipeline when a response code of 200 is returned to the client.
GeoIP Lookups
No GeoIP lookup is performed by the edge server. If a client IP is available the the data warehouse loader performs the lookup and then discards the IP before the message hits long-term storage.
Data Retention
The edge server only stores data while batching and will have a retention time
of moz_ingest_landfill_roll_timeout
which is generally only a few minutes.
Retention time for the S3 landfill, pub/sub, and the data warehouse is outside
the scope of this document.
Event Data Pipeline
We collect event-oriented data from different sources. This data is collected and processed in a specific path through our data pipeline, which we will detail here.
style fx_code fill:#f94,stroke-width:0px style fx_extensions fill:#f94,stroke-width:0px style fx_hybrid fill:#f94,stroke-width:0px style mobile fill:#f94,stroke-width:0px style firefox fill:#f61,stroke-width:0px style mobile_telemetry fill:#f61,stroke-width:0px style pipeline fill:#79d,stroke-width:0px style main_summary fill:lightblue,stroke-width:0px style events_table fill:lightblue,stroke-width:0px style mobile_events_table fill:lightblue,stroke-width:0px style redash fill:salmon,stroke-width:0px style amplitude fill:salmon,stroke-width:0px
Overview
Across the different Firefox teams there is a common need for a more fine grained understanding of product usage, like understanding the order of interactions or how they occur over time. To address that our data pipeline needs to support working with event-oriented data.
We specify a common event data format, which allows for broader, shared usage of data processing tools. To make working with event data feasible, we provide different mechanisms to get the event data from products to our data pipeline and make the data available in tools for analysis.
The event format
Events are submitted as an array, e.g.:
[
[2147, "ui", "click", "back_button"],
[2213, "ui", "search", "search_bar", "google"],
[2892, "ui", "completion", "search_bar", "yahoo",
{"querylen": "7", "results": "23"}],
[5434, "dom", "load", "frame", null,
{"prot": "https", "src": "script"}],
// ...
]
Each event is of the form:
[timestamp, category, method, object, value, extra]
Where the individual fields are:
timestamp
:Number
, positive integer. This is the time in ms when the event was recorded, relative to the main process start time.category
:String
, identifier. The category is a group name for events and helps to avoid name conflicts.method
:String
, identifier. This describes the type of event that occurred, e.g.click
,keydown
orfocus
.object
:String
, identifier. This is the object the event occurred on, e.g.reload_button
orurlbar
.value
:String
, optional, may be null. This is a user defined value, providing context for the event.extra
:Object
, optional, may be null. This is an object of the form{"key": "value", ...}
, both keys and values need to be strings. This is used for events when additional richer context is needed.
See also the Firefox Telemetry documentation.
Event data collection
Firefox event collection
To collect this event data in Firefox there are different APIs in Firefox, all addressing different use cases:
- The Telemetry event API allows easy recording of events from Firefox code.
- The dynamic event API allows code from Mozilla addons to record new events into Telemetry without shipping Firefox code.
- The Telemetry extension API (work in progress) will allow Mozilla extensions to record new events into Telemetry.
- The Hybrid-content API allows specific white-listed Mozilla content code to record new events into Telemetry.
For all these APIs, events will get sent to the pipeline through the main ping, with a hard limit of 500 events per ping. In the future, Firefox events will be sent through a separate events ping, removing the hard limit. As of Firefox 61, all events recorded through these APIs are automatically counted in scalars.
Finally, custom pings can follow the event data format and potentially connect to the existing tooling with some integration work.
Mobile event collection
Mobile events data primarily flows through the mobile events ping (ping schema), from e.g. Firefox iOS, Firefox for Fire TV and Rocket.
Currently we also collect event data from Firefox Focus through the focus-events
ping,
using the telemetry-ios
and
telemetry-android
libraries.
Datasets
On the pipeline side, the event data is made available in different datasets:
main_summary
has a row for each main ping and includes its event payload.events
contains a row for each event received. See this sample query.telemetry_mobile_event_parquet
contains a row for each mobile event ping. See this sample query.focus_events_longitudinal
currently contains events from Firefox Focus.
Data tooling
The above datasets are all accessible through Re:dash and Spark jobs.
For product analytics based on event data, we have Amplitude (hosted by the IT data team). We can connect our event data sources data to Amplitude. We have an active connector to Amplitude for mobile events, which can push event data over daily. For Firefox Desktop events this will be available soon.
Mozilla Firefox Data Analysis Tools
This is a starting point for making sense of (and gaining access to) all of the Firefox-related data analysis tools. There are a number of different tools available, all with their own strengths, tailored to a variety of use cases and skill sets.
sql.telemetry.mozilla.org
(STMO)
The sql.telemetry.mozilla.org
(STMO) site
is an instance of the very fine Re:dash software, allowing
for SQL-based exploratory analysis and visualization / dashboard
construction. Requires (surprise!) familiarity with SQL, and for your data to
be explicitly exposed as an STMO data source. Bugs or feature requests can be
reported in our issue tracker.
analysis.telemetry.mozilla.org
(ATMO)
The
analysis.telemetry.mozilla.org
(ATMO) site can be used to launch and gain access to virtual machines running
Apache Spark clusters which have been pre-configured with access to the raw data
stored in our long term storage S3 buckets. Spark allows you to use
Python or Scala to perform arbitrary analysis and generate arbitrary
output. Once developed, ATMO can also be used to run recurring Spark jobs
for data transformation, processing, or reporting. Requires Python or Scala
programming skills and knowledge of various data APIs. Learn more by visiting
the documentation or
tutorials.
Databricks
Offers notebook interface with shared, always-on, autoscaling cluster
(attaching your notebooks to shared_serverless
is the best way to start).
Convenient for quick data investigations. Users can get help on #databricks
channel on IRC and are advised to join the
databricks-discuss@mozilla.com
group.
telemetry.mozilla.org
(TMO)
Our telemetry.mozilla.org
(TMO) site is the
'venerable standby' of Firefox telemetry analysis tools. It uses aggregate
telemetry data (as opposed to the collated data sets that are exposed to most
of the other tools) so it provides less latency than most but is unsuitable for
examining at the individual client level. It provides a powerful UI that allows
for sophisticated ad-hoc analysis without the need for any specialized
programming skills, but with so many options the UI can be a bit intimidating
for novice users.
Distribution Viewer
Distribution Viewer (deprecated) was a simple tool
that provides a set of cumulative distribution
graphs
for a pre-specified selection of Firefox user metrics. These metrics are
extracted from a 1% sample of the clientId
s from Firefox Telemetry. These plots
will allow you to understand how values of different metrics are spread out
among our population of users rather than just using a one number summary (such
as a mean or median). By viewing the entire distribution, you can get a sense
of the importance of behavior at the extremes as well as anomalies within the
population that might indicate interesting behavior. Very simple to use (no
programming required) and able to provide interesting insights, but not usually
suitable for ad-hoc analysis.
Real Time / CEP
The "real time" or "complex event processing" (CEP) system is part of the ingestion infrastructure that processes all of our Firefox telemetry data. It provides extremely low latency access to the data as it's flowing through our ingestion system on its way to long term storage. As a CEP system, it is unlike the rest of our analysis tools in that it is up to the analyst to specify and maintain state from the data that is flowing; it is non-trivial to revisit older data that has already passed through the system. The CEP is very powerful, allowing for sophisticated monitoring, alerting, reporting, and dashboarding. Developing new analysis plugins requires knowledge of the Lua programming language, relevant APIs, and a custom filter configuration syntax. Learn more about how to do this in our Creating a Real-time Analysis Plugin article.
Introduction
Apache Spark is a data processing engine designed to be fast and easy to use. We have setup Jupyter notebooks that use Spark to analyze our Telemetry data. Jupyter notebooks can be easily shared and updated among colleagues, and, when combined with Spark, enable richer analysis than SQL alone.
The Spark clusters can be launched from ATMO. The Spark Python API is called PySpark.
Note that this documentation focuses on ATMO, but analysis with Spark is also possible using Databricks. For more information please see this example notebook.
Setting Up a Spark Cluster On ATMO
- Go to https://analysis.telemetry.mozilla.org
- Click “Launch an ad-hoc Spark cluster”.
- Enter some details:
- The “Cluster Name” field should be a short descriptive name, like “chromehangs analysis”.
- Set the number of workers for the cluster. Please keep in mind to use resources sparingly; use a single worker to write and debug your job.
- Upload your SSH public key.
- Click “Submit”.
- A cluster will be launched on AWS pre-configured with Spark, Jupyter
and some handy data analysis libraries like
pandas
andmatplotlib
.
Once the cluster is ready, you can tunnel Jupyter through SSH by following the instructions on the dashboard. For example:
ssh -i ~/.ssh/id_rsa -L 8888:localhost:8888 hadoop@ec2-54-70-129-221.us-west-2.compute.amazonaws.com
Finally, you can launch Jupyter in Firefox by visiting http://localhost:8888.
The Python Jupyter Notebook
When you access http://localhost:8888, two example Jupyter notebooks are available to peruse.
Starting out, we recommend looking through the Telemetry Hello World notebook. It gives a nice overview of Jupyter and analyzing telemetry data using PySpark and the RDD API.
Using Jupyter
Jupyter Notebooks contain a series of cells. Each cell contains code or markdown. To switch between the two, use the drop-down at the top. To run a cell, use shift-enter; this either compiles the markdown or runs the code. To create new cell, select Insert -> Insert Cell Below.
A cell can output text or plots.
To output plots inlined with the cell,
run %pylab inline
, usually below your import statements:
The notebook is setup to work with Spark. See the "Using Spark" section for more information.
Schedule a periodic job
Scheduled Spark jobs allow a Jupyter notebook to be updated consistently, making a nice and easy-to-use dashboard.
To schedule a Spark job:
- Visit the analysis provisioning dashboard at https://analysis.telemetry.mozilla.org and sign in
- Click “Schedule a Spark Job”
- Enter some details:
- The “Job Name” field should be a short descriptive name, like “chromehangs analysis”.
- Upload your Jupyter notebook containing the analysis.
- Set the number of workers of the cluster in the “Cluster Size” field.
- Set a schedule frequency using the remaining fields.
Now, the notebook will be updated automatically and the results can be easily shared. Furthermore, all files stored in the notebook's local working directory at the end of the job will be automatically uploaded to S3, which comes in handy for simple ETL workloads for example.
For reference, see Simple Dashboard with Scheduled Spark Jobs and Plotly.
Sharing a Notebook
Jupyter notebooks can be shared in a few different ways.
Sharing a Static Notebook
An easy way to share is using a gist on Github.
- Download file as
.ipynb
- Upload to a gist on
gist.github.com
- Enter the gist URL at Jupyter nbviewer
- Share with your colleagues!
Sharing a Scheduled Notebook
Setup your scheduled notebook. After it's run, do the following:
- Go to the "Schedule a Spark job" tab in ATMO
- Get the URL for the notebook (under 'Currently Scheduled Jobs')
- Paste that URL into Jupyter nbviewer
Zeppelin Notebooks
We also have support for Apache Zeppelin notebooks. The notebook server for that is running on port 8890, so you can connect to it just by tunnelling the port (instead of port 8888 for Jupyter). For example:
ssh -i \~/.ssh/id\_rsa -L 8890:localhost:8890
hadoop@ec2-54-70-129-221.us-west-2.compute.amazonaws.com
Using Spark
Spark is a general-purpose cluster computing system - it allows users to run general execution graphs. APIs are available in Python, Scala, and Java. The Jupyter notebook utilizes the Python API. In a nutshell, it provides a way to run functional code (e.g. map, reduce, etc.) on large, distributed data.
Check out Spark Best Practices for tips on using Spark to its full capabilities.
SparkContext
(sc
)
Access to the Spark API is provided through SparkContext
. In the Jupyter
notebook, this is the sc
object. For example, to create a
distributed RDD of monotonically increasing numbers 1-1000:
numbers = range(1000)
# no need to initialize sc in the Jupyter notebook
numsRdd = sc.parallelize(numbers)
nums.take(10) #no guaranteed order
Spark RDD
The Resilient Distributed Dataset (RDD) is Spark's basic data structure. The operations that are performed on these structures are distributed to the cluster. Only certain actions (such as collect() or take(N)) pull an RDD in locally.
RDD's are nice because there is no imposed schema - whatever they contain, they distribute around the cluster. Additionally, RDD's can be cached in memory, which can greatly improve performance of some algorithms that need access to data over and over again.
Additionally, RDD operations are all part of a directed, acyclic graph. This gives increased redundancy, since Spark is always able to recreate an RDD from the base data (by rerunning the graph), but also provides lazy evaluation. No computation is performed while an RDD is just being transformed (a la map), but when an action is taken (e.g. reduce, take) the entire computation graph is evaluated. Continuing from our previous example, the following gives some of the peaks of a sin wave:
import numpy as np
#no computation is performed on the following line!
sin_values = numsRdd.map(lambda x : np.float(x) / 10).map(lambda x : (x, np.sin(x)))
#now the entire computation graph is evaluated
sin_values.takeOrdered(5, lambda x : -x[1])
For jumping into working with Spark RDD's, we recommend reading the Spark Programming Guide.
Spark SQL and Spark DataFrames/Datasets
Spark also supports traditional SQL, along with special data structures
that require schemas. The Spark SQL API can be accessed with the
spark
object. For example:
longitudinal = spark.sql('SELECT * FROM longitudinal')
creates a DataFrame that contains all the longitudinal data. A Spark DataFrame is essentially a distributed table, a la Pandas or R DataFrames. Under the covers they are an RDD of Row objects, and thus the entirety of the RDD API is available for DataFrames, as well as a DataFrame specific API. For example, a SQL-like way to get the count of a specific OS:
longitudinal.select("os").where("os = 'Darwin'").count()
To Transform the DataFrame object to an RDD, simply do:
longitudinal_rdd = longitudinal.rdd
In general, however, the DataFrames are performance optimized, so it's worth the effort to learn the DataFrame API.
For more overview, see the SQL Programming Guide. See also the Longitudinal Tutorial, one of the available example notebooks when you start a cluster.
Available Data Sources for SparkSQL
For information about data sources available for querying (e.g. Longitudinal dataset), see Choosing a Dataset.
These datasets are optimized for fast access, and will far out-perform analysis on the raw Telemetry ping data.
Accessing the Spark UI
After establishing an SSH connection to the Spark cluster, go to https://localhost:8888/spark to see the Spark UI. It has information about job statuses and task completion, and may help you debug your job.
The MozTelemetry
Library
We have provided a library that gives easy access to the raw telemetry ping data. For example usage, see the Telemetry Hello World example notebook. Detailed documentation for the library can be found at the Python MozTelemetry Documentation.
Using the Raw Ping Data
First off, import the moztelemetry
library using the following:
from moztelemetry.dataset import Dataset
The ping data is an RDD of JSON elements. For example, using the following:
pings = Dataset.from_source("telemetry") \
.where(docType='main') \
.where(submissionDate="20180101") \
.where(appUpdateChannel="nightly") \
.records(sc, sample=0.01)
returns an RDD of 1/100th of Firefox Nightly JSON pings submitted on from January 1 2018. Now, because it's JSON, pings are easy to access. For example, to get the count of each OS type:
os_names = pings.map(lambda x: (x['environment']['system']['os']['name'], 1))
os_counts = os_names.reduceByKey(lambda x, y: x + y)
os_counts.collect()
Alternatively, moztelemetry
provides the get_pings_properties
function, which will gather the data for you:
from moztelemetry import get_pings_properties
subset = get_pings_properties(pings, ["environment/system/os/name"])
subset.map(lambda x: (x["environment/system/os/name"], 1)).reduceByKey(lambda x, y: x + y).collect()
FAQ
Please add more FAQ as questions are answered by you or for you.
How can I load parquet datasets in a Jupyter notebook?
Use spark.read.parquet
, e.g.:
dataset = spark.read.parquet("s3://the_bucket/the_prefix/the_version")`
For more information see Working with Parquet.
I got a REMOTE HOST IDENTIFICATION HAS CHANGED! error
AWS recycles hostnames, so this warning is expected.
Removing the offending key from $HOME/.ssh/known_hosts
will remove the warning.
You can find the line to remove by finding the line in the output that says
Offending key in /path/to/hosts/known_hosts:2
Where 2 is the line number of the key that can be deleted. Just remove that line, save the file, and try again.
Why is my notebook hanging?
There are a few common causes for this:
- Currently, our Spark notebooks can only run a single Python kernel at a time. If you open multiple notebooks on the same cluster and try to run both, the second notebook will hang. Be sure to close notebooks using "Close and Halt" under the "File" drop-down.
- The connection from PySpark to the Spark driver might be lost. Unfortunately the best way to recover from this for the moment seems to be spinning up a new cluster.
- Cancelling execution of a notebook cell doesn't cancel any spark jobs
that might be running in the background. If your spark commands seem to
be hanging, try running
sc.cancelAllJobs()
.
How can I keep running after closing the notebook?
For long-running computation, it might be nice to close the notebook (and the SSH session) and look at the results later. Unfortunately, all cell output will be lost when a notebook is closed (for the running cell). To alleviate this, there are a few options:
- Have everything output to a variable. These values should still be available when you reconnect.
- Put %%capture at the beginning of the cell to store all output. See the documentation.
How do I load an external library into the cluster?
Assuming you've got a URL for the repo, you can create an egg for it this way:
!git clone `<repo url>` && cd `<repo-name>` && python setup.py bdist_egg`\
sc.addPyFile('`<repo-name>`/dist/my-egg-file.egg')`
Alternately, you could just create that egg locally, upload it to a web server, then download and install it:
import requests`\
r = requests.get('`<url-to-my-egg-file>`')`\
with open('mylibrary.egg', 'wb') as f:`\
f.write(r.content)`\
sc.addPyFile('mylibrary.egg')`
You will want to do this before you load the library. If the library is already loaded, restart the kernel in the Jupyter notebook.
SQL Style Guide
Table of Contents
- Consistency
- Reserved Words
- Variable Names
- Aliasing
- Left Align Root Keywords
- Code Blocks
- Parentheses
- Boolean at the Beginning of Line
- Nested Queries
- About this Document
Consistency
From Pep8:
A style guide is about consistency. Consistency with this style guide is important. Consistency within a project is more important. Consistency within one module or function is the most important.
However, know when to be inconsistent -- sometimes style guide recommendations just aren't applicable. When in doubt, use your best judgment. Look at other examples and decide what looks best. And don't hesitate to ask!
Reserved Words
Always use uppercase for reserved keywords like SELECT
, WHERE
, or AS
.
Variable Names
- Use consistent and descriptive identifiers and names.
- Use lower case names with underscores, such as
first_name
. Do not use CamelCase. - Presto functions, such as
cardinality
,approx_distinct
, orsubstr
, are identifiers and should be treated like variable names. - Names must begin with a letter and may not end in an underscore.
- Only use letters, numbers, and underscores in variable names.
Aliasing
Always include the AS
keyword when aliasing a variable,
it's easier to read when explicit.
Good
SELECT
substr(submission_date, 1, 6) AS month
FROM
main_summary
LIMIT
10
Bad
SELECT
substr(submission_date, 1, 6) month
FROM
main_summary
LIMIT
10
Left Align Root Keywords
Root keywords should all start on the same character boundary. This is counter to the common "rivers" pattern described here.
Good:
SELECT
client_id,
submission_date
FROM
main_summary
WHERE
sample_id = '42'
AND submission_date > '20180101'
LIMIT
10
Bad:
SELECT client_id,
submission_date
FROM main_summary
WHERE sample_id = '42'
AND submission_date > '20180101'
Code Blocks
Root keywords should be on their own line. For example:
Good:
SELECT
client_id,
submission_date
FROM
main_summary
WHERE
submission_date > '20180101'
AND sample_id = '42'
LIMIT
10
It's acceptable to include an argument on the same line as the root keyword, if there is exactly one argument.
Acceptable:
SELECT
client_id,
submission_date
FROM main_summary
WHERE
submission_date > '20180101'
AND sample_id = '42'
LIMIT 10
Do not include multiple arguments on one line.
Bad:
SELECT client_id, submission_date
FROM main_summary
WHERE
submission_date > '20180101'
AND sample_id = '42'
LIMIT 10
Bad
SELECT
client_id,
submission_date
FROM main_summary
WHERE submission_date > '20180101'
AND sample_id = '42'
LIMIT 10
Parentheses
If parentheses span multiple lines:
- The opening parenthesis should terminate the line.
- The closing parenthesis should be lined up under the first character of the line that starts the multi-line construct.
- The contents of the parentheses should be indented one level.
For example:
Good
WITH sample AS (
SELECT
client_id,
FROM
main_summary
WHERE
sample_id = '42'
)
Bad (Terminating parenthesis on shared line)
WITH sample AS (
SELECT
client_id,
FROM
main_summary
WHERE
sample_id = '42')
Bad (No indent)
WITH sample AS (
SELECT
client_id,
FROM
main_summary
WHERE
sample_id = '42'
)
Boolean at the Beginning of Line
AND
and OR
should always be at the beginning of the line.
For example:
Good
...
WHERE
submission_date > 20180101
AND sample_id = '42'
Bad
...
WHERE
submission_date > 20180101 AND
sample_id = '42'
Nested Queries
Do not use nested queries. Instead, use common table expressions to improve readability.
Good:
WITH sample AS (
SELECT
client_id,
submission_date
FROM
main_summary
WHERE
sample_id = '42'
)
SELECT *
FROM sample
LIMIT 10
Bad:
SELECT *
FROM (
SELECT
client_id,
submission_date
FROM
main_summary
WHERE
sample_id = '42'
)
LIMIT 10
About this Document
This document was heavily influenced by https://www.sqlstyle.guide/
Changes to the style guide should be reviewed by at least one member of both the Data Engineering team and the Data Science team.
Cookbooks
A Cookbook is a focused tutorial to guide you through a focused task. For example, a Cookbook could:
- Introduce you to what types of analyses are common for (e.g.) Search or Crash data
- Guide you through an example analysis to demonstrate the basic principles behind a new statistical technique
Telemetry Alerts
Many Telemetry probes were created to show performance trends over time. Sudden changes happening in Nightly could be the sign of an unintentional performance regression, so we introduced a system to automatically detect and alert developers about such changes.
Thus we created Telemetry Alerts. It comes in two pieces: Cerberus the Detector and Medusa the Front-end.
Cerberus
Every day Cerberus grabs the latest aggregated information about all
non-keyed Telemetry probes from aggregates.telemetry.mozilla.org
and
compares the distribution of values from the Nightly builds of the
past two days to the distribution of values from the Nightly builds of
the past seven days.
It does this by calculating the Bhattacharyya distance between the two distributions and guessing whether or not they are significant and narrow.
It places all detected changes in a file for ingestion by Medusa.
Medusa
Medusa is in charge of emailing people when distributions change and for displaying the website https://alerts.telemetry.mozilla.org which contains pertinent information about each detected regression.
Medusa also checks for expiring histograms and sends emails notifying of their expiry.
What it can do
Telemetry Alerts is very good at identifying sudden changes in the
shapes of normalized distributions of Telemetry probes. If you can see
the distribution of GC_MS
shift from one day
to the next, then likely so can Cerberus.
What can't it do
Telemetry Alerts is not able to see sudden shifts in volume. It is also very easily fooled if a change happens over a long period of time or doesn't fundamentally alter the shape of the probe's histogram.
So if you have a probe like
SCALARS_BROWSER.ENGAGEMENT.MAX_CONCURRENT_TAB_COUNT
,
Cerberus won't notice if:
- The number of pings reporting this value decreased in half, but otherwise reported the same spread of numbers
- The value increases very slowly over time (which I'd expect it to do given how good Session Restore is these days)
- We suddenly received twice as many pings from 200-tab subsessions (the dominance of 1-tab pings would likely ensure the overall shape of the distribution changed insufficiently much for Cerberus to pick up on it)
Telemetry Alert Emails
One of the main ways humans interact with Telemetry Alerts is through the emails sent by Medusa.
At present the email contains a link to the alert's page on https://alerts.telemetry.mozilla.org and a link to a pushlog on https://hg.mozilla.org detailing the changes newly-present in the Nightly build that exhibited the change.
Triaging a Telemetry Alert Email
Congratulations! You have just received a Telemetry Alert!
Now what?
Assumption: Alerts happen because of changes in probes. Changes in probes happen because of changes in related code. If we can identify the code change, we can find the bug that introduced the code change. If we can find the bug, we can ni? the person who made the change.
Goal: Identify the human responsible for the Alert so they can identify if it is good/bad/intentional/exceptional/temporary/permanent/still relevant/having its alerts properly looked after.
Guide:
-
Is this alert just one of a group of similar changes by topic? By build?
-
If there's a group by topic (
SPDY
,URLCLASSIFIER
, ...) check to see if the changes are similar in direction/magnitude. They usually are. -
If there's a group by build but not topic, maybe a large merge kicked things over. Unfortunate, as that will make finding the source more difficult.
-
Open the
hg.mozilla.org
andalerts.telemetry.mozilla.org
links in tabs
-
On
alerts.tmo
, does it look like an improvement or regression? (This is just a first idea and might change. There are often extenuating circumstances that make something that looks bad into an improvement, and vice versa.) -
On
hg.mo
, does the topic of the changed probe exist in the pushlog? In other words, does any part of the probe's name show up in the summaries of any of the commits?
-
From
alerts.tmo
, open the https://telemetry.mozilla.org link by clicking on the plot's title. Open another tab to the Evolution View.
-
Is the change temporary? (might have been noticed elsewhere and backed out)
-
Is the change up or down?
-
Has it happened before?
-
Was it accompanied by a decrease in submission volume? (the second graph at the bottom of the Evolution View)
-
On the Distribution View, did the Sample Count increase? Decrease? (this signifies that the change could be because of the addition or subtraction of a population of values. For instance, we could suddenly stop sending 0 values which would shift the graph to the right. This could be a good thing (we're not handling useless things any longer) a bad thing (something broke and we're no longer measuring the same thing we used to measure) or indifferent)
-
If you still don't have a cause
-
Use DXR or searchfox to find where the probe is accumulated.
-
Click "Log" in that view.
-
Are there any changesets in the resultant
hg.mo
list that ended up in the build we received the Alert for?
-
If you still don't know what's going on
- find a domain expert on IRC and bother them to help you out. Domain knowledge is awesome.
From pursuing these steps or sub-steps you should now have two things: a bug that likely caused the alert, and an idea of what the alert is about.
Now comment on the bug. Feel free to use this script:
This bug may have contributed to a sudden change in the Telemetry probe <PROBE_NAME>[1] which seems to have occurred in Nightly <builddate>[2][3].
There was a <describe the change: increase/decrease, population addition/subtraction, regression/improvement, change in submission/sample volume...>.
This might mean <wild speculation. It'll encourage the ni? to refute it :) >
Is this an improvement? A regression?
Is this intentional? Is this expected?
Is this probe still measuring something useful?
[1]: <the alerts.tmo link>
[2]: <the hg.mo link for the pushlog>
[3]: <the telemetry.mozilla.org link showing the Evolution View>
Then ni? the person who pushed the change. Reply-all to the
dev-telemetry-alerts
mail with a link to the bug and some short notes on
what you found.
From here the user on ni? should get back to you in fairly short order and either help you find the real bug that caused it, or help explain what the Alert was all about. More often than not it is an expected change from a probe that is still operating correctly and there is no action to take...
...except making sure you never have to respond to an Alert for this
probe again, that is. File a bug in that bug's component to update the
Alerting probe to have a valid, monitored alert_emails
field so that
the next time it misbehaves they can be the ones to explain themselves
without you having to spend all this time tracking them down.
Working with Parquet
This guide will give you a quick introduction to working with Parquet files at Mozilla. You can also refer to Spark's documentation on the subject here.
Most of our derived datasets,
like the longitudinal
or main_summary
tables,
are stored in Parquet files.
You can access these datasets in re:dash,
but you may want to access the data from an
ATMO cluster
if SQL isn't powerful enough for your analysis
or if a sample of the data will not suffice.
Table of Contents
Reading Parquet Tables
Spark provides native support for reading parquet files.
The result of loading a parquet file is a
DataFrame.
For example, you can load main_summary
with the following snippet:
# Parquet files are self-describing so the schema is preserved.
main_summary = spark.read.parquet('s3://telemetry-parquet/main_summary/v1/')
You can find the S3 path for common datasets in Choosing a Dataset or in the reference documentation.
Writing Parquet Tables
Saving a table to parquet is a great way to share an intermediate dataset.
Where to save data
You can save data to a subdirectory of the following bucket:
s3://net-mozaws-prod-us-west-2-pipeline-analysis/<username>/
Use your username for the subdirectory name.
This bucket is available to all ATMO clusters and Airflow.
When your analysis is production ready,
open a PR against python_mozetl
.
How to save data
You can save the DataFrame test_dataframe
to the telemetry-test-bucket
with the following command:
test_dataframe.write.mode('error') \
.parquet('s3://telemetry-test-bucket/my_subdir/table_name')
Note: data saved to s3://telemetry-test-bucket
will automatically be deleted
after 30 days.
Accessing Parquet Tables from Re:dash
See Creating a custom re:dash dataset.
Creating Your Own Dataset to Query in re:dash
- Create a spark notebook that does the transformations you need, either on raw data (using Dataset API) or on parquet data
- Output the results of that to an S3 location, usually
telemetry-parquet/user/$YOUR_DATASET/v$VERSION_NUMBER/submission_date=$YESTERDAY/
. This would partition bysubmission_date
, meaning each day this runs and is outputted to a new location in S3. Do NOT put thesubmission_date
in the parquet file as well! A column name cannot also be the name of a partition. Partitioning is optional, but datasets should have a version in the path. - Using this template,
open a bug to publish the dataset (making it available in Spark and Re:dash) with the following attributes:
- Add whiteboard tag
[DataOps]
- Title: "Publish dataset"
- Content: Location of the dataset in S3 (from step 2 above) and the desired table name
- Add whiteboard tag
Sending a Custom Ping
Got some new data you want to send to us? How in the world do you send a new ping? Follow this guide to find out.
Write Your Questions
Do not try and implement new pings unless you know specifically what questions you're trying to answer. General questions about "How do users use our product?" won't cut it - these need to be specific, concrete asks that can be translated to data points. This will also make it easier down the line as you start data review.
More detail on how to design and implement new pings for Firefox Desktop can be found here.
Choose a Namespace and DocType
For new telemetry pings, the namespace is simply telemetry
. For non-Telemetry
pings, choose a namespace that uniquely identifies the product that will be
generating the data.
The DocType is used to differentiate pings within a namespace. It can be as
simple as event
, but should generally be descriptive of the data being
collected.
Both namespace and DocType are limited to the pattern [a-zA-Z-]
. In other words, hyphens and letters from the ISO basic Latin alphabet.
Create a Schema
Use JSON Schema to start with. See the "Adding a new schema" documentation and examples schemas in the Mozilla Pipeline Schemas repo. This schema is just used to validate the incoming data; any ping that doesn't match the schema will be removed. Validate your JSON Schema using a validation tool.
We already have automatic deduplicating based on docId
, which catches about 90% of duplicates and
removes them from the dataset.
Start a Data Review
Data review for new pings is more complicated than when adding new probes. See Data Review for Focus-Event Ping as an example. Consider where the data falls in the Data Collection Categories.
Submit Schema to mozilla-services/mozilla-pipeline-schemas
The first schema added should be the JSON Schema made in step 2. Add at least one example ping which the data can be validated against. These test pings will be validated automatically during the build.
Additionally,
a Parquet output
schema should be added. This would add a new dataset, available in Re:dash.
The best documentation we have for the Parquet schema is by looking at the examples in
mozilla-pipeline-schemas
.
Parquet output also has a metadata
section. These are fields added to the ping at ingestion time;
they might come from the URL submitted to the edge server, or the IP Address used to make the request.
This document
lists available metadata fields for all pings.
The stream you're interested in is probably telemetry
.
For example, look at system-addon-deployment-diagnostics
immediately under the telemetry
top-level
field. The schema
element has top-level fields (e.g. Timestamp
, Type
), as well as more fields
under the Fields
element. Any of these can be used in the metadata
section of your parquet schema,
except for submission
.
Some common ones for Telemetry data might be:
Date
submissionDate
geoCountry
geoCity
geoSubdivision1
geoSubdivision2
normalizedChannel
appVersion
appBuildId
And for non-Telemetry data:
geoCountry
geoCity
geoSubdivision1
geoSubdivision2
documentId
Important Note: Schema evolution of nested structs is currently broken, so you will not be able to add
any fields in the future to your metadata
section. We recommend adding any that may seem useful.
Testing The Schema
For new data, use the edge validator to test your schema.
If your data is already being sent, and you want to test the schema you're writing on the data
that is currently being ingested, you can test your Parquet output in
Hindsight by using an output plugin.
See Core ping output plugin
for an example, where the parquet schema is specified as parquet_schema
. If no errors arise, that
means it should be correct. The "Deploy" button should not be used to actually deploy, that will be
done by operations in the next step.
(Telemetry-specific) Deploy the Plugin
File a bug to deploy the new schema.
Real-time analysis will be key to ensuring your data is being processed and parsed correctly.
It should follow the format specified in
MozTelemetry docType
monitor.
This allows you to check validation errors, size changes, duplicates, and more. Once you have
the numbers set, file a
bug to let ops deploy it.
Start Sending Data
If you're using the Telemetry APIs, use those built-in. These can be with the Gecko Telemetry APIs, the Android Telemetry APIs, or the iOS Telemetry APIs.
For non-Telemetry data, see our HTTP edge server specification
and specifically the non-Telemetry example for the expected format. The edge
server endpoint is https://incoming.telemetry.mozilla.org
.
(Non-Telemetry) Access Your Data
First confirm with the reviewers of your schema pull request that your schemas have been deployed.
In the following links, replace <namespace>
, <doctype>
And <docversion>
with
appropriate values. Also replace -
with _
in <namespace>
if your
namespace contains -
characters.
CEP
Once you've sent some pings, refer to the following real-time analysis plugins to verify that your data is being processed:
https://pipeline-cep.prod.mozaws.net/dashboard_output/graphs/analysis.moz_generic_error_monitor.<namespace>.html
https://pipeline-cep.prod.mozaws.net/dashboard_output/analysis.moz_generic_<namespace>_<doctype>_<docversion>.submissions.json
https://pipeline-cep.prod.mozaws.net/dashboard_output/analysis.moz_generic_<namespace>_<doctype>_<docversion>.errors.txt
If this first graph shows ingestion errors, you can view the corresponding error messages in the third link. Otherwise, you should be able to view the last ten processed submissions via the second link. You can also write your own custom real-time analysis plugins using this same infrastructure if you desire; use the above plugins as examples and see here for a more detailed explanation.
If you encounter schema validation errors, you can fix your data or
submit another pull request
to amend your schemas. Backwards-incompatible schema changes should generally
be accompanied by an increment to docversion
.
Once you've established that your pings are flowing through the real-time system, verify that you can access the data from the downstream systems.
STMO
In the Athena data source, a new table
<namespace>_<doctype>_parquet_<docversion>
will be created for your data. A
convenience pointer <namespace>_<doctype>_parquet
will also refer to the latest
available docversion
of the ping. The data is partitioned by
submission_date_s3
which is formatted as %Y%m%d
, like 20180130
, and is
generally updated hourly. Refer to the STMO documentation
for general information about using Re:dash.
This table may take up to a day to appear in the Athena source; if you still don't see a table for your new ping after 24 hours, contact Data Operations so that they can investigate. Once the table is available, it should contain all the pings sent during that first day, regardless of how long it takes for the table to appear.
ATMO
The data should be available in S3 at:
s3://net-mozaws-prod-us-west-2-pipeline-data/<namespace>-<doctype>-parquet/v<docversion>/
Note: here <namespace>
should not be escaped.
Refer to the Spark FAQ for details on accessing this table via ATMO.
Write ETL Jobs
We have some basic generalized ETL jobs you can use to transform your data on a batch basis - for example, a Longitudinal or client-count-daily like dataset. Otherwise, you'll have to write your own.
You can schedule it on Airflow, or you can run it as a job in ATMO. If the output is parquet, you can add it to the Hive metastore to have it available in re:dash. Check the docs on creating your own datasets.
Build Dashboards Using ATMO or STMO
Last steps! What are you using this data for anyway?
Working with HyperLogLog in Zeppelin
This guide will set you up to work with HyperLogLog in Zeppelin.
Zeppelin Configuration
-
Launch a Zeppelin notebook
-
Open the panel for Interpreter configuration
- This can be found at
localhost:8890/#/intepreter
- This can be found at
-
Add the Sonatype Snapshot repository
- Expand the Repository Information cog, next to the create button
- Settings are as follows:
ID: Sonatype OSS Snapshots URL: https://oss.sonatype.org/content/repositories/snapshots Snapshot: true
-
Add the dependency to the
Spark
interpreter- spark > Edit > Dependencies
- Add the following entry to artifacts:
com.mozilla.telemetry:spark-hyperloglog:2.0.0-SNAPSHOT
These steps should enable the use of the library within the notebook. Using the
%dep
interpreter to dynamically add the library is currently not supported.
You may want to add a short snippet near the top of the notebook to make the
functions more accessible.
import org.apache.spark.sql.functions.udf
import com.mozilla.spark.sql.hyperloglog.aggregates._
import com.mozilla.spark.sql.hyperloglog.functions._
val HllMerge = new HyperLogLogMerge
val HllCreate = udf(hllCreate _)
val HllCardinality = udf(hllCardinality _)
spark.udf.register("hll_merge", HllMerge)
spark.udf.register("hll_create", HllCreate)
spark.udf.register("hll_cardinality", HllCardinality)
Example Usage
This is a short example which can also be used to verify expected behavior.
case class Example(uid: String, color: String)
val examples = Seq(
Example("uid_1", "red"),
Example("uid_2", "blue"),
Example("uid_3", "blue"),
Example("uid_3", "red"))
val frame = examples.toDF()
In a single expression, we can create and count the unique id's that appear in the DataFrame.
>>> frame
.select(expr("hll_create(uid, 12) as hll"))
.groupBy()
.agg(expr("hll_cardinality(hll_merge(hll)) as count"))
.show()
+-----+
|count|
+-----+
| 3|
+-----+
The code in the previous section defines UDF functions that can be used directly as Spark column expressions. Let's explore the data structure a bit more in slightly more detail.
val example = frame
.select(HllCreate($"uid", lit(12)).alias("hll"), $"color")
.groupBy("color")
.agg(HllMerge($"hll").alias("hll"))
example.createOrReplaceTempView("example")
This groups uid
s by the color
attribute and registers the table with the SQL
context. Each row contains a HLL binary object representing the set of uid
s.
>>> example.show()
+-----|--------------------+
|color| hll|
+-----|--------------------+
| red|[02 0C 00 00 00 0...|
| blue|[02 0C 00 00 00 0...|
+-----|--------------------+
Each HLL object takes up 2^12
bits of space. This configurable size parameter
affects the size and standard error of the cardinality estimates. The
cardinality operator can count the number of uid
s associated with each color
.
>>> example.select($"color", HllCardinality($"hll").alias("count")).show()
+-----|-----+
|color|count|
+-----|-----+
| red| 2|
| blue| 2|
+-----|-----+
We can also write this query in the %sql
interpreter.
%dep sql
SELECT color, hll_cardinality(hll_merge(hll)) as count
FROM example
GROUP BY color
Finally, note that the color
HLL sets have an overlapping uid
. We obtain the
count of uid
s and avoid double counting by merging the sets.
>>> example.groupBy().agg(HllCardinality(HllMerge($"hll")).alias("count")).show()
+-----+
|count|
+-----+
| 3|
+-----+
Longitudinal Examples
Introduction
The longitudinal dataset is a summary of main pings. If you're not sure which
dataset to use for your query, this is probably what you want. It differs from
the main_summary
table in two important ways:
- The longitudinal dataset groups all data for a client-id in the same row. This makes it easy to report profile level metrics. Without this deduplicating, metrics would be weighted by the number of submissions instead of by clients.
- The dataset uses a 1% of all recent profiles, which will reduce query computation time and save resources. The sample of clients will be stable over time.
Accordingly, one should prefer using the Longitudinal dataset except in the rare case where a 100% sample is strictly necessary.
As discussed in the Longitudinal Data Set Example Notebook:
The longitudinal dataset is logically organized as a table where rows
represent profiles and columns the various metrics (e.g. startup time). Each
field of the table contains a list of values, one per Telemetry submission
received for that profile. [...]
The current version of the longitudinal dataset has been build with all
main pings received from 1% of profiles across all channels with [...] up to
180 days of data.
Table structure
To get an overview of the longitudinal data table:
DESCRIBE longitudinal
That table has a row for each client, with columns for the different parts of the ping. There are a lot of fields here, so I recommend downloading the results as a CSV if you want to search through these fields. Unfortunately, there's no way to filter the output of DESCRIBE in Presto.
Because this table combines all rows for a given client id, most columns contain either Arrays or Maps (described below). A few properties are directly available to query on:
SELECT count(*) AS count
FROM longitudinal
WHERE os = 'Linux'
Arrays
Most properties are arrays, which contain one entry for each submission from a given client (newest first). Note that indexing starts at 1:
SELECT reason[1] AS newest_reason
FROM longitudinal
WHERE os = 'Linux'
To expand arrays and maps and work on the data row-wise we can use
UNNEST(array)
.
WITH lengths AS
(SELECT os, greatest(-1, least(31, sl / (24*60*60))) AS days
FROM longitudinal
CROSS JOIN UNNEST(session_length, reason) AS t(sl, r)
WHERE r = 'shutdown' OR r = 'aborted-session')
SELECT os, days, count(*) AS count
FROM lengths
GROUP BY days, os ORDER BY days ASC
However, it may be better to use a sample from the main_summary
table
instead.
Links:
Maps
Some fields like active_addons
or user_prefs
are handled as maps, on
which you can use the []
operator and special functions:
WITH adp AS
(SELECT active_addons[1]['{d10d0bf8-f5b5-c8b4-a8b2-2b9879e08c5d}']
IS NOT null AS has_adblockplus
FROM longitudinal)
SELECT has_adblockplus, count(*) AS count
FROM adp GROUP BY 1 ORDER BY 2 DESC
Links:
Sampling
While composing queries, it can be helpful to work on small samples to reduce query runtime:
SELECT * FROM longitudinal LIMIT 1000 ...
There's no need to use other sampling methods, such as TABLESAMPLE
, on
the longitudinal set. Rows are randomly ordered, so a LIMIT
sample is
expected to be random.
Example Queries
Blocklist URLs (extensions.blocklist.url
)
SELECT bl, COUNT(bl)
FROM
(SELECT element_at(settings, 1).user_prefs['extensions.blocklist.url'] AS bl
FROM longitudinal)
GROUP BY bl
Blocklist enabled/disabled (extensions.blocklist.enabled
) count:
SELECT bl, COUNT(bl)
FROM
(SELECT element_at(settings, 1).blocklist_enabled AS bl
FROM longitudinal)
GROUP BY bl
Parsing most recent submission_date
SELECT DATE_PARSE(submission_date[1], '%Y-%m-%dT00:00:00.000Z') as parsed_submission_date
FROM longitudinal
Limiting to most recent ping in the last 7 days
SELECT * FROM longitudinal
WHERE DATE_DIFF('day', DATE_PARSE(submission_date[1], '%Y-%m-%dT00:00:00.000Z'), current_date) < 7
Scalar measurement (how many users with more than 100 tabs)
WITH samples AS
(SELECT
client_id,
normalized_channel as channel,
mctc.value AS max_concurrent_tabs
FROM longitudinal
CROSS JOIN UNNEST(scalar_parent_browser_engagement_max_concurrent_tab_count) as t (mctc)
WHERE
scalar_parent_browser_engagement_max_concurrent_tab_count is not null and
mctc.value is not null and
normalized_channel = 'nightly')
SELECT approx_distinct(client_id) FROM samples WHERE max_concurrent_tabs > 100
Keyed scalars
Retrieve all the keys for a given scalar and sum all values for each key giving one row per key:
SELECT t.key as open_type,
SUM(REDUCE(t.val, 0, (s, x) -> s + COALESCE(x.value, 0), s -> s)) as open_count,
normalized_channel AS "channel::multi-filter"
FROM longitudinal
CROSS JOIN UNNEST(scalar_parent_devtools_responsive_open_trigger) AS t(key, val)
GROUP BY t.key, normalized_channel
This query also makes use of multi-filter
to show an interactive filter in
Re:dash.
This query requires a modern version of Presto, and because of this it currently with the Presto data source but it doesn't work with the Athena data source.
Using Views
If you find yourself copy/pasting SQL between different queries, consider using a Presto VIEW to allow for code reuse. Views create logical tables which you can reuse in other queries. For example, this view defines some important filters and derived variables which are then used in this downstream query.
You can define a view by prefixing your query with
CREATE OR REPLACE VIEW view_name AS ...
Be careful not to overwrite an existing view! Using a unique name is important.
Find more information here.
Working offline
It's often useful to keep a local sample of the longitudinal data when
prototyping an analysis. The data is stored in
s3://telemetry-parquet/longitudinal/
. Once you have AWS credentials you
can copy a shard of the parquet dataset to a local directory using
aws s3 cp [filename] .
To request AWS credentials, see this
page.
To initialize your AWS credentials, try aws configure
FAQ
I'm getting an error, "... cannot be resolved"
For some reason, re:dash has trouble parsing SQL strings with double quotes. Try using single quotes instead.
Other Resources
- Presto Docs
- Helpful FAQ covering Presto performance
- Longitudinal schema definition
- Custom dashboards with Re:dash
Working with Crash Pings
Here are some snippets to get you started querying crash pings from the Dataset API.
We can first load and instantiate a Dataset object to query the crash pings, and look at the possible fields to filter on:
from moztelemetry.dataset import Dataset
telem = Dataset.from_source("telemetry")
telem.schema
# => 'submissionDate, sourceName, sourceVersion, docType, appName, appUpdateChannel,
# appVersion, appBuildId'
The more specific these filters, the faster it can be pulled.
The fields can be filtered by either value or a callable.
For example, a version and date range can be specified from the
v5758
and dates
lambdas below:
v5758 = lambda x: x[:2] in ('57', '58')
dates = lambda x: '20180126' <= x <= '20180202'
telem = (
Dataset.from_source("telemetry")
.where(docType='crash', appName="Firefox", appUpdateChannel="release",
appVersion=v5758, submissionDate=dates)
)
Now, referencing the docs for the crash ping, the desired fields
can be selected and brought in as a spark RDD named pings
sel = (
telem.select(
os_name='environment.system.os.name',
os_version='environment.system.os.version',
app_version='application.version',
app_architecture='application.architecture',
clientId='clientId',
creationDate='creationDate',
submissionDate='meta.submissionDate',
sample_id='meta.sampleId',
modules='payload.stackTraces.modules',
stackTraces='payload.stackTraces',
oom_size='payload.metadata.OOMAllocationSize',
AvailablePhysicalMemory='payload.metadata.AvailablePhysicalMemory',
AvailableVirtualMemory='payload.metadata.AvailableVirtualMemory',
TotalPhysicalMemory='payload.metadata.TotalPhysicalMemory',
TotalVirtualMemory='payload.metadata.TotalVirtualMemory',
reason='payload.metadata.MozCrashReason',
payload='payload',
)
)
pings = sel.records(sc)
Creating a Real-time Analysis Plugin
Getting Started
Creating an analysis plugin consists of three steps:
-
Writing a message matcher
The message matcher allows one to select specific data from the data stream.
-
Writing the analysis code/business logic
The analysis code allows one to aggregate, detect anomalies, apply machine learning algorithms etc.
-
Writing the output code
The output code allows one to structure the analysis results in an easy to consume format.
Step by Step Setup
-
Go to the CEP site: https://pipeline-cep.prod.mozaws.net/
-
Login/Register using your Google
@mozilla.com
account -
Click on the
Plugin Deployment
tab -
Create a message matcher
-
Edit the
message_matcher
variable in theHeka Analysis Plugin Configuration
text area. For this example we are selecting all telemetry messages. The full syntax of the message matcher can be found here: http://mozilla-services.github.io/lua_sandbox/util/message_matcher.htmlmessage_matcher = "Type == 'telemetry'"
-
-
Test the message matcher
-
Click the
Run Matcher
button.Your results or error message will appear to the right. You can browse the returned messages to examine their structure and the data they contain; this is very helpful when developing the analysis code but is also useful for data exploration even when not developing a plugin.
-
-
Delete the code in the
Heka Analysis Plugin
text area -
Create the Analysis Code (
process_message
)The
process_message
function is invoked every time a message is matched and should return 0 for success and -1 for failure. Full interface documentation: http://mozilla-services.github.io/lua_sandbox/heka/analysis.html-
Here is the minimum implementation; type it into the
Heka Analysis Plugin
text area:function process_message() return 0 -- success end
-
-
Create the Output Code (
timer_event
)The
timer_event
function is invoked everyticker_interval
seconds.-
Here is the minimum implementation; type it into the
Heka Analysis Plugin
text area:function timer_event() end
-
-
Test the Plugin
-
Click the
Test Plugin
button.Your results or error message will appear to the right. If an error is output, correct it and test again.
-
-
Extend the Code to Perform a Simple Message Count Analysis/Output
-
Replace the code in the
Heka Analysis Plugin
text area with the following:local cnt = 0 function process_message() cnt = cnt + 1 -- count the number of messages that matched return 0 end function timer_event() inject_payload("txt", "types", cnt) -- output the count end
-
-
Test the Plugin
-
Click the
Test Plugin
button.Your results or error message will appear to the right. If an error is output, correct it and test again.
-
-
Extend the Code to Perform a More Complex Count by Type Analysis/Output
-
Replace the code in the
Heka Analysis Plugin
text area with the following:types = {} function process_message() -- read the docType from the message, if it doesn't exist set it to "unknown" local dt = read_message("Fields[docType]") or "unknown" -- look up the docType in the types hash local cnt = types[dt] if cnt then types[dt] = cnt + 1 -- if the type cnt exists, increment it by one else types[dt] = 1 -- if the type cnt didn't exist, initialize it to one end return 0 end function timer_event() add_to_payload("docType = Count\n") -- add a header to the output for k, v in pairs(types) do -- iterate over all the key/values (docTypes/cnt in the hash) add_to_payload(k, " = ", v, "\n") -- add a line to the output end inject_payload("txt", "types") -- finalize all the data written to the payload end
-
-
Test the Plugin
-
Click the
Test Plugin
button.Your results or error message will appear to the right. If an error is output, correct it and test again.
-
-
Deploy the plugin
- Click the
Deploy Plugin
button and dismiss the successfully deployed dialog.
- Click the
-
View the running plugin
- Click the
Plugins
tab and look for the plugin that was just deployed{user}.example
- Right click on the plugin to active the context menu allowing you to view the source or stop the plugin.
- Click the
-
View the plugin output
- Click on the
Dashboards
tab - Click on the
Raw Dashboard Output
link - Click on
analysis.{user}.example.types.txt
link
- Click on the
Where to go from here
- Lua Reference: http://www.lua.org/manual/5.1/manual.html
- Available Lua Modules: https://mozilla-services.github.io/lua_sandbox_extensions/
- Support
- IRC: #hindsight on
irc.mozilla.org
- Mailing list: https://mail.mozilla.org/listinfo/hindsight
- IRC: #hindsight on
See My Pings
So you want to see what you're sending the telemetry pipeline, huh? Well follow these steps and we'll have you reading some JSON in no time.
For a more thorough introduction, see Creating a Real-Time Analysis Plugin Cookbook.
Steps to Create a Viewing Output
-
Get your
clientId
from whatever product you're using. For desktop, it's available inabout:telemetry
. -
Go to the CEP site: https://pipeline-cep.prod.mozaws.net/
-
Login/Register using your Google
@mozilla.com
account -
Click on the "Analysis Plugin Deployment" tab
-
Under "Heka Analysis Plugin Configuration", put the following config:
filename = '<your_name>_<product>_pings.lua'
message_matcher = 'Type == "telemetry" && Fields[docType] == "<doctype>" && Fields[clientId] == "<your_client_id>"'
preserve_data = false
ticker_interval = 60
Where <product>
is whatever product you're testing, and <doctype>
is whatever ping you're testing (e.g. main
, core
, mobile-event
, etc.).
- Under "Heka Analysis Plugin" put the following. This will, by default, show the most recent 10 pings that match your
clientId
on the specifieddocType
.
NOTE: If you are looking at main
, saved-session
, or crash
pings, the submitted data is split out into several pieces. Reading just Fields[submission]
will not give you the entire submitted ping contents. You can change that to e.g. Fields[environment.system]
, Fields[payload.histograms]
, Fields[payload.keyedHistograms]
.
To see all of the available fields, look at a ping in the Matcher tab.
require "string"
require "table"
output = {}
max_len = 10
cur_ind = 1
function process_message()
output[cur_ind] = read_message("Fields[submission]")
cur_ind = cur_ind + 1
if cur_ind > max_len then
cur_ind = 1
end
return 0
end
function timer_event(ns, shutdown)
local res = table.concat(output, ",")
add_to_payload("[" .. res .. "]")
inject_payload("json")
end
-
Click "Run Matcher", then "Test Plugin". Check that no errors appear in "Debug Output"
-
Click "Deploy Plugin". Your output will be available at
https://pipeline-cep.prod.mozaws.net/dashboard_output/analysis.<username>_mozilla_com.<your_name>_<product>_pings..json
CEP Matcher
The CEP Matcher tab lets you easily view some current pings of any ping type. To access it, follow
these first few directions for accessing the CEP. Once there,
click on the "Matcher" tab. The message-matcher is set by default to TRUE
, meaning all pings will
be matched. Click "Run Matcher" and a few pings will show up.
Editing the Message Matcher
Changing the message matcher will filter down the accepted pings, letting you hone in on a certain type.
Generally, you can filter on any fields in a ping. For example, docType
:
Fields[docType] == "main"
Or OS:
Fields[os] == "Android"
We can also combine matchers together:
Fields[docType] == "core" && Fields[os] == "Android" && Fields[appName] == "Focus"
Note that most of the time, you want just proper telemetry pings, so include this in your matcher:
Type == "telemetry"
Which would get us a sample of Focus Android core pings.
The Message Matcher documentation has more information on the syntax.
To see the available fields that you can filter on for any docType
, see this document.
For example, look under the telemetry
top-level field at system-addon-deployment-diagnostics
. The available fields to filter on are:
required binary Logger;
required fixed_len_byte_array(16) Uuid;
optional int32 Pid;
optional int32 Severity;
optional binary EnvVersion;
required binary Hostname;
required int64 Timestamp;
optional binary Payload;
required binary Type;
required group Fields {
required binary submission;
required binary Date;
required binary appUpdateChannel;
required double sourceVersion;
required binary documentId;
required binary docType;
required binary os;
optional binary environment.addons;
optional binary DNT;
required binary environment.partner;
required binary sourceName;
required binary appVendor;
required binary environment.profile;
required binary environment.settings;
required binary normalizedChannel;
required double sampleId;
required binary Host;
required binary geoCountry;
required binary geoCity;
required boolean telemetryEnabled;
required double creationTimestamp;
required binary appVersion;
required binary appBuildId;
required binary environment.system;
required binary environment.build;
required binary clientId;
required binary submissionDate;
required binary appName;
}
So, for example, you could have a message matcher like:
Type == "telemetry" && Fields[geoCountry] == "US"
Active DAU and Active MAU
An Active User is defined as a client who has total_daily_uri
>= 5 URI for a given date.
- Dates are defined by
submission_date_s3
. - A client's
total_daily_uri
is defined as their sum ofscalar_parent_browser_engagement_total_uri_count
for a given date1.
Active DAU is the number of Active Users on a given day.
Active MAU is the number of unique clients who have been an Active User on any day in the last 28 days. In other words, any client that contributes to Active DAU in the last 28 days would also contribute to Active MAU for that day. Note that this is not simply the sum of Active DAU over 28 days, since any particular client could be active on many days.
For quick analysis, using clients_daily_v6
is recommended. Below is an example query for getting Active DAU (aDAU) using clients_daily_v6
.
SELECT
submission_date_s3,
count(*) AS total_clients_cdv6
FROM
clients_daily_v6
WHERE
scalar_parent_browser_engagement_total_uri_count_sum >= 5
GROUP BY
1
ORDER BY
1 ASC
main_summary
can also be used for getting aDAU. Below is an example query using a 1% sample over March 2018 using main_summary
:
SELECT
submission_date_s3,
count(DISTINCT client_id) * 100 as aDAU
FROM
(SELECT
submission_date_s3,
client_id,
sum(coalesce(scalar_parent_browser_engagement_total_uri_count, 0)) as total_daily_uri
FROM
main_summary
WHERE
sample_id = '51'
AND submission_date_s3 >= '20180301'
AND submission_date_s3 < '20180401'
GROUP BY
1, 2) as daily_clients_table
WHERE
total_daily_uri >= 5
GROUP BY
1
ORDER BY
1 ASC
client_count_daily
can be used to get approximate aDAU. This dataset uses HyperLogLog to estimate unique counts. For example:
SELECT
submission_date AS day,
cardinality(merge(cast(hll AS HLL))) AS active_dau
FROM client_count_daily
WHERE
total_uri_count_threshold >= 5
-- Limit to 7 days of history
AND submission_date >= date_format(CURRENT_DATE - INTERVAL '7' DAY, '%Y%m%d')
GROUP BY 1
ORDER BY 1
1: Note, the probe measuring scalar_parent_browser_engagement_total_uri_count
only exists in clients with Firefox 50 and up. Clients on earlier versions of Firefox won't be counted as an Active User (regardless of their use). Similarly, scalar_parent_browser_engagement_total_uri_count
doesn't increment when a client is in Private Browsing mode, so that won't be included as well.
Authored by the Product Data Science Team. Please direct questions/concerns to Ben Miroglio (bmiroglio
).
Retention
Retention measures the rate at which users are continuing to use Firefox, making it one of the more important metrics we track. We commonly measure retention between releases, experiment cohorts, and various Firefox subpopulations to better understand how a change to the user experience or use of a specific feature affect behavior.
N Week Retention
Time is an embedded component of retention. Most retention analysis starts with some anchor, or action that is associated with a date (experiment enrollment date, profile creation date, button clicked on date d, etc.). We then look 1, 2, …, N weeks beyond the anchor to see what percent of users have submitted a ping (signaling their continued use of Firefox).
For example, let’s say we are calculating retention for new Firefox users. Each user can then be anchored by their profile_creation_date
, and we can count the number of users who submitted a ping between 7-13 days after profile creation (1 Week retention), 14-20 days after profile creation (2 Week Retention), etc.
Example Methodology
Given a dataset in Spark, we can construct a field retention_period
that uses submission_date_s3
to determine the period to which a ping belongs (i.e. if a user created their profile on April 1st, all pings submitted between April 8th and April 14th are assigned to week 1). 1-week retention can then be simplified to the percent of users with a 1 value for retention_period
, 2-week retention simplifies to the percent of users with a 2 value for retention_period
, ..., etc. Note that each retention period is independent of the others, so it is possible to have higher 2-week retention than 1-week retention (especially during holidays).
First let's map 1, 2, ..., N week retention the the amount of days elapsed after the anchor point:
PERIODS = {}
N_WEEKS = 6
for i in range(1, N_WEEKS + 1):
PERIODS[i] = {
'start': i * 7,
'end': i * 7 + 6
}
Which gives us
{1: {'end': 13, 'start': 7},
2: {'end': 20, 'start': 14},
3: {'end': 27, 'start': 21},
4: {'end': 34, 'start': 28},
5: {'end': 41, 'start': 35},
6: {'end': 48, 'start': 42}}
Next, let's define some helper functions:
import datetime as dt
import pandas as pd
import pyspark.sql.types as st
import pyspark.sql.functions as F
udf = F.udf
def date_diff(d1, d2, fmt='%Y%m%d'):
"""
Returns days elapsed from d2 to d1 as an integer
Params:
d1 (str)
d2 (str)
fmt (str): format of d1 and d2 (must be the same)
>>> date_diff('20170205', '20170201')
4
>>> date_diff('20170201', '20170205)
-4
"""
try:
return (pd.to_datetime(d1, format=fmt) -
pd.to_datetime(d2, format=fmt)).days
except:
return None
@udf(returnType=st.IntegerType())
def get_period(anchor, submission_date_s3):
"""
Given an anchor and a submission_date_s3,
returns what period a ping belongs to. This
is a spark UDF.
Params:
anchor (col): anchor date
submission_date_s3 (col): a ping's submission_date to s3
Global:
PERIODS (dict): defined globally based on n-week method
Returns an integer indicating the retention period
"""
if anchor is not None:
diff = date_diff(submission_date_s3, anchor)
if diff >= 7: # exclude first 7 days
for period in sorted(PERIODS):
if diff <= PERIODS[period]['end']:
return period
@udf(returnType=st.StringType())
def from_unixtime_handler(ut):
"""
Converts unix time (in days) to a string in %Y%m%d format.
This is a spark UDF.
Params:
ut (int): unix time in days
Returns a date as a string if it is parsable by datetime, otherwise None
"""
if ut is not None:
try:
return (dt.datetime.fromtimestamp(ut * 24 * 60 * 60).strftime("%Y%m%d"))
except:
return None
Now we can load in a subset of main_summary
and construct the necessary fields for retention calculations:
ms = spark.sql("""
SELECT
client_id,
submission_date_s3,
profile_creation_date,
os
FROM main_summary
WHERE
submission_date_s3 >= '20180401'
AND submission_date_s3 <= '20180603'
AND sample_id = '42'
AND app_name = 'Firefox'
AND normalized_channel = 'release'
AND os in ('Darwin', 'Windows_NT', 'Linux')
""")
PCD_CUTS = ('20180401', '20180415')
ms = (
ms.withColumn("pcd", from_unixtime_handler("profile_creation_date")) # i.e. 17500 -> '20171130'
.filter("pcd >= '{}'".format(PCD_CUTS[0]))
.filter("pcd <= '{}'".format(PCD_CUTS[1]))
.withColumn("period", get_period("pcd", "submission_date_s3"))
)
Note that we filter to profiles that were created in the first half of April so that we have sufficient time to observe 6 weeks of behavior. Now we can calculate retention!
os_counts = (
ms
.groupby("os")
.agg(F.countDistinct("client_id").alias("total_clients"))
)
weekly_counts = (
ms
.groupby("period", "os")
.agg(F.countDistinct("client_id").alias("n_week_clients"))
)
retention_by_os = (
weekly_counts
.join(os_counts, on='os')
.withColumn("retention", F.col("n_clients") / F.col("total_count"))
)
Peeking at 6-Week Retention
retention_by_os.filter("period = 6").show()
+----------+------+--------------+-------------+-------------------+
| os|period|n_week_clients|total_clients| retention|
+----------+------+--------------+-------------+-------------------+
| Linux| 6| 1495| 22422|0.06667558647756668|
| Darwin| 6| 1288| 4734|0.27207435572454586|
|Windows_NT| 6| 29024| 124872|0.23243000832852842|
+----------+------+--------------+-------------+-------------------+
we observe that 6.7% of Linux users whose profile was created in the first half of April submitted a ping 6 weeks later, and so forth. The example code snippets are consolidated in this notebook.
New vs. Existing User Retention
The above example calculates New User Retention, which is distinct from Existing User Retention. This distinction is important when understanding retention baselines (i.e. does this number make sense?). Existing users typically have much higher retention numbers than new users.
Note that is more common in industry to refer to Existing User Retention as "Churn" (Churn = 1 - Retention), however, we use retention across the board for the sake of consistency and interpretability.
Please be sure to specify whether or not your retention analysis is for new or existing users.
What If There's No Anchor Point?
Sometimes there isn't a clear anchor point like profile_creation_date
or enrollment_date
.
For example, imagine you are tasked with reporting retention numbers for users that enabled sync (sync_configured
) compared to users that haven't. Being a boolean pref, there is no straightforward way to determine when sync_enabled
flipped from false
to true
aside from looking at a client's entire history (which is not recommended!). What now?
We can construct an artificial anchor point using fixed weekly periods; the retention concepts then remain unchanged. The process can be summarized by the following steps:
- Define a baseline week cohort
- For this example let's define the baseline as users that submitted pings between 2018-01-01 and 2018-01-07
- Count all users with/without sync enabled in this period
- Assign these users to an anchor point of 2018-01-01 (the beginning of the baseline week)
- Count the number of users in the baseline week that submitted a ping between 7-13 days after 2018-01-01 (1 Week retention), 14-20 days after 2018-01-01 (2 Week Retention), etc.
- Shift the baseline week up 7 days (and all other dates) and repeat as necessary
This method is also valid in the presence of an anchor point, however, it is recommended the anchor point method is employed when possible.
Confounding Factors
When performing retention analysis between two or more groups, it is important to look at other usage metrics to get an understanding of other influential factors.
For example (borrowing the sync example from the previous section) you find that users with and without sync have a 1 week retention of 0.80 and 0.40, respectively. Wow--we should really be be promoting sync as it could double retention numbers!
Not quite. Turns out you next look at active_ticks
and total_uri_count
and find that sync users report much higher numbers for these measures as well. Now how can we explain this difference in retention?
There could be an entirely separate cookbook devoted to answering this question, however this contrived example is meant to demonstrate that simply comparing retention numbers between two groups isn't capturing the full story. Sans an experiment or model-based approach, all we can say is "enabling sync is associated with higher retention numbers." There is still value in this assertion, however it should be stressed that association/correlation != causation!
Dataset Reference
After completing Choosing a Dataset you should have a high level understanding of what questions each dataset is able to answer. This section contains references that focus on a single dataset each. Reading this section front to back is not recommended. Instead, identify a dataset you'd like to understand better and read through the relevant documentation. After reading the tutorial, you should know all you need about the dataset.
Each tutorial should include:
- Introduction
- A short overview of why we built the dataset and what need it's meant to solve
- What data source the data is collected from, and a high level overview of how the data is organized
- How it is stored and how to access the data including
- whether the data is available in re:dash
- S3 paths
- Reference
- An example query to give the reader an idea of what the data looks like and how it is meant to be used
- How the data is processed and sampled
- How frequently it's updated, and how it's scheduled
- An up-to-date schema for the dataset
- How to augment or modify the dataset
Raw Ping Data
Introduction
We receive data from our users via pings. There are several types of pings, each containing different measurements and sent for different purposes. To review a complete list of ping types and their schemata, see this section of the Mozilla Source Tree Docs.
Many pings are also described by a JSONSchema specification which can be found in this repository.
Background and Caveats
The large majority of analyses can be completed using only the main ping. This ping includes histograms, scalars, events, and other performance and diagnostic data.
Few analyses actually rely directly on the raw ping data. Instead, we provide derived datasets which are processed versions of these data, made to be:
- Easier and faster to query
- Organized to make the data easier to analyze
- Cleaned of erroneous or misleading data
Before analyzing raw ping data, check to make sure there isn't already a derived dataset made for your purpose. If you do need to work with raw ping data, be aware that loading the data can take a while. Try to limit the size of your data by controlling the date range, etc.
Accessing the Data
You can access raw ping data from an ATMO cluster using the Dataset API. Raw ping data are not available in re:dash.
Further Reading
You can find the complete ping documentation. To augment our data collection, see Collecting New Data and the Data Collection Policy.
Data Reference
You can find the reference documentation for all ping types here.
Derived Datasets
See Choosing a Dataset for a discussion on the differences between pings and derived datasets.
Addons Datasets
Introduction
This is a work in progress. The work is being tracked here.
Data Reference
Example Queries
Sampling
It contains one or more records for every
Main Summary
record that contains a non-null value for client_id
.
Each Addons record contains info for a single addon,
or if the main ping did not contain any active addons,
there will be a row with nulls for all the addon fields
(to identify client_id
s/records without any addons).
Like the Main Summary dataset, No attempt is made to de-duplicate submissions by documentId
, so any analysis that could be affected by duplicate records should take care to remove duplicates using the documentId
field.
Scheduling
This dataset is updated daily via the telemetry-airflow infrastructure. The job DAG runs every day after the Main Summary data has been generated. The DAG is here.
Schema
As of 2017-03-16, the current version of the addons
dataset is v2
,
and has a schema as follows:
root
|-- document_id: string (nullable = true)
|-- client_id: string (nullable = true)
|-- subsession_start_date: string (nullable = true)
|-- normalized_channel: string (nullable = true)
|-- addon_id: string (nullable = true)
|-- blocklisted: boolean (nullable = true)
|-- name: string (nullable = true)
|-- user_disabled: boolean (nullable = true)
|-- app_disabled: boolean (nullable = true)
|-- version: string (nullable = true)
|-- scope: integer (nullable = true)
|-- type: string (nullable = true)
|-- foreign_install: boolean (nullable = true)
|-- has_binary_components: boolean (nullable = true)
|-- install_day: integer (nullable = true)
|-- update_day: integer (nullable = true)
|-- signed_state: integer (nullable = true)
|-- is_system: boolean (nullable = true)
|-- submission_date_s3: string (nullable = true)
|-- sample_id: string (nullable = true)
For more detail on where these fields come from in the
raw data,
please look
in the AddonsView
code.
The fields are all simple scalar values.
Churn
Introduction
The churn dataset tracks the 7-day churn rate of telemetry profiles. This dataset is generally used for analyzing cohort churn across segments and time.
Content
Churn is the rate of attrition defined by (clients seen in week N)/(clients seen in week 0)
for groups of clients with some shared attributes. A group of clients with
shared attributes is called a cohort. The cohorts in this dataset are created
every week and can be tracked over time using the acquisition_date
and the
weeks since acquisition or current_week
.
The following example demonstrates the current logic for generating this dataset. Each column represents the days since some arbitrary starting date.
client | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 | 12 | 13 | 14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A | X | X | |||||||||||||
B | X | X | X | X | X | X | |||||||||
C | X | X |
All three clients are part of the same cohort. Client A is retained for weeks 0 and 1 since there is activity in both periods. A client only needs to show up once in the period to be counted as retained. Client B is acquired in week 0 and is active frequently but does not appear in following weeks. Client B is considered churned on week 1. However, a client that is churned can become retained again. Client C is considered churned on week 1 but retained on week 3.
The following table summarizes the above daily activity into the following view where every column represents the current week since acquisition date..
client | 0 | 1 | 2 |
---|---|---|---|
A | X | X | |
B | X | ||
C | X | X |
The clients are then grouped into cohorts by attributes. An attribute describes a property about the cohort such as the country of origin or the binary distribution channel. Each group also contains descriptive aggregates of engagement. Each metric describes the activity of a cohort such as size and overall usage at a given time instance.
Background and Caveats
- Each row in this dataset describes a unique segment of users
- The number of rows is exponential with the number of dimensions
- New fields should be added sparing to account for data-set size
- The dataset lags by 10 days in order account for submission latency
- This value was determined to be time for 99% of main pings to arrive at the
server. With the shutdown-ping sender, this has been reduced to 4 days.
However,
churn_v3
still tracks releases older than Firefox 55.
- This value was determined to be time for 99% of main pings to arrive at the
server. With the shutdown-ping sender, this has been reduced to 4 days.
However,
- The start of the period is fixed to Sundays. Once it has been aggregated, the
period cannot be shifted due to the way clients are counted.
- A supplementary 1-day
retention
dataset using HyperLogLog for client counts is available for counting over arbitrary retention periods and date offsets. Additionally, calculating churn or retention over specific cohorts is tractable in STMO withmain_summary
orclients_daily
datasets.
- A supplementary 1-day
Accessing the Data
churn
is available in Re:dash under Athena and Presto. The data is also
available in parquet for consumption by columnar data engines at
s3://telemetry-parquet/churn/v3
.
Data Reference
Example Queries
This section walks through a typical query to generate data suitable for visualization.
field | type | description |
---|---|---|
cohort_date | common, attribute | The start date bucket of the cohort. This is week the client was acquired. |
elapsed_periods | common, attribute | The number of periods that have elapsed since the cohort date. In this dataset, the retention period is 7 days. |
channel | attribute | Part of the release train model. An attribute that distinguishes cohorts. |
geo | filter attribute | Country code. Used to filter out all countries other than the 'US' |
n_profiles | metric | Count of users in a cohort. Use sum to aggregate. |
First the fields are extracted and aliased for consistency. cohort_date
and
elapsed_periods
are common to most retention queries and are useful concepts
for building on other datasets.
WITH extracted AS (
SELECT acquisition_period AS cohort_date,
current_week AS elapsed_periods,
n_profiles,
channel,
geo
FROM churn
),
The extracted table is filtered down to the attributes of interest. The cohorts
of interest originate in the US and are in the release or beta channels. Note
that channel
here is the concatenation of the normalized channel and the
funnelcake id. Only cohorts appearing after August 6, 2017 are chosen to be in
this population.
population AS (
SELECT channel,
cohort_date,
elapsed_periods,
n_profiles
FROM extracted
WHERE geo = 'US'
AND channel IN ('release', 'beta')
AND cohort_date > '20170806'
-- filter out noise from clients with incorrect dates
AND elapsed_periods >= 0
AND elapsed_periods < 12
),
The number of profiles is aggregated by the cohort dimensions. The cohort acquisition date and elapsed periods since acquisition are fundamental to cohort analysis.
cohorts AS (
SELECT channel,
cohort_date,
elapsed_periods,
sum(n_profiles) AS n_profiles
FROM population
GROUP BY 1, 2, 3
),
The table will have the following structure. The table is sorted by the first three columns for demonstration.
channel | cohort_date | elapsed_periods | n_profiles |
---|---|---|---|
release | 20170101 | 0 | 100 |
release | 20170101 | 1 | 90 |
release | 20170101 | 2 | 80 |
... | ... | ... | ... |
beta | 20170128 | 10 | 25 |
Finally, retention is calculated through the number of profiles at the time of
the elapsed_period
relative to the initial period. This data can be imported
into a pivot table for further analysis.
results AS (
SELECT c.*,
iv.n_profiles AS total_n_profiles,
(0.0+c.n_profiles)*100/iv.n_profiles AS percentage_n_profiles
FROM cohorts c
JOIN (
SELECT *
FROM cohorts
WHERE elapsed_periods = 0
) iv ON (
c.cohort_date = iv.cohort_date
AND c.channel = iv.channel
)
)
channel | cohort_date | elapsed_periods | n_profiles | total_n_profiles | percentage_n_profiles |
---|---|---|---|---|---|
release | 20170101 | 0 | 100 | 100 | 1.0 |
release | 20170101 | 1 | 90 | 100 | 0.9 |
release | 20170101 | 2 | 80 | 100 | 0.8 |
... | .... | ... | ... | ... | ... |
beta | 20170128 | 10 | 25 | 50 | 0.5 |
Obtain the results.
SELECT *
FROM results
You may consider visualizing using cohort graphs, line charts, or a pivot tables. See Firefox Telemetry Retention: Dataset Example Usage for more examples.
Scheduling
The aggregated churn data is updated weekly on Wednesday.
Schema
As of 2017-10-15, the current version of churn
is v3
and has a schema as follows:
root
|-- channel: string (nullable = true)
|-- geo: string (nullable = true)
|-- is_funnelcake: string (nullable = true)
|-- acquisition_period: string (nullable = true)
|-- start_version: string (nullable = true)
|-- sync_usage: string (nullable = true)
|-- current_version: string (nullable = true)
|-- current_week: long (nullable = true)
|-- source: string (nullable = true)
|-- medium: string (nullable = true)
|-- campaign: string (nullable = true)
|-- content: string (nullable = true)
|-- distribution_id: string (nullable = true)
|-- default_search_engine: string (nullable = true)
|-- locale: string (nullable = true)
|-- is_active: string (nullable = true)
|-- n_profiles: long (nullable = true)
|-- usage_hours: double (nullable = true)
|-- sum_squared_usage_hours: double (nullable = true)
|-- total_uri_count: long (nullable = true)
|-- unique_domains_count_per_profile: double (nullable = true)
Code Reference
The script for generating churn
currently lives in
mozilla/python_mozetl
. The job can
be found in
mozetl/engagement/churn
.
Client Count Daily Reference
This document is a work in progress. The work is being tracked here.
Introduction
The client_count_daily
dataset is useful for estimating user counts over a few
pre-defined dimensions.
The client_count_daily
dataset is similar to the deprecated
client_count
dataset
except that is aggregated by submission date and not activity date.
Content
This dataset includes columns for a dozen factors and an HLL variable.
The hll
column contains a
HyperLogLog
variable, which is an approximation to the exact count.
The factor columns include submission date and the dimensions listed
here.
Each row represents one combinations of the factor columns.
Background and Caveats
It's important to understand that the hll
column is not a standard count.
The hll
variable avoids double-counting users when aggregating over multiple days.
The HyperLogLog variable is a far more efficient way to count distinct elements of a set,
but comes with some complexity.
To find the cardinality of an HLL use cardinality(cast(hll AS HLL))
.
To find the union of two HLL's over different dates, use merge(cast(hll AS HLL))
.
The Firefox ER Reporting Query
is a good example to review.
Finally, Roberto has a relevant write-up
here.
Accessing the Data
The data is available in Re:dash. Take a look at this example query.
I don't recommend accessing this data from ATMO.
Further Reading
Data Reference
Example Queries
Compute DAU for non-windows clients for the last week
WITH sample AS (
SELECT
os,
submission_date,
cardinality(merge(cast(hll AS HLL))) AS count
FROM client_count_daily
WHERE submission_date >= DATE_FORMAT(CURRENT_DATE - INTERVAL '7' DAY, '%Y%m%d')
GROUP BY
submission_date,
os
)
SELECT
os,
-- formatting date as late as possible improves performance dramatically
date_parse(submission_date, '%Y%m%d') AS submission_date,
count
FROM sample
WHERE
count > 10 -- remove outliers
AND lower(os) NOT LIKE '%windows%'
ORDER BY
os,
submission_date DESC
Compute WAU by Channel for the last week
WITH dau AS (
SELECT
normalized_channel,
submission_date,
merge(cast(hll AS HLL)) AS hll
FROM client_count_daily
-- 2 days of lag, 7 days of results, and 6 days preceding for WAU
WHERE submission_date > DATE_FORMAT(CURRENT_DATE - INTERVAL '15' DAY, '%Y%m%d')
GROUP BY
submission_date,
normalized_channel
),
wau AS (
SELECT
normalized_channel,
submission_date,
cardinality(merge(hll) OVER (
PARTITION BY normalized_channel
ORDER BY submission_date
ROWS BETWEEN 6 PRECEDING AND 0 FOLLOWING
)) AS count
FROM dau
)
SELECT
normalized_channel,
-- formatting date as late as possible improves performance dramatically
date_parse(submission_date, '%Y%m%d') AS submission_date,
count
FROM wau
WHERE
count > 10 -- remove outliers
AND submission_date > DATE_FORMAT(CURRENT_DATE - INTERVAL '9' DAY, '%Y%m%d') -- only days that have a full WAU
Caveats
The hll
column does not product an exact count. hll
stands for
HyperLogLog, a sophisticated
algorithm that allows for the counting of extremely high numbers of items,
sacrificing a small amount of accuracy in exchange for using much less memory
than a simple counting structure.
When count is calculated over a column that may change over time, such as
total_uri_count_threshold
, then a client would be counted in every group
where they appear. Over longer windows, like MAU, this is more likely to occur.
Scheduling
This dataset is updated daily via the
telemetry-airflow infrastructure.
The job runs as part of the main_summary
DAG.
Schema
The data is partitioned by submission_date
which is formatted as %Y%m%d
,
like 20180130
.
As of 2018-03-15, the current version of the client_count_daily
dataset
is v2
, and has a schema as follows:
root
|-- app_name: string (nullable = true)
|-- app_version: string (nullable = true)
|-- country: string (nullable = true)
|-- devtools_toolbox_opened: boolean (nullable = true)
|-- e10s_enabled: boolean (nullable = true)
|-- hll: binary (nullable = true)
|-- locale: string (nullable = true)
|-- normalized_channel: string (nullable = true)
|-- os: string (nullable = true)
|-- os_version: string (nullable = true)
|-- top_distribution_id: string (nullable = true)
|-- total_uri_count_threshold: integer (nullable = true)
The client_count
dataset is deprecated in favor of
client_count_daily
,
which is aggregated by submission date instead of activity date.
Clients Daily
Introduction
The clients_daily
table is intended as the first stop for asking questions
about how people use Firefox. It should be easy to answer simple questions.
Each row in the table is a (client_id
, submission_date
) and contains a
number of aggregates about that day's activity.
Contents
Many questions about Firefox take the form "What did clients with
characteristics X, Y, and Z do during the period S to E?" The
clients_daily
table is aimed at answer those questions.
Accessing the Data
The data is stored as a parquet table in S3 at the following address.
s3://telemetry-parquet/clients_daily/v6/
The clients_daily
table is accessible through re:dash using the Athena
data source. It is also available via the Presto
data source, though
Athena
should be preferred for performance and stability reasons.
Here's an example query.
Data Reference
Example Queries
Compute Churn for a one-day cohort:
SELECT
date_parse(submission_date_s3, '%Y%m%d') AS submission_date_s3,
approx_distinct(client_id) AS cohort_dau
FROM clients_daily
WHERE
submission_date_s3 > '20170831'
AND submission_date_s3 < '20171001'
AND profile_creation_date LIKE '2017-09-01%'
GROUP BY 1
ORDER BY 1
Distribution of pings per client per day:
SELECT
normalized_channel,
CASE
WHEN pings_aggregated_by_this_row > 50 THEN 50
ELSE pings_aggregated_by_this_row
END AS pings_per_day,
approx_distinct(client_id) AS client_count
FROM clients_daily
WHERE
submission_date_s3 = '20170901'
AND normalized_channel <> 'Other'
GROUP BY
1,
2
ORDER BY
2,
1
Scheduling
This dataset is updated daily via the
telemetry-airflow infrastructure.
The job runs as part of the main_summary
DAG.
Schema
The data is partitioned by submission_date_s3
which is formatted as %Y%m%d
,
like 20180130
.
As of 2018-05-16, the current version of the clients_daily
dataset is v6
, and has a schema as follows:
root
|-- client_id: string (nullable = true)
|-- aborts_content_sum: long (nullable = true)
|-- aborts_gmplugin_sum: long (nullable = true)
|-- aborts_plugin_sum: long (nullable = true)
|-- active_addons_count_mean: double (nullable = true)
|-- active_experiment_branch: string (nullable = true)
|-- active_experiment_id: string (nullable = true)
|-- active_hours_sum: double (nullable = true)
|-- addon_compatibility_check_enabled: boolean (nullable = true)
|-- app_build_id: string (nullable = true)
|-- app_display_version: string (nullable = true)
|-- app_name: string (nullable = true)
|-- app_version: string (nullable = true)
|-- blocklist_enabled: boolean (nullable = true)
|-- channel: string (nullable = true)
|-- city: string (nullable = true)
|-- client_clock_skew_mean: double (nullable = true)
|-- client_submission_latency_mean: double (nullable = true)
|-- country: string (nullable = true)
|-- cpu_cores: integer (nullable = true)
|-- cpu_count: integer (nullable = true)
|-- cpu_family: integer (nullable = true)
|-- cpu_l2_cache_kb: integer (nullable = true)
|-- cpu_l3_cache_kb: integer (nullable = true)
|-- cpu_model: integer (nullable = true)
|-- cpu_speed_mhz: integer (nullable = true)
|-- cpu_stepping: integer (nullable = true)
|-- cpu_vendor: string (nullable = true)
|-- crashes_detected_content_sum: long (nullable = true)
|-- crashes_detected_gmplugin_sum: long (nullable = true)
|-- crashes_detected_plugin_sum: long (nullable = true)
|-- crash_submit_attempt_content_sum: long (nullable = true)
|-- crash_submit_attempt_main_sum: long (nullable = true)
|-- crash_submit_attempt_plugin_sum: long (nullable = true)
|-- crash_submit_success_content_sum: long (nullable = true)
|-- crash_submit_success_main_sum: long (nullable = true)
|-- crash_submit_success_plugin_sum: long (nullable = true)
|-- default_search_engine: string (nullable = true)
|-- default_search_engine_data_load_path: string (nullable = true)
|-- default_search_engine_data_name: string (nullable = true)
|-- default_search_engine_data_origin: string (nullable = true)
|-- default_search_engine_data_submission_url: string (nullable = true)
|-- devtools_toolbox_opened_count_sum: long (nullable = true)
|-- distribution_id: string (nullable = true)
|-- e10s_enabled: boolean (nullable = true)
|-- env_build_arch: string (nullable = true)
|-- env_build_id: string (nullable = true)
|-- env_build_version: string (nullable = true)
|-- experiments: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- first_paint_mean: double (nullable = true)
|-- flash_version: string (nullable = true)
|-- geo_subdivision1: string (nullable = true)
|-- geo_subdivision2: string (nullable = true)
|-- gfx_features_advanced_layers_status: string (nullable = true)
|-- gfx_features_d2d_status: string (nullable = true)
|-- gfx_features_d3d11_status: string (nullable = true)
|-- gfx_features_gpu_process_status: string (nullable = true)
|-- install_year: long (nullable = true)
|-- is_default_browser: boolean (nullable = true)
|-- is_wow64: boolean (nullable = true)
|-- locale: string (nullable = true)
|-- memory_mb: integer (nullable = true)
|-- normalized_channel: string (nullable = true)
|-- normalized_os_version: string (nullable = true)
|-- os: string (nullable = true)
|-- os_service_pack_major: long (nullable = true)
|-- os_service_pack_minor: long (nullable = true)
|-- os_version: string (nullable = true)
|-- pings_aggregated_by_this_row: long (nullable = true)
|-- places_bookmarks_count_mean: double (nullable = true)
|-- places_pages_count_mean: double (nullable = true)
|-- plugin_hangs_sum: long (nullable = true)
|-- plugins_infobar_allow_sum: long (nullable = true)
|-- plugins_infobar_block_sum: long (nullable = true)
|-- plugins_infobar_shown_sum: long (nullable = true)
|-- plugins_notification_shown_sum: long (nullable = true)
|-- previous_build_id: string (nullable = true)
|-- profile_age_in_days: integer (nullable = true)
|-- profile_creation_date: string (nullable = true)
|-- push_api_notify_sum: long (nullable = true)
|-- sample_id: string (nullable = true)
|-- sandbox_effective_content_process_level: integer (nullable = true)
|-- scalar_combined_webrtc_nicer_stun_retransmits_sum: long (nullable = true)
|-- scalar_combined_webrtc_nicer_turn_401s_sum: long (nullable = true)
|-- scalar_combined_webrtc_nicer_turn_403s_sum: long (nullable = true)
|-- scalar_combined_webrtc_nicer_turn_438s_sum: long (nullable = true)
|-- scalar_content_navigator_storage_estimate_count_sum: long (nullable = true)
|-- scalar_content_navigator_storage_persist_count_sum: long (nullable = true)
|-- scalar_parent_aushelper_websense_reg_version: string (nullable = true)
|-- scalar_parent_browser_engagement_max_concurrent_tab_count_max: integer (nullable = true)
|-- scalar_parent_browser_engagement_max_concurrent_window_count_max: integer (nullable = true)
|-- scalar_parent_browser_engagement_tab_open_event_count_sum: long (nullable = true)
|-- scalar_parent_browser_engagement_total_uri_count_sum: long (nullable = true)
|-- scalar_parent_browser_engagement_unfiltered_uri_count_sum: long (nullable = true)
|-- scalar_parent_browser_engagement_unique_domains_count_max: integer (nullable = true)
|-- scalar_parent_browser_engagement_unique_domains_count_mean: double (nullable = true)
|-- scalar_parent_browser_engagement_window_open_event_count_sum: long (nullable = true)
|-- scalar_parent_devtools_copy_full_css_selector_opened_sum: long (nullable = true)
|-- scalar_parent_devtools_copy_unique_css_selector_opened_sum: long (nullable = true)
|-- scalar_parent_devtools_toolbar_eyedropper_opened_sum: long (nullable = true)
|-- scalar_parent_dom_contentprocess_troubled_due_to_memory_sum: long (nullable = true)
|-- scalar_parent_navigator_storage_estimate_count_sum: long (nullable = true)
|-- scalar_parent_navigator_storage_persist_count_sum: long (nullable = true)
|-- scalar_parent_storage_sync_api_usage_extensions_using_sum: long (nullable = true)
|-- search_cohort: string (nullable = true)
|-- search_count_all: long (nullable = true)
|-- search_count_abouthome: long (nullable = true)
|-- search_count_contextmenu: long (nullable = true)
|-- search_count_newtab: long (nullable = true)
|-- search_count_searchbar: long (nullable = true)
|-- search_count_system: long (nullable = true)
|-- search_count_urlbar: long (nullable = true)
|-- session_restored_mean: double (nullable = true)
|-- sessions_started_on_this_day: long (nullable = true)
|-- shutdown_kill_sum: long (nullable = true)
|-- subsession_hours_sum: decimal(37,6) (nullable = true)
|-- ssl_handshake_result_failure_sum: long (nullable = true)
|-- ssl_handshake_result_success_sum: long (nullable = true)
|-- sync_configured: boolean (nullable = true)
|-- sync_count_desktop_sum: long (nullable = true)
|-- sync_count_mobile_sum: long (nullable = true)
|-- telemetry_enabled: boolean (nullable = true)
|-- timezone_offset: integer (nullable = true)
|-- total_hours_sum: decimal(27,6) (nullable = true)
|-- update_auto_download: boolean (nullable = true)
|-- update_channel: string (nullable = true)
|-- update_enabled: boolean (nullable = true)
|-- vendor: string (nullable = true)
|-- web_notification_shown_sum: long (nullable = true)
|-- windows_build_number: long (nullable = true)
|-- windows_ubr: long (nullable = true)
Code Reference
This dataset is generated by
telemetry-batch-view
.
Refer to this repository for information on how to run or augment the dataset.
Crash Aggregates Reference
Introduction
The crash_aggregates
dataset compiles crash statistics over various dimensions for each day.
Rows and Columns
There's one column for each of the stratifying dimensions and the crash statistics. Each row is a distinct set of dimensions, along with their associated crash stats. Example stratifying dimensions include channel and country, example statistics include usage hours and plugin crashes. See the complete documentation for all available dimensions and statistics.
Accessing the Data
This dataset is accessible via re:dash.
The data is stored as a parquet table in S3 at the following address. See this cookbook to get started working with the data in Spark.
s3://telemetry-parquet/crash_aggregates/v1/
Further Reading
The technical documentation for this dataset can be found in the telemetry-batch-view documentation
Data Reference
Example Queries
Here's an example query that computes crash rates for each channel (sorted by number of usage hours):
SELECT dimensions['channel'] AS channel,
sum(stats['usage_hours']) AS usage_hours,
1000 * sum(stats['main_crashes']) / sum(stats['usage_hours']) AS main_crash_rate,
1000 * sum(stats['content_crashes']) / sum(stats['usage_hours']) AS content_crash_rate,
1000 * sum(stats['plugin_crashes']) / sum(stats['usage_hours']) AS plugin_crash_rate,
1000 * sum(stats['gmplugin_crashes']) / sum(stats['usage_hours']) AS gmplugin_crash_rate,
1000 * sum(stats['gpu_crashes']) / sum(stats['usage_hours']) AS gpu_crash_rate
FROM crash_aggregates
GROUP BY dimensions['channel']
ORDER BY -sum(stats['usage_hours'])
Main process crashes by build date and OS version.
WITH channel_rates AS (
SELECT dimensions['build_id'] AS build_id,
SUM(stats['main_crashes']) AS main_crashes, -- total number of crashes
SUM(stats['usage_hours']) / 1000 AS usage_kilohours, -- thousand hours of usage
dimensions['os_version'] AS os_version -- os version
FROM crash_aggregates
WHERE dimensions['experiment_id'] is null -- not in an experiment
AND regexp_like(dimensions['build_id'], '^\d{14}$') -- validate build IDs
AND dimensions['build_id'] > '20160201000000' -- only in the date range that we care about
GROUP BY dimensions['build_id'], dimensions['os_version']
)
SELECT cast(parse_datetime(build_id, 'yyyyMMddHHmmss') as date) as build_id, -- program build date
usage_kilohours, -- thousands of usage hours
os_version, -- os version
main_crashes / usage_kilohours AS main_crash_rate -- crash rate being defined as crashes per thousand usage hours
FROM channel_rates
WHERE usage_kilohours > 100 -- only aggregates that have statistically significant usage hours
ORDER BY build_id ASC
Sampling
Invalid Pings
We ignore invalid pings in our processing. Invalid pings are defined as those that:
- The submission dates or activity dates are invalid or missing.
- The build ID is malformed.
- The
docType
field is missing or unknown. - The ping is a main ping without usage hours or a crash ping with usage hours.
Scheduling
The crash_aggregates
job is run daily, at midnight UTC.
The job is scheduled on Airflow.
The DAG is here
Schema
The crash_aggregates
table has 4 commonly-used columns:
submission_date
is the date pings were submitted for a particular aggregate.- For example,
select sum(stats['usage_hours']) from crash_aggregates where submission_date = '2016-03-15'
will give the total number of user hours represented by pings submitted on March 15, 2016. - The dataset is partitioned by this field. Queries that limit the possible values of
submission_date
can run significantly faster.
- For example,
activity_date
is the day when the activity being recorded took place.- For example,
select sum(stats['usage_hours']) from crash_aggregates where activity_date = '2016-03-15'
will give the total number of user hours represented by activities that took place on March 15, 2016. - This can be several days before the pings are actually submitted, so it will always be before or on its corresponding
submission_date
. - Therefore, queries that are sensitive to when measurements were taken on the client should prefer this field over
submission_date
.
- For example,
dimensions
is a map of all the other dimensions that we currently care about. These fields include:dimensions['build_version']
is the program version, like46.0a1
.dimensions['build_id']
is theYYYYMMDDhhmmss
timestamp the program was built, like20160123180541
. This is also known as thebuild ID
orbuildid
.dimensions['channel']
is the channel, likerelease
orbeta
.dimensions['application']
is the program name, likeFirefox
orFennec
.dimensions['os_name']
is the name of the OS the program is running on, likeDarwin
orWindows_NT
.dimensions['os_version']
is the version of the OS the program is running on.dimensions['architecture']
is the architecture that the program was built for (not necessarily the one it is running on).dimensions['country']
is the country code for the user (determined using geoIP), likeUS
orUK
.dimensions['experiment_id']
is the identifier of the experiment being participated in, such ase10s-beta46-noapz@experiments.mozilla.org
, or null if no experiment.dimensions['experiment_branch']
is the branch of the experiment being participated in, such ascontrol
orexperiment
, or null if no experiment.dimensions['e10s_enabled']
is whether E10s is enabled.dimensions['gfx_compositor']
is the graphics backend compositor used by the program, such asd3d11
,opengl
andsimple
. Null values may be reported asnone
as well.- All of the above fields can potentially be blank, which means "not present". That means that in the actual pings, the corresponding fields were null.
stats
contains the aggregate values that we care about:stats['usage_hours']
is the number of user-hours represented by the aggregate.stats['main_crashes']
is the number of main process crashes represented by the aggregate (or just program crashes, in the non-E10S case).stats['content_crashes']
is the number of content process crashes represented by the aggregate.stats['plugin_crashes']
is the number of plugin process crashes represented by the aggregate.stats['gmplugin_crashes']
is the number of Gecko media plugin (often abbreviatedGMPlugin
) process crashes represented by the aggregate.stats['content_shutdown_crashes']
is the number of content process crashes that were caused by failure to shut down in a timely manner.stats['gpu_crashes']
is the number of GPU process crashes represented by the aggregate.
TODO(harter)
: https://bugzilla.mozilla.org/show_bug.cgi?id=1361862
Crash Summary Reference
Introduction
The crash_summary
table is the most direct representation of a crash ping.
Contents
The crash_summary
table contains one row for each crash ping.
Each column represents one field from the crash ping payload,
though only a subset of all crash ping fields are included.
Accessing the Data
The data is stored as a parquet table in S3 at the following address. See this cookbook to get started working with the data in Spark.
s3://telemetry-parquet/crash_summary/v1/
crash_summary
is accessible through re:dash.
Here's an example query.
Further Reading
The technical documentation for crash_summary
is located in the
telemetry-batch-view documentation.
The code responsible for generating this dataset is here
Data Reference
Example Queries
Here is an example query to get the total number of main crashes by gfx_compositor
:
select gfx_compositor, count(*)
from crash_summary
where application = 'Firefox'
and (payload.processType IS NULL OR payload.processType = 'main')
group by gfx_compositor
Sampling
CrashSummary
contains one record for every
crash ping
submitted by Firefox.
It was built with the long term goal of providing a base for
CrashAggregates
.
Scheduling
This dataset is updated daily, shortly after midnight UTC. The job is scheduled on telemetry-airflow. The DAG is here.
Schema
root
|-- client_id: string (nullable = true)
|-- normalized_channel: string (nullable = true)
|-- build_version: string (nullable = true)
|-- build_id: string (nullable = true)
|-- channel: string (nullable = true)
|-- application: string (nullable = true)
|-- os_name: string (nullable = true)
|-- os_version: string (nullable = true)
|-- architecture: string (nullable = true)
|-- country: string (nullable = true)
|-- experiment_id: string (nullable = true)
|-- experiment_branch: string (nullable = true)
|-- experiments: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- e10s_enabled: boolean (nullable = true)
|-- gfx_compositor: string (nullable = true)
|-- profile_created: integer (nullable = true)
|-- payload: struct (nullable = true)
| |-- crashDate: string (nullable = true)
| |-- processType: string (nullable = true)
| |-- hasCrashEnvironment: boolean (nullable = true)
| |-- metadata: map (nullable = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
| |-- version: integer (nullable = true)
|-- submission_date: string (nullable = true)
For more detail on where these fields come from in the
raw data,
please look at the case classes
in the CrashSummaryView
code.
Cross Sectional Reference
This data set has been deprecated in favor of Clients Daily
Error Aggregates Reference
Introduction
The error_aggregates_v2
table represents counts of errors counted from main and crash
pings, aggregated every 5 minutes. It is the dataset backing the main mission
control view, but may also be queried
independently.
Contents
The error_aggregates_v2
table contains counts of various error measures (for
example: crashes, "the slow script dialog showing"), aggregated across each
unique set of dimensions (for example: channel, operating system) every 5
minutes. You can get an aggregated count for any particular set of dimensions
by summing using SQL.
Experiment unpacking
It's important to note that when this dataset is written, pings from clients participating in an experiment
are aggregated on the experiment_id
and branch_id
dimensions corresponding to what experiment and branch
they are participating in. However, they are also aggregated with the rest of the population where the values of
these dimensions are null.
Therefore care must be taken when writing aggregating queries over the whole population - in these cases one needs to
filter for experiment_id is null
and branch_id is null
in order to not double-count pings from experiments.
Accessing the data
You can access the data via re:dash. Choose Athena
and then select the
telemetry.error_aggregates_v2
table.
Further Reading
The code responsible for generating this dataset is here.
Data Reference
Example Queries
Getting a large number of different crash measures across many platforms and channels (view on Re:dash):
SELECT window_start,
build_id,
channel,
os_name,
version,
sum(usage_hours) AS usage_hours,
sum(main_crashes) AS main,
sum(content_crashes) AS content,
sum(gpu_crashes) AS gpu,
sum(plugin_crashes) AS plugin,
sum(gmplugin_crashes) AS gmplugin
FROM telemetry.error_aggregates_v2
WHERE application = 'Firefox'
AND (os_name = 'Darwin' or os_name = 'Linux' or os_name = 'Windows_NT')
AND (channel = 'beta' or channel = 'release' or channel = 'nightly' or channel = 'esr')
AND build_id > '201801'
AND window_start > current_timestamp - (1 * interval '24' hour)
AND experiment_id IS NULL
AND branch_id IS NULL
GROUP BY window_start, channel, build_id, version, os_name
Get the number of main_crashes
on Windows over a small interval
(view on Re:dash):
SELECT window_start as time, sum(main_crashes) AS main_crashes
FROM telemetry.error_aggregates_v2
WHERE application = 'Firefox'
AND os_name = 'Windows_NT'
AND channel = 'release'
AND version = '58.0.2'
AND window_start > timestamp '2018-02-21'
AND window_end < timestamp '2018-02-22'
AND experiment_id IS NULL
AND branch_id IS NULL
GROUP BY window_start
Sampling
Data sources
The aggregates in this data source are derived from main, crash and core pings:
- crash pings are used to count/gather main and content crash events, all other errors from desktop clients (including all other crashes) are gathered from main pings
- core pings are used to count usage hours, first subsession and unique client counts.
Scheduling
The error_aggregates
job is run continuously, using the Spark Streaming infrastructure
Schema
The error_aggregates_v2
table has the following columns which define its dimensions:
window_start
: beginning of interval when this sample was takenwindow_end
: end of interval when this sample was taken (will always be 5 minutes more thanwindow_start
for any given row)submission_date_s3
: the date pings were submitted for a particular aggregatechannel
: the channel, likerelease
orbeta
version
: the version e.g.57.0.1
display_version
: like version, but includes beta number if applicable e.g.57.0.1b4
build_id
: theYYYYMMDDhhmmss
timestamp the program was built, like20160123180541
. This is also known as thebuild ID
orbuildid
application
: application name (e.g.Firefox
orFennec
)os_name
: name of the OS (e.g.Darwin
orWindows_NT
)os_version
: version of the OSarchitecture
: build architecture, e.g.x86
country
: country code for the user (determined using geoIP), likeUS
orUK
experiment_id
: identifier of the experiment being participated in, such ase10s-beta46-noapz@experiments.mozilla.org
, null if no experiment or for unpacked rows (see Experiment unpacking)experiment_branch
: the branch of the experiment being participated in, such ascontrol
orexperiment
, null if no experiment or for unpacked rows (see Experiment unpacking)
And these are the various measures we are counting:
usage_hours
: number of usage hours (i.e. total number of session hours reported by the pings in this aggregate, note that this might include time where people are not actively using the browser or their computer is asleep)count
: number of pings processed in this aggregatemain_crashes
: number of main process crashes (or just program crashes, in the non-e10s case)startup_crashes
: number of startup crashescontent_crashes
: number of content process crashes (version => 58
only)gpu_crashes
: number of GPU process crashesplugin_crashes
: number of plugin process crashesgmplugin_crashes
: number of Gecko media plugin (often abbreviatedGMPlugin
) process crashescontent_shutdown_crashes
: number of content process crashes that were caused by failure to shut down in a timely manner (version => 58
only)browser_shim_usage_blocked
: number of times a CPOW shim was blocked from being created by browser codepermissions_sql_corrupted
: number of times the permissions SQL error occurred (beta/nightly only)defective_permissions_sql_removed
: number of times there was a removal of defectivepermissions.sqlite
(beta/nightly only)slow_script_notice_count
: number of times the slow script notice count was shown (beta/nightly only)slow_script_page_count
: number of pages that trigger slow script notices (beta/nightly only)
Introduction
This is a work in progress. The work is being tracked here.
Data Reference
Example Queries
This is a work in progress. The work is being tracked here.
Sampling
The events dataset contains one row for each event in a main ping.
This dataset is derived from main_summary
so any of main_summary
's filters affect this dataset as well.
Data is currently available from 2017-01-05 on.
Scheduling
The events dataset is updated daily, shortly after
main_summary
is updated.
The job is scheduled on Airflow.
The DAG is here.
Firefox events
Firefox has an API to record events, which are then submitted through the main ping. The format and mechanism of event collection in Firefox is documented here.
The full events data pipeline is documented here.
Schema
As of 2017-01-26, the current version of the events
dataset is v1
, and has a schema as follows:
root
|-- document_id: string (nullable = true)
|-- client_id: string (nullable = true)
|-- normalized_channel: string (nullable = true)
|-- country: string (nullable = true)
|-- locale: string (nullable = true)
|-- app_name: string (nullable = true)
|-- app_version: string (nullable = true)
|-- os: string (nullable = true)
|-- os_version: string (nullable = true)
|-- subsession_start_date: string (nullable = true)
|-- subsession_length: long (nullable = true)
|-- sync_configured: boolean (nullable = true)
|-- sync_count_desktop: integer (nullable = true)
|-- sync_count_mobile: integer (nullable = true)
|-- timestamp: long (nullable = true)
|-- sample_id: string (nullable = true)
|-- event_timestamp: long (nullable = false)
|-- event_category: string (nullable = false)
|-- event_method: string (nullable = false)
|-- event_object: string (nullable = false)
|-- event_string_value: string (nullable = true)
|-- event_map_values: map (nullable = true)
| |-- key: string
| |-- value: string
|-- submission_date_s3: string (nullable = true)
|-- doc_type: string (nullable = true)
First Shutdown Summary
Introduction
The first_shutdown_summary
table is a summary of the first-shutdown
ping.
Contents
The first shutdown ping contains first session usage data. The
dataset has rows similar to the
telemetry_new_profile_parquet
,
but in the shape of
main_summary
.
Background and Caveats
Ping latency was reduced through the shutdown ping-sender mechanism in Firefox 55. To maintain consistent historical behavior, the first main ping is not sent until the second start up. In Firefox 57, a separate first-shutdown ping was created to evaluate first-shutdown behavior while maintaining backwards compatibility.
In many cases, the first-shutdown ping is a duplicate of the main ping. The first-shutdown summary can be used in conjunction with the main summary by taking the union and deduplicating on the document_id
.
Accessing the Data
The data can be accessed as first_shutdown_summary
. It is currently stored in the following path.
s3://telemetry-parquet/first_shutdown_summary/v4/
The data is backfilled to 2017-09-22, the date of its first nightly appearance. This data should be available to all releases on and after Firefox 57.
Code Reference
This dataset is generated by telemetry-batch-view.
Longitudinal Reference
Introduction
The longitudinal
dataset is a 1% sample of main ping data
organized so that each row corresponds to a client_id
.
If you're not sure which dataset to use for your analysis,
this is probably what you want.
Contents
Each row in the longitudinal
dataset represents one client_id
,
which is approximately a user.
Each column represents a field from the main ping.
Most fields contain arrays of values, with one value for each ping associated with a client_id
.
Using arrays give you access to the raw data from each ping,
but can be difficult to work with from SQL.
Here's a query showing some sample data
to help illustrate.
Take a look at the longitudinal examples if you get stuck.
Background and Caveats
Think of the longitudinal table as wide and short.
The dataset contains more columns than main_summary
and down-samples to 1% of all clients to reduce query computation time and save resources.
In summary, the longitudinal table differs from main_summary
in two important ways:
- The longitudinal dataset groups all data so that one row represents a
client_id
- The longitudinal dataset samples to 1% of all
client_id
s
Please note that this dataset only contains release (or opt-out) histograms and scalars.
Accessing the Data
The longitudinal
is available in re:dash,
though it can be difficult to work with the array values in SQL.
Take a look at this example query.
The data is stored as a parquet table in S3 at the following address. See this cookbook to get started working with the data in Spark.
s3://telemetry-parquet/longitudinal/
Data Reference
Example Queries
Take a look at the Longitudinal Examples Cookbook.
Sampling
Pings Within Last 6 Months
The longitudinal
filters to main
pings from within the last 6 months.
1% Sample
The longitudinal dataset samples down to 1% of all clients in the above sample. The sample is generated by the following process:
- hash the
client_id
for each ping from the last 6 months. - project that hash onto an integer from 1:100, inclusive
- filter to pings with
client_id
s matching a 'magic number' (in this case 42)
This process has a couple of nice properties:
- The sample is consistent over time.
The
longitudinal
dataset is regenerated weekly. The clients included in each run are very similar with this process. The only change will come from never-before-seen clients, or clients without a ping in the last 180 days. - We don't need to adjust the sample as new clients enter or exit our pool.
More practically,
the sample is created by filtering to pings with main_summary.sample_id == 42
.
If you're working with main_summary
,
you can recreate this sample by doing this filter manually.
Scheduling
The longitudinal
job is run weekly, early on Sunday morning UTC.
The job is scheduled on Airflow.
The DAG is here.
Schema
TODO(harter)
: https://bugzilla.mozilla.org/show_bug.cgi?id=1361862
Code Reference
This dataset is generated by telemetry-batch-view. Refer to this repository for information on how to run or augment the dataset.
Main Summary
Introduction
The main_summary
table is the most direct representation of a main ping
but can be difficult to work with due to its size.
Prefer the longitudinal
dataset unless using the sampled data is prohibitive.
Contents
The main_summary
table contains one row for each ping.
Each column represents one field from the main ping payload,
though only a subset of all main ping fields are included.
This dataset does not include histograms.
Background and Caveats
This table is massive, and due to its size, it can be difficult to work with.
You should avoid querying main_summary
from re:dash.
Your queries will be slow to complete and can impact performance for other users,
since re:dash on a shared cluster.
Instead, we recommend using the longitudinal
or clients_daily
dataset where possible.
If these datasets do not suffice, consider using Spark on an
ATMO cluster.
In the odd case where these queries are necessary,
make use of the sample_id
field and limit to a short submission date range.
Accessing the Data
The data is stored as a parquet table in S3 at the following address. See this cookbook to get started working with the data in Spark.
s3://telemetry-parquet/main_summary/v4/
Though not recommended main_summary
is accessible through re:dash.
Here's an example query.
Your queries will be slow to complete and can impact performance for other users,
since re:dash is on a shared cluster.
Further Reading
The technical documentation for main_summary
is located in the
telemetry-batch-view documentation.
The code responsible for generating this dataset is here
Adding New Fields
We support a few basic types that can be easily added to main_summary
.
Non-addon scalars are automatically added to main_summary
.
User Preferences
These are added in the userPrefsList
, near the top of the
Main Summary file.
They must be available in the ping environment
to be included here. There is more information in the file itself.
Once added, they will show as top-level fields, with the string user_pref
prepended. For example, IntegerUserPref("dom.ipc.processCount")
becomes user_pref_dom_ipc_processcount
.
Histograms
Histograms can simply be added to the histogramsWhitelist
near the top of
Main Summary file.
Simply add the name of the histogram in the alphabetically-sorted position in the list.
Each process a histogram is recorded in will have a column in main_summary
, with the string histogram_
prepended. For example, CYCLE_COLLECTOR_MAX_PAUSE
is recorded in the parent
, content
, and gpu
processes (according to the definition).
It will then result in three columns:
histogram_parent_cycle_collector_max_pause
histogram_content_cycle_collector_max_pause
histogram_gpu_cycle_collector_max_pause
Addon Scalars
Addon scalars are recorded by an addon. To include one of these, add the definition to the addon scalars definition file in telemetry-batch-view. Be sure to include the section:
record_in_processes:
- 'dynamic'
The addon scalars can then be found in the associated column, depending on their type:
string_addon_scalars
keyed_string_addon_scalars
uint_addon_scalars
keyed_uint_addon_scalars
boolean_addon_scalars
keyed_boolean_addon_scalars
These columns are all maps. Each addon scalar will be a key within that map, concatenating the top-level subsection within Scalars.yaml
with its name to get the key. As an example, consider the following scalar definition:
test:
misunderestimated_nucular:
description: A test scalar, no soup for you!
expires: never
kind: string
keyed: true
notification_emails:
- frank@mozilla.com
record_in_processes:
- 'dynamic'
For example, you could find the addon scalar test.misunderestimated_nucular
, a keyed string scalar, using keyed_string_addon_scalars['test_misunderestimated_nucular']
.
In general, use element_at
, which returns NULL
when the key is not found: element_at(keyed_string_addon_scalars, 'test_misunderestimated_nucular')
Other Fields
We can include other types of fields as well, for example if there needs to be a specific transformation done. We do need the data to be available in the Main Ping
Data Reference
Example Queries
We recommend working with this dataset via Spark rather than sql.t.m.o
.
Due to the large number of records,
queries can consume a lot of resources on the
shared cluster and impact other users.
Queries via sql.t.m.o
should limit to a short submission_date_s3
range,
and ideally make use of the sample_id
field.
When using Presto to query the data from sql.t.m.o
,
you can use the UNNEST
feature to access items in the
search_counts
, popup_notification_stats
and active_addons
fields.
For example, to compare the search volume for different search source values, you could use:
WITH search_data AS (
SELECT
s.source AS search_source,
s.count AS search_count
FROM
main_summary
CROSS JOIN UNNEST(search_counts) AS t(s)
WHERE
submission_date_s3 = '20160510'
AND sample_id = '42'
AND search_counts IS NOT NULL
)
SELECT
search_source,
sum(search_count) as total_searches
FROM search_data
GROUP BY search_source
ORDER BY sum(search_count) DESC
Sampling
The main_summary
dataset contains one record for each main
ping
as long as the record contains a non-null value for
documentId
, submissionDate
, and Timestamp
.
We do not ever expect nulls for these fields.
Scheduling
This dataset is updated daily via the telemetry-airflow infrastructure. The job DAG runs every day shortly after midnight UTC. You can find the job definition here
Schema
As of 2017-12-03, the current version of the main_summary
dataset is v4
, and has a schema as follows:
root
|-- document_id: string (nullable = false)
|-- client_id: string (nullable = true)
|-- channel: string (nullable = true)
|-- normalized_channel: string (nullable = true)
|-- normalized_os_version: string (nullable = true)
|-- country: string (nullable = true)
|-- city: string (nullable = true)
|-- geo_subdivision1: string (nullable = true)
|-- geo_subdivision2: string (nullable = true)
|-- os: string (nullable = true)
|-- os_version: string (nullable = true)
|-- os_service_pack_major: long (nullable = true)
|-- os_service_pack_minor: long (nullable = true)
|-- windows_build_number: long (nullable = true)
|-- windows_ubr: long (nullable = true)
|-- install_year: long (nullable = true)
|-- is_wow64: boolean (nullable = true)
|-- memory_mb: integer (nullable = true)
|-- cpu_count: integer (nullable = true)
|-- cpu_cores: integer (nullable = true)
|-- cpu_vendor: string (nullable = true)
|-- cpu_family: integer (nullable = true)
|-- cpu_model: integer (nullable = true)
|-- cpu_stepping: integer (nullable = true)
|-- cpu_l2_cache_kb: integer (nullable = true)
|-- cpu_l3_cache_kb: integer (nullable = true)
|-- cpu_speed_mhz: integer (nullable = true)
|-- gfx_features_d3d11_status: string (nullable = true)
|-- gfx_features_d2d_status: string (nullable = true)
|-- gfx_features_gpu_process_status: string (nullable = true)
|-- gfx_features_advanced_layers_status: string (nullable = true)
|-- apple_model_id: string (nullable = true)
|-- antivirus: array (nullable = true)
| |-- element: string (containsNull = false)
|-- antispyware: array (nullable = true)
| |-- element: string (containsNull = false)
|-- firewall: array (nullable = true)
| |-- element: string (containsNull = false)
|-- profile_creation_date: long (nullable = true)
|-- profile_reset_date: long (nullable = true)
|-- previous_build_id: string (nullable = true)
|-- session_id: string (nullable = true)
|-- subsession_id: string (nullable = true)
|-- previous_session_id: string (nullable = true)
|-- previous_subsession_id: string (nullable = true)
|-- session_start_date: string (nullable = true)
|-- subsession_start_date: string (nullable = true)
|-- session_length: long (nullable = true)
|-- subsession_length: long (nullable = true)
|-- subsession_counter: integer (nullable = true)
|-- profile_subsession_counter: integer (nullable = true)
|-- creation_date: string (nullable = true)
|-- distribution_id: string (nullable = true)
|-- submission_date: string (nullable = false)
|-- sync_configured: boolean (nullable = true)
|-- sync_count_desktop: integer (nullable = true)
|-- sync_count_mobile: integer (nullable = true)
|-- app_build_id: string (nullable = true)
|-- app_display_version: string (nullable = true)
|-- app_name: string (nullable = true)
|-- app_version: string (nullable = true)
|-- timestamp: long (nullable = false)
|-- env_build_id: string (nullable = true)
|-- env_build_version: string (nullable = true)
|-- env_build_arch: string (nullable = true)
|-- e10s_enabled: boolean (nullable = true)
|-- e10s_multi_processes: long (nullable = true)
|-- locale: string (nullable = true)
|-- update_channel: string (nullable = true)
|-- update_enabled: boolean (nullable = true)
|-- update_auto_download: boolean (nullable = true)
|-- attribution: struct (nullable = true)
| |-- source: string (nullable = true)
| |-- medium: string (nullable = true)
| |-- campaign: string (nullable = true)
| |-- content: string (nullable = true)
|-- sandbox_effective_content_process_level: integer (nullable = true)
|-- active_experiment_id: string (nullable = true)
|-- active_experiment_branch: string (nullable = true)
|-- reason: string (nullable = true)
|-- timezone_offset: integer (nullable = true)
|-- plugin_hangs: integer (nullable = true)
|-- aborts_plugin: integer (nullable = true)
|-- aborts_content: integer (nullable = true)
|-- aborts_gmplugin: integer (nullable = true)
|-- crashes_detected_plugin: integer (nullable = true)
|-- crashes_detected_content: integer (nullable = true)
|-- crashes_detected_gmplugin: integer (nullable = true)
|-- crash_submit_attempt_main: integer (nullable = true)
|-- crash_submit_attempt_content: integer (nullable = true)
|-- crash_submit_attempt_plugin: integer (nullable = true)
|-- crash_submit_success_main: integer (nullable = true)
|-- crash_submit_success_content: integer (nullable = true)
|-- crash_submit_success_plugin: integer (nullable = true)
|-- shutdown_kill: integer (nullable = true)
|-- active_addons_count: long (nullable = true)
|-- flash_version: string (nullable = true)
|-- vendor: string (nullable = true)
|-- is_default_browser: boolean (nullable = true)
|-- default_search_engine_data_name: string (nullable = true)
|-- default_search_engine_data_load_path: string (nullable = true)
|-- default_search_engine_data_origin: string (nullable = true)
|-- default_search_engine_data_submission_url: string (nullable = true)
|-- default_search_engine: string (nullable = true)
|-- devtools_toolbox_opened_count: integer (nullable = true)
|-- client_submission_date: string (nullable = true)
|-- client_clock_skew: long (nullable = true)
|-- client_submission_latency: long (nullable = true)
|-- places_bookmarks_count: integer (nullable = true)
|-- places_pages_count: integer (nullable = true)
|-- push_api_notify: integer (nullable = true)
|-- web_notification_shown: integer (nullable = true)
|-- popup_notification_stats: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- offered: integer (nullable = true)
| | |-- action_1: integer (nullable = true)
| | |-- action_2: integer (nullable = true)
| | |-- action_3: integer (nullable = true)
| | |-- action_last: integer (nullable = true)
| | |-- dismissal_click_elsewhere: integer (nullable = true)
| | |-- dismissal_leave_page: integer (nullable = true)
| | |-- dismissal_close_button: integer (nullable = true)
| | |-- dismissal_not_now: integer (nullable = true)
| | |-- open_submenu: integer (nullable = true)
| | |-- learn_more: integer (nullable = true)
| | |-- reopen_offered: integer (nullable = true)
| | |-- reopen_action_1: integer (nullable = true)
| | |-- reopen_action_2: integer (nullable = true)
| | |-- reopen_action_3: integer (nullable = true)
| | |-- reopen_action_last: integer (nullable = true)
| | |-- reopen_dismissal_click_elsewhere: integer (nullable = true)
| | |-- reopen_dismissal_leave_page: integer (nullable = true)
| | |-- reopen_dismissal_close_button: integer (nullable = true)
| | |-- reopen_dismissal_not_now: integer (nullable = true)
| | |-- reopen_open_submenu: integer (nullable = true)
| | |-- reopen_learn_more: integer (nullable = true)
|-- search_counts: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- engine: string (nullable = true)
| | |-- source: string (nullable = true)
| | |-- count: long (nullable = true)
|-- active_addons: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- addon_id: string (nullable = false)
| | |-- blocklisted: boolean (nullable = true)
| | |-- name: string (nullable = true)
| | |-- user_disabled: boolean (nullable = true)
| | |-- app_disabled: boolean (nullable = true)
| | |-- version: string (nullable = true)
| | |-- scope: integer (nullable = true)
| | |-- type: string (nullable = true)
| | |-- foreign_install: boolean (nullable = true)
| | |-- has_binary_components: boolean (nullable = true)
| | |-- install_day: integer (nullable = true)
| | |-- update_day: integer (nullable = true)
| | |-- signed_state: integer (nullable = true)
| | |-- is_system: boolean (nullable = true)
| | |-- is_web_extension: boolean (nullable = true)
| | |-- multiprocess_compatible: boolean (nullable = true)
|-- disabled_addons_ids: array (nullable = true)
| |-- element: string (containsNull = false)
|-- active_theme: struct (nullable = true)
| |-- addon_id: string (nullable = false)
| |-- blocklisted: boolean (nullable = true)
| |-- name: string (nullable = true)
| |-- user_disabled: boolean (nullable = true)
| |-- app_disabled: boolean (nullable = true)
| |-- version: string (nullable = true)
| |-- scope: integer (nullable = true)
| |-- type: string (nullable = true)
| |-- foreign_install: boolean (nullable = true)
| |-- has_binary_components: boolean (nullable = true)
| |-- install_day: integer (nullable = true)
| |-- update_day: integer (nullable = true)
| |-- signed_state: integer (nullable = true)
| |-- is_system: boolean (nullable = true)
| |-- is_web_extension: boolean (nullable = true)
| |-- multiprocess_compatible: boolean (nullable = true)
|-- blocklist_enabled: boolean (nullable = true)
|-- addon_compatibility_check_enabled: boolean (nullable = true)
|-- telemetry_enabled: boolean (nullable = true)
|-- user_prefs: struct (nullable = true)
| |-- dom_ipc_process_count: integer (nullable = true)
| |-- extensions_allow_non_mpc_extensions: boolean (nullable = true)
|-- events: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- timestamp: long (nullable = false)
| | |-- category: string (nullable = false)
| | |-- method: string (nullable = false)
| | |-- object: string (nullable = false)
| | |-- string_value: string (nullable = true)
| | |-- map_values: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
|-- ssl_handshake_result_success: integer (nullable = true)
|-- ssl_handshake_result_failure: integer (nullable = true)
|-- ssl_handshake_result: map (nullable = true)
| |-- key: string
| |-- value: integer (valueContainsNull = true)
|-- active_ticks: integer (nullable = true)
|-- main: integer (nullable = true)
|-- first_paint: integer (nullable = true)
|-- session_restored: integer (nullable = true)
|-- total_time: integer (nullable = true)
|-- plugins_notification_shown: integer (nullable = true)
|-- plugins_notification_user_action: struct (nullable = true)
| |-- allow_now: integer (nullable = true)
| |-- allow_always: integer (nullable = true)
| |-- block: integer (nullable = true)
|-- plugins_infobar_shown: integer (nullable = true)
|-- plugins_infobar_block: integer (nullable = true)
|-- plugins_infobar_allow: integer (nullable = true)
|-- plugins_infobar_dismissed: integer (nullable = true)
|-- experiments: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- search_cohort: string (nullable = true)
|-- gfx_compositor: string (nullable = true)
|-- quantum_ready: boolean (nullable = true)
|-- gc_max_pause_ms_main_above_150: long (nullable = true)
|-- gc_max_pause_ms_main_above_250: long (nullable = true)
|-- gc_max_pause_ms_main_above_2500: long (nullable = true)
|-- gc_max_pause_ms_content_above_150: long (nullable = true)
|-- gc_max_pause_ms_content_above_250: long (nullable = true)
|-- gc_max_pause_ms_content_above_2500: long (nullable = true)
|-- cycle_collector_max_pause_main_above_150: long (nullable = true)
|-- cycle_collector_max_pause_main_above_250: long (nullable = true)
|-- cycle_collector_max_pause_main_above_2500: long (nullable = true)
|-- cycle_collector_max_pause_content_above_150: long (nullable = true)
|-- cycle_collector_max_pause_content_above_250: long (nullable = true)
|-- cycle_collector_max_pause_content_above_2500: long (nullable = true)
|-- input_event_response_coalesced_ms_main_above_150: long (nullable = true)
|-- input_event_response_coalesced_ms_main_above_250: long (nullable = true)
|-- input_event_response_coalesced_ms_main_above_2500: long (nullable = true)
|-- input_event_response_coalesced_ms_content_above_150: long (nullable = true)
|-- input_event_response_coalesced_ms_content_above_250: long (nullable = true)
|-- input_event_response_coalesced_ms_content_above_2500: long (nullable = true)
|-- ghost_windows_main_above_1: long (nullable = true)
|-- ghost_windows_content_above_1: long (nullable = true)
|-- user_pref_dom_ipc_plugins_sandbox_level_flash: integer (nullable = true)
|-- user_pref_dom_ipc_processcount: integer (nullable = true)
|-- user_pref_extensions_allow_non_mpc_extensions: boolean (nullable = true)
|-- user_pref_extensions_legacy_enabled: boolean (nullable = true)
|-- user_pref_browser_search_widget_innavbar: boolean (nullable = true)
|-- user_pref_general_config_filename: string (nullable = true)
|-- ** dynamically included scalar fields, see source **
|-- ** dynamically included whitelisted histograms, see source **
|-- boolean_addon_scalars: map (nullable = true)
| |-- key: string
| |-- value: boolean (valueContainsNull = true)
|-- keyed_boolean_addon_scalars: map (nullable = true)
| |-- key: string
| |-- value: map (valueContainsNull = true)
| | |-- key: string
| | |-- value: boolean (valueContainsNull = true)
|-- string_addon_scalars: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- keyed_string_addon_scalars: map (nullable = true)
| |-- key: string
| |-- value: map (valueContainsNull = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
|-- uint_addon_scalars: map (nullable = true)
| |-- key: string
| |-- value: integer (valueContainsNull = true)
|-- keyed_uint_addon_scalars: map (nullable = true)
| |-- key: string
| |-- value: map (valueContainsNull = true)
| | |-- key: string
| | |-- value: integer (valueContainsNull = true)
|-- submission_date_s3: string (nullable = true)
|-- sample_id: string (nullable = true)
For more detail on where these fields come from in the
raw data,
please look
in the MainSummaryView
code.
in the buildSchema
function.
Most of the fields are simple scalar values, with a few notable exceptions:
- The
search_count
field is an array of structs, each item in the array representing a 3-tuple of (engine
,source
,count
). Theengine
field represents the name of the search engine against which the searches were done. Thesource
field represents the part of the Firefox UI that was used to perform the search. It contains values such asabouthome
,urlbar
, andsearchbar
. Thecount
field contains the number of searches performed against this engine+source combination during that subsession. Any of the fields in the struct may be null (for example if the search key did not match the expected pattern, or if the count was non-numeric). - The
loop_activity_counter
field is a simple struct containing inner fields for each expected value of theLOOP_ACTIVITY_COUNTER
Enumerated Histogram. Each inner field is a count for that histogram bucket. - The
popup_notification_stats
field is a map ofString
keys to struct values, each field in the struct being a count for the expected values of thePOPUP_NOTIFICATION_STATS
Keyed Enumerated Histogram. - The
places_bookmarks_count
andplaces_pages_count
fields contain the mean value of the corresponding Histogram, which can be interpreted as the average number of bookmarks or pages in a given subsession. - The
active_addons
field contains an array of structs, one for each entry in theenvironment.addons.activeAddons
section of the payload. More detail in Bug 1290181. - The
disabled_addons_ids
field contains an array of strings, one for each entry in thepayload.addonDetails
which is not already reported in theenvironment.addons.activeAddons
section of the payload. More detail in Bug 1390814. Please note that while using this field is generally OK, this was introduced to support the TAAR project and you should not count on it in the future. The field can stay in themain_summary
, but we might need to slightly change the ping structure to something better thanpayload.addonDetails
. - The
theme
field contains a single struct in the same shape as the items in theactive_addons
array. It contains information about the currently active browser theme. - The
user_prefs
field contains a struct with values for preferences of interest. - The
events
field contains an array of event structs. - Dynamically-included histogram fields are present as key->value maps, or key->(key->value) nested maps for keyed histograms.
Time formats
Columns in main_summary
may use one of a handful of time formats with different precisions:
Column Name | Origin | Description | Example | Spark | Presto |
---|---|---|---|---|---|
timestamp | stamped at ingestion | nanoseconds since epoch | 1504689165972861952 | from_unixtime(timestamp/1e9) | from_unixtime(timestamp/1e9) |
submission_date_s3 | derived from timestamp | YYYYMMDD date string of timestamp | 20170906 | from_unixtime(unix_timestamp(submission_date, 'yyyyMMdd')) | date_parse(submission_date, '%Y%m%d') |
client_submission_date | derived from HTTP header: Fields[Date] | HTTP date header string sent with the ping | Tue, 27 Sep 2016 16:28:23 GMT | unix_timestamp(client_submission_date, 'EEE, dd M yyyy HH:mm:ss zzz') | date_parse(substr(client_submission_date, 1, 25), '%a, %d %b %Y %H:%i:%s') |
creation_date | creationDate | time of ping creation ISO8601 at UTC+0 | 2017-09-06T08:21:36.002Z | to_timestamp(creation_date, "yyyy-MM-dd'T'HH:mm:ss.SSSXXX") | from_iso8601_timestamp(creation_date) AT TIME ZONE 'GMT' |
timezone_offset | info.timezoneOffset | timezone offset in minutes | 120 | ||
subsession_start_date | info.subsessionStartDate | hourly precision, ISO8601 date in local time | 2017-09-06T00:00:00.0+02:00 | from_iso8601_timestamp(subsession_start_date) AT TIME ZONE 'GMT' | |
subsession_length | info.subsessionLength | subsession length in seconds | 599 | date_add('second', subsession_length, subsession_start_date) | |
profile_creation_date | environment.profile.creationDate | days since epoch | 15,755 | from_unixtime(profile_creation_date * 86400) |
Code Reference
This dataset is generated by telemetry-batch-view. Refer to this repository for information on how to run or augment the dataset.
New Profile
Introduction
The telemetry_new_profile_parquet
table is the most direct representation of a new-profile ping.
Contents
The table contains one row for each ping. Each column represents one field from the new-profile ping payload, though only a subset of all fields are included.
Accessing the Data
The data is stored as a parquet table in S3 at the following address. See this cookbook to get started working with the data in Spark.
s3://net-mozaws-prod-us-west-2-pipeline-data/telemetry-new-profile-parquet/v2/
The telemetry_new_profile_parquet
is accessible through re:dash.
Here's an example query.
Further Reading
This dataset is generated automatically using direct to parquet. The configuration responsible for generating this dataset was introduced in bug 1360256.
Data Reference
Schema
As of 2018-06-26, the current version of the telemetry_new_profile_parquet
dataset is v2
, and has a schema as follows:
root
|-- id: string (nullable = true)
|-- client_id: string (nullable = true)
|-- metadata: struct (nullable = true)
| |-- timestamp: long (nullable = true)
| |-- date: string (nullable = true)
| |-- normalized_channel: string (nullable = true)
| |-- geo_country: string (nullable = true)
| |-- geo_city: string (nullable = true)
| |-- geo_subdivision1: string (nullable = true)
| |-- geo_subdivision2: string (nullable = true)
| |-- creation_timestamp: long (nullable = true)
| |-- x_ping_sender_version: string (nullable = true)
|-- environment: struct (nullable = true)
| |-- build: struct (nullable = true)
| | |-- application_name: string (nullable = true)
| | |-- architecture: string (nullable = true)
| | |-- version: string (nullable = true)
| | |-- build_id: string (nullable = true)
| | |-- vendor: string (nullable = true)
| | |-- hotfix_version: string (nullable = true)
| |-- partner: struct (nullable = true)
| | |-- distribution_id: string (nullable = true)
| | |-- distribution_version: string (nullable = true)
| | |-- partner_id: string (nullable = true)
| | |-- distributor: string (nullable = true)
| | |-- distributor_channel: string (nullable = true)
| | |-- partner_names: array (nullable = true)
| | | |-- element: string (containsNull = true)
| |-- settings: struct (nullable = true)
| | |-- is_default_browser: boolean (nullable = true)
| | |-- default_search_engine: string (nullable = true)
| | |-- default_search_engine_data: struct (nullable = true)
| | | |-- name: string (nullable = true)
| | | |-- load_path: string (nullable = true)
| | | |-- origin: string (nullable = true)
| | | |-- submission_url: string (nullable = true)
| | |-- telemetry_enabled: boolean (nullable = true)
| | |-- locale: string (nullable = true)
| | |-- attribution: struct (nullable = true)
| | | |-- source: string (nullable = true)
| | | |-- medium: string (nullable = true)
| | | |-- campaign: string (nullable = true)
| | | |-- content: string (nullable = true)
| | |-- update: struct (nullable = true)
| | | |-- channel: string (nullable = true)
| | | |-- enabled: boolean (nullable = true)
| | | |-- auto_download: boolean (nullable = true)
| |-- system: struct (nullable = true)
| | |-- os: struct (nullable = true)
| | | |-- name: string (nullable = true)
| | | |-- version: string (nullable = true)
| | | |-- locale: string (nullable = true)
| |-- profile: struct (nullable = true)
| | |-- creation_date: long (nullable = true)
|-- payload: struct (nullable = true)
| |-- reason: string (nullable = true)
|-- submission: string (nullable = true)
For more detail on the raw ping these fields come from, see the raw data.
1 Day Retention
Introduction
The retention
table provides client counts relevant to client retention at a
1-day granularity. The project is tracked in Bug 1381840
Contents
The retention
table contains a set of attribute columns used to specify a
cohort of users and a set of metric columns to describe cohort activity. Each
row contains a permutation of attributes, an approximate set of clients in a
cohort, and the aggregate engagement metrics.
This table uses the HyperLogLog (HLL) sketch to create an approximate set of
clients in a cohort. HLL allows counting across overlapping cohorts in a single
pass while avoiding the problem of double counting. This data-structure has the
benefit of being compact and performant in the context of retention analysis,
at the expense of precision. For example, calculating a 7-day retention period
can be obtained by aggregating over a week of retention data using the union
operation. With SQL primitive, this requires a recalculation of COUNT DISTINCT
over client_id
's in the 7-day window.
Background and Caveats
- The data starts at 2017-03-06, the merge date where Nightly started to
track Firefox 55 in Mozilla-Central. However, there was
not a consistent view into the behavior of first session profiles until the
new_profile
ping. This means much of the data is inaccurate before 2017-06-26. - This dataset uses 4 day reporting latency to aggregate at least 99% of the data in a given submission date. This figure is derived from the telemetry-health measurements on submission latency, with the discussion in Bug 1407410. This latency metric was reduced Firefox 55 with the introduction of the shutdown ping-sender mechanism.
- Caution should be taken before adding new columns. Additional attribute columns will grow the number of rows exponentially.
- The number of HLL bits chosen for this dataset is 13. This means the default size of the HLL object is 2^13 bits or 1KiB. This maintains about a 1% error on average. See this table from Algebird's HLL implementation for more details.
Accessing the Data
The data is primarily available through Re:dash on STMO via the Presto source. This service has been configured to use predefined HLL functions.
The column should first be cast to the HLL type. The scalar
cardinality(<hll_column>)
function will approximate the number of unique
items per HLL object. The aggregate merge(<hll_column>)
function will perform
the set union between all objects in a column.
Example: Cast the count column into the appropriate type.
SELECT cast(hll as HLL) as n_profiles_hll FROM retention
Count the number of clients seen over all attribute combinations.
SELECT cardinality(cast(hll as HLL)) FROM retention
Group-by and aggregate client counts over different release channels.
SELECT channel, cardinality(merge(cast(hll AS HLL))
FROM retention
GROUP BY channel
The HyperLogLog library wrappers are available for use outside of the
configured STMO environment, spark-hyperloglog
and
presto-hyperloglog
.
Also see the client_count_daily
dataset.
Data Reference
Example Queries
See the Example Usage Dashboard for more usages of datasets of the same shape.
Scheduling
The job is scheduled on Airflow on a daily basis after main_summary
is run
for the day. This job requires both mozetl
and telemetry-batch-view
as
dependencies.
Schema
As of 2017-10-10, the current version of retention
is v1
and has a schema
as follows:
root
|-- subsession_start: string (nullable = true)
|-- profile_creation: string (nullable = true)
|-- days_since_creation: long (nullable = true)
|-- channel: string (nullable = true)
|-- app_version: string (nullable = true)
|-- geo: string (nullable = true)
|-- distribution_id: string (nullable = true)
|-- is_funnelcake: boolean (nullable = true)
|-- source: string (nullable = true)
|-- medium: string (nullable = true)
|-- content: string (nullable = true)
|-- sync_usage: string (nullable = true)
|-- is_active: boolean (nullable = true)
|-- hll: binary (nullable = true)
|-- usage_hours: double (nullable = true)
|-- sum_squared_usage_hours: double (nullable = true)
|-- total_uri_count: long (nullable = true)
|-- unique_domains_count: double (nullable = true)
Code Reference
The ETL script for processing the data before aggregation is found in
mozetl.engagement.retention
. The aggregate job is found in
telemetry-batch-view as the RetentionView
.
The runner script performs all the necessary setup to run on EMR. This script can be used to perform backfill.
Socorro Crash Reports
Introduction
Public crash statistics for Firefox are available through the Data Platform in a socorro_crash
dataset.
The crash data in Socorro is sanitized and made available to ATMO and STMO.
A nightly import job converts batches of JSON documents into a columnar format using the associated JSON Schema.
Contents
Accessing the Data
The dataset is available in parquet at s3://telemetry-parquet/socorro_crash/v2
.
It is also indexed with Athena and Presto with the table name socorro_crash
.
Data Reference
Example
The dataset can be queried using SQL. For example, we can aggregate the number of crashes and total up-time by date and reason.
SELECT crash_date,
reason,
count(*) as n_crashes,
avg(uptime) as avg_uptime,
stddev(uptime) as stddev_uptime,
approx_percentile(uptime, ARRAY [0.25, 0.5, 0.75]) as qntl_uptime
FROM socorro_crash
WHERE crash_date='20180520'
GROUP BY 1,
2
Scheduling
The job is schedule on a nightly basis on airflow.
The dag is available under mozilla/telemetry-airflow:/dags/socorro_import.py
.
Schema
The source schema is available on the mozilla/socorro
GitHub repository.
This schema is transformed into a Spark-SQL structure and serialized to parquet after transforming column names from camelCase
to snake_case
.
Code Reference
The code is a notebook in the mozilla-services/data-pipeline
repository.
SSL Ratios
Introduction
The public SSL dataset publishes the percentage of page loads Firefox users have performed that were conducted over SSL. This dataset is used to produce graphs like Let's Encrypt's to determine SSL adoption on the Web over time.
Content
The public SSL dataset is a table where each row is a distinct set of dimensions, with their
associated SSL statistics. The dimensions are submission_date
, os
, and country
. The
statistics are reporting_ratio
, normalized_pageloads
, and ratio
.
Background and Caveats
- We're using normalized values in
normalized_pageloads
to obscure absolute page load counts. - This is across the entirety of release, not per-version, because we're looking at Web health, not Firefox user health.
- Any dimension tuple (any given combination of
submission_date
,os
, andcountry
) with fewer than 5000 page loads is omitted from the dataset. - This is hopefully just a temporary dataset to stopgap release aggregates going away until we can come up with a better way to publicly publish datasets.
Accessing the Data
For details on accessing the data, please look at bug 1414839.
Data Reference
Combining Rows
This is a dataset of ratios. You can't combine ratios if they have different bases. For example, if 50% of 10 loads (5 loads) were SSL and 5% of 20 loads (1 load) were SSL, you cannot calculate that 20% (6 loads) of the total loads (30 loads) were SSL unless you know that the 50% was for 10 and the 5% was for 20.
If you're reluctant, for product reasons, to share the numbers 10 and 20, this gets tricky.
So what we've done is normalize the whole batch of 30 down to 1. That means we tell you that 50% of one-third of the loads (0.333...) was SSL and 5% of the other two-thirds of the loads (0.666...) was SSL. Then you can figure out the overall 20% figure by this calculation:
0.5 * 0.333 + 0.05 * 0.666 = 0.2
For this dataset the same rule applies. To combine rows' ratios (to, for example, see what the
SSL ratio was across all os
and country
for a given submission_date
), you must first
multiply them by the rows' normalized_pageviews
values.
Or, in JavaScript:
let rows = query_result.data.rows;
let ratioForDateInQuestion = rows
.filter(row => row.submission_date == dateInQuestion)
.reduce((row, acc) => acc + row.normalized_pageloads * row.ratio, 0);
Schema
The data is output in re:dash API format:
"query_result": {
"retrieved_at": <timestamp>,
"query_hash": <hash>,
"query": <SQL>,
"runtime": <number of seconds>,
"id": <an id>,
"data_source_id": 26, // Athena
"data_scanned": <some really large number, as a string>,
"data": {
"data_scanned": <some really large number, as a number>,
"columns": [
{"friendly_name": "submission_date", "type": "datetime", "name": "submission_date"},
{"friendly_name": "os", "type": "string", "name": "os"},
{"friendly_name": "country", "type": "string", "name": "country"},
{"friendly_name": "reporting_ratio", "type": "float", "name": "reporting_ratio"},
{"friendly_name": "normalized_pageloads", "type": "float", "name": "normalized_pageloads"},
{"friendly_name": "ratio", "type": "float", "name": "ratio"}
],
"rows": [
{
"submission_date": "2017-10-24T00:00:00", // date string, day resolution
"os": "Windows_NT", // operating system family of the clients reporting the pageloads. One of "Windows_NT", "Linux", or "Darwin".
"country": "CZ", // ISO 639 two-character country code, or "??" if we have no idea. Determined by performing a geo-IP lookup of the clients that submitted the pings.
"reporting_ratio": 0.006825266611977031, // the ratio of pings that reported any pageloads at all. A number between 0 and 1. See [bug 1413258](https://bugzilla.mozilla.org/show_bug.cgi?id=1413258).
"normalized_pageloads": 0.00001759145263985348, // the proportion of total pageloads in the dataset that are represented by this row. Provided to allow combining rows. A number between 0 and 1.
"ratio": 0.6916961976822144 // the ratio of the pageloads that were performed over SSL. A number between 0 and 1.
}, ...
]
}
}
Scheduling
The dataset updates every 24 hours.
Code Reference
You can find the query that generates the SSL dataset here.
Sync Summary Reference
Introduction
Work in progress. Work is being tracked here.
Data Reference
Example Queries
Work in progress. Work is being tracked here
Sampling
Work in progress. Work is being tracked here
Scheduling
This dataset is updated daily, shortly after midnight UTC. The job is scheduled on Airflow. The DAG is here.
Schema
root
|-- app_build_id: string (nullable = true)
|-- app_display_version: string (nullable = true)
|-- app_name: string (nullable = true)
|-- app_version: string (nullable = true)
|-- app_channel: string (nullable = true)
|-- uid: string
|-- device_id: string (nullable = true)
|-- when: integer
|-- took: integer
|-- why: string (nullable = true)
|-- failure_reason: struct (nullable = true)
| |-- name: string
| |-- value: string (nullable = true)
|-- status: struct (nullable = true)
| |-- sync: string (nullable = true)
| |-- status: string (nullable = true)
|-- devices: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- id: string
| | |-- os: string
| | |-- version: string
|-- engines: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- name: string
| | |-- took: integer
| | |-- status: string (nullable = true)
| | |-- failure_reason: struct (nullable = true)
| | | |-- name: string
| | | |-- value: string (nullable = true)
| | |-- incoming: struct (nullable = true)
| | | |-- applied: integer
| | | |-- failed: integer
| | | |-- new_failed: integer
| | | |-- reconciled: integer
| | |-- outgoing: array (nullable = true)
| | | |-- element: struct (containsNull = false)
| | | | |-- sent: integer
| | | | |-- failed: integer
| | |-- validation: struct (containsNull = false)
| | | |-- version: integer
| | | |-- checked: integer
| | | |-- took: integer
| | | |-- failure_reason: struct (nullable = true)
| | | | |-- name: string
| | | | |-- value: string (nullable = true)
| | | |-- problems: array (nullable = true)
| | | | |-- element: struct (containsNull = false)
| | | | | |-- name: string
| | | | | |-- count: integer
Update
Introduction
The update ping
is sent from Firefox Desktop when a browser update is ready to be applied and after it was correctly applied.
It contains the build information and the update blob information, in addition to some information about the
user environment.
The telemetry_update_parquet
table is the most direct representation of an update ping.
Contents
The table contains one row for each ping. Each column represents one field from the update ping payload, though only a subset of all fields are included.
Accessing the Data
The data is stored as a parquet table in S3 at the following address. See this cookbook to get started working with the data in Spark.
s3://net-mozaws-prod-us-west-2-pipeline-data/telemetry-update-parquet/v1/
The telemetry_update_parquet
is accessible through re:dash.
Here's an example query.
Further Reading
This dataset is generated automatically using direct to parquet. The configuration responsible for generating this dataset was introduced in bug 1384861.
Data Reference
Schema
As of 2017-09-07, the current version of the telemetry_update_parquet
dataset is v1
, and has a schema as follows:
root
|-- id: string (nullable = true)
|-- client_id: string (nullable = true)
|-- metadata: struct (nullable = true)
| |-- timestamp: long (nullable = true)
| |-- date: string (nullable = true)
| |-- normalized_channel: string (nullable = true)
| |-- geo_country: string (nullable = true)
| |-- geo_city: string (nullable = true)
| |-- creation_timestamp: long (nullable = true)
| |-- x_ping_sender_version: string (nullable = true)
|-- application: struct (nullable = true)
| |-- displayVersion: string (nullable = true)
|-- environment: struct (nullable = true)
| |-- build: struct (nullable = true)
| | |-- application_name: string (nullable = true)
| | |-- architecture: string (nullable = true)
| | |-- version: string (nullable = true)
| | |-- build_id: string (nullable = true)
| | |-- vendor: string (nullable = true)
| | |-- hotfix_version: string (nullable = true)
| |-- partner: struct (nullable = true)
| | |-- distribution_id: string (nullable = true)
| | |-- distribution_version: string (nullable = true)
| | |-- partner_id: string (nullable = true)
| | |-- distributor: string (nullable = true)
| | |-- distributor_channel: string (nullable = true)
| | |-- partner_names: array (nullable = true)
| | | |-- element: string (containsNull = true)
| |-- settings: struct (nullable = true)
| | |-- telemetry_enabled: boolean (nullable = true)
| | |-- locale: string (nullable = true)
| | |-- update: struct (nullable = true)
| | | |-- channel: string (nullable = true)
| | | |-- enabled: boolean (nullable = true)
| | | |-- auto_download: boolean (nullable = true)
| |-- system: struct (nullable = true)
| | |-- os: struct (nullable = true)
| | | |-- name: string (nullable = true)
| | | |-- version: string (nullable = true)
| | | |-- locale: string (nullable = true)
| |-- profile: struct (nullable = true)
| | |-- creation_date: long (nullable = true)
|-- payload: struct (nullable = true)
| |-- reason: string (nullable = true)
| |-- target_channel: string (nullable = true)
| |-- target_version: string (nullable = true)
| |-- target_build_id: string (nullable = true)
| |-- target_display_version: string (nullable = true)
| |-- previous_channel: string (nullable = true)
| |-- previous_version: string (nullable = true)
| |-- previous_build_id: string (nullable = true)
|-- submission_date_s3: string (nullable = true)
For more detail on the raw ping these fields come from, see the raw data.
Work in Progress
This article is a work in progress. The work is being tracked in this bug.
Guide to our Experimental Tools
Shield
- Shield is an addon-based experimentation platform with fine-tuned enrollment criteria. The system add-on landed in FF 53.
- For the moment, it sends back data in its own
shield
type ping, so there's lots of flexibility in data you can collect. - Uses the Normandy server to serve out study “recipes” (?)
- Annotates the main ping in the environment/experiments block
- The shield system is itself a system add-on, so rolling out changes to the entire system does not require riding release trains
- Strategy and Insights (strategyandinsights@mozilla.com) team are product owners and shepherd the study development and release process along
- Opt-out experiments should be available soon?
- Further reading:
- https://wiki.mozilla.org/Firefox/SHIELD
- https://wiki.mozilla.org/Firefox/Shield/Shield_Studies
- https://mozilla.github.io/shield-studies-docs/study-process/
- When should you use SHIELD over other options?
Preference Flipping experiments
Uses Normandy, requires NO additional addon as long as a preference rides the release train
Heartbeat
Survey mechanism, also run via Normandy
Telemetry Experiments
Pre-release only https://gecko.readthedocs.io/en/latest/browser/experiments/experiments/index.html
Funnelcake
Custom builds of Firefox that are served to some percentage of the direct download population
Work in Progress
This document is a work in progress. The work is being tracked in this bug
Search Data
Introduction
This article introduces the datasets we maintain for search analyses. After reading this article, you should understand the search datasets well enough to produce moderately complex analyses.
Table of Contents
Permissions
Access to both search_aggregates
and search_clients_daily
is heavily restricted in re:dash.
We also maintain a restricted group for search on Github and Bugzilla.
If you reach a 404 on Github or don't have access to a re:dash query or bug
this is likely your issue.
To get access permissions, file a bug using the search permissions template
Once you have proper permissions,
you'll have access to a new source in re:dash called Presto Search
.
You will not be able to access any of the search datasets
via the standard Presto
data source, even with proper permissions.
Terminology
Direct vs Follow-on Search
Searches can be split into two major classes: direct and follow-on.
Direct searches result from a direct interaction with a search access point
(SAP),
which is part of the Firefox UI.
These searches are often called SAP searches.
There are currently 6 SAPs:
urlbar
- entering a search query in the Awesomebarsearchbar
- the main search bar; not present by default for new profiles on Firefox 57+newtab
- the search bar on theabout:newtab
pageabouthome
- the search bar on theabout:home
pagecontextmenu
- selecting text and clicking "Search" from the context menusystem
- starting Firefox from the command line with an option that immediately makes a search
Users will often interact with the Search Engine Results Page (SERP)
to create "downstream" queries.
These queries are called follow-on queries
.
These are sometimes also referred to as in-content queries
since they are initiated from the content of the page itself
and not from the Firefox UI.
For example, follow-on queries can be caused by:
- Revising a query (
restaurants
becomesrestaurants near me
) - Clicking on the "next" button
- Accepting spelling suggestions
Tagged vs Untagged Searches
Our partners (search engines) attribute queries to Mozilla using partner codes. When a user issues a query through one of our SAPs, we include our partner code in the URL of the resulting search.
Tagged queries are queries that include one of our partner codes.
Untagged queries are queries that do not include one of our partner codes. If a query is untagged, it's usually because we do not have a partner deal for that search engine and region.
If an SAP query is tagged, any follow-on query should also be tagged.
Standard Search Aggregates
We report three types of searches in our search datasets:
SAP
, tagged-sap
, and tagged-follow-on
.
These aggregates show up as columns in the
search_aggregates
and search_clients_daily
datasets.
Our search datasets are all derived from main_summary
.
The aggregate columns are derived from the SEARCH_COUNTS
histogram.
The SAP
column counts all SAP (or direct) searches.
SAP
search counts are collected via
probes
within the Firefox UI
These counts are very reliable, but do not count follow-on queries.
In 2017-06 we deployed the [followonsearch
addon],
which adds probes for tagged-sap
and tagged-follow-on
searches.
These columns attempt to count all tagged searches
by looking for Mozilla partner codes in the URL of requests to partner search engines.
These search counts are critical to understanding revenue
since they exclude untagged searches and include follow-on searches.
However, these search counts have important caveats affecting their reliability.
See In Content Telemetry Issues for more information.
In main_summary
, all of these searches are stored in search_counts.count
,
which makes it easy to over count searches.
Avoid using main_summary
for search analyses.
Outlier Filtering
We remove search count observations representing more than 10,000 searches for a single search engine in a single ping.
In Content Telemetry Issues
The [followonsearch
addon] implements the probe
used to measure tagged-sap
and tagged-follow-on
searches.
This probe is critical to understanding our revenue.
It's the only tool that gives us a view of follow-on searches
and differentiates between tagged and untagged queries.
However, it comes with some notable caveats.
Relies on whitelists
The [followonsearch
addon] attempts to count all tagged searches
by looking for Mozilla partner codes in the URL of requests to partner search engines.
To do this, the addon relies on a whitelist of partner codes and URL formats.
The list of partner codes is incomplete and only covers a few top partners.
These codes also occasionally change so there will be gaps in the data.
Additionally, changes to search engine URL formats can cause problems with our data collection. See this query for a notable example.
Addon uptake
This probe is shipped as an addon. Versions 55 and greater have the addon installed by default (Bug). The addon was deployed to older versions of Firefox via GoFaster, but uptake is not 100%.
Limited historical data
The addon was first deployed in 2017-06.
There is no tagged-*
search data available before this.
Search Aggregates
Introduction
search_aggregates
is designed to power high level search dashboards.
It's quick and easy to query, but the data are coarse.
In particular, this dataset allows you to segment
by a limited number of client characteristics which are relevant to search markets.
However, it is not possible to normalize by client count.
If you need fine-grained data, consider using search_clients_daily
which breaks down search counts by client
Contents
Each row of search_aggregates
contains
the standard search count aggregations
for each unique combination of the following columns.
Unless otherwise noted, these columns are taken directly from main_summary
.
submission_date
-yyyymmdd
engine
- e.g.google
,bing
,yahoo
source
- The UI component used to issue a search - e.g.urlbar
,abouthome
country
locale
addon_version
- The installed version of the [followonsearch
addon]app_version
distribution_id
-NULL
means the standard Firefox buildsearch_cohort
-NULL
except for small segments relating to search experimentation
There are three aggregation columns:
sap
, tagged-sap
, and tagged-follow-on
.
Each of these columns represent different types of searches.
For more details, see the search data documentation
Note that, if there were no such searches in a row's segment
(i.e. the count would be 0),
the column value is null
.
Accessing the Data
Access to search_aggregates
is heavily restricted.
You will not be able to access this table without additional permissions.
For more details see the search data documentation.
Data Reference
Example Queries
This query calculates daily US searches. If you have trouble viewing this query, it's likely you don't have the proper permissions. For more details see the search data documentation.
Scheduling
This job is scheduled on airflow to run daily.
Schema
As of 2018-02-13,
the current version of search_aggregates
is v3
,
and has a schema as follows.
The dataset is backfilled through 2016-06-06
root
|-- country: string (nullable = true)
|-- engine: string (nullable = true)
|-- source: string (nullable = true)
|-- submission_date: string (nullable = true)
|-- app_version: string (nullable = true)
|-- distribution_id: string (nullable = true)
|-- locale: string (nullable = true)
|-- search_cohort: string (nullable = true)
|-- addon_version: string (nullable = true)
|-- tagged-sap: long (nullable = true)
|-- tagged-follow-on: long (nullable = true)
|-- sap: long (nullable = true)
Code Reference
The search_aggregates
job is
defined in python_mozetl
Search Clients Daily
Introduction
search_clients_daily
is designed to enable client-level search analyses.
Querying this dataset can be slow;
consider using search_aggregates
for coarse analyses.
Contents
search_clients_daily
has one row for each unique combination of:
(client_id
, submission_date
, engine
, source
).
In addition to the standard search count aggregations,
this dataset includes some descriptive data for each client.
For example, we include country
and channel
for each row of data.
In the event that a client sends multiple pings on a given submission_date
we choose an arbitrary value from the pings for that (client_id
, submission_date
),
unless otherwise noted.
There are three standard search count aggregation columns:
sap
, tagged-sap
, and tagged-follow-on
.
Note that, if there were no such searches in a row's segment
(i.e. the count would be 0),
the column value is null
.
Each of these columns represent different types of searches.
For more details, see the search data documentation
Background and Caveats
search_clients_daily
does not include
(client_id
submission_date
) pairs
if we did not receive a ping for that submission_date
.
We impute a NULL
engine
and source
for pings with no search counts.
This ensures users who never search are included in this dataset.
This dataset is large.
Consider using an ATMO Spark cluster for heavy analyses.
If you're querying this dataset from re:dash,
heavily limit the data you read using submission_date_s3
or sample_id
.
Accessing the Data
Access to search_clients_daily
is heavily restricted.
You will not be able to access this table without additional permissions.
For more details see the search data documentation.
Data Reference
Example Queries
This query
calculates searches per normalized_channel
for US clients on an arbitrary day.
If you have trouble viewing this query,
it's likely you don't have the proper permissions.
For more details see the search data documentation.
Scheduling
This dataset is scheduled on Airflow (source).
Schema
As of 2018-03-23, the current version of search_clients_daily
is v2
,
and has a schema as follows.
It's backfilled through 2016-06-07
root
|-- client_id: string (nullable = true)
|-- submission_date: string (nullable = true)
|-- engine: string (nullable = true)
|-- source: string (nullable = true)
|-- country: string (nullable = true)
|-- app_version: string (nullable = true)
|-- distribution_id: string (nullable = true)
|-- locale: string (nullable = true)
|-- search_cohort: string (nullable = true)
|-- addon_version: string (nullable = true)
|-- os: string (nullable = true)
|-- channel: string (nullable = true)
|-- profile_creation_date: long (nullable = true)
|-- default_search_engine: string (nullable = true)
|-- default_search_engine_data_load_path: string (nullable = true)
|-- default_search_engine_data_submission_url: string (nullable = true)
|-- sample_id: string (nullable = true)
|-- sessions_started_on_this_day: long (nullable = true)
|-- profile_age_in_days: integer (nullable = true)
|-- subsession_hours_sum: double (nullable = true)
|-- active_addons_count_mean: double (nullable = true)
|-- max_concurrent_tab_count_max: integer (nullable = true)
|-- tab_open_event_count_sum: long (nullable = true)
|-- active_hours_sum: double (nullable = true)
|-- tagged-sap: long (nullable = true)
|-- tagged-follow-on: long (nullable = true)
|-- sap: long (nullable = true)
|-- tagged_sap: long (nullable = true)
|-- tagged_follow_on: long (nullable = true)
|-- submission_date_s3: string (nullable = true)
Code Reference
The search_clients_daily
job is
defined in python_mozetl
Other Datasets
These datasets are for projects outside of the Firefox telemetry domain.
hgpush
This dataset records facts about individual commits to the Firefox source tree
in the mozilla-central
source
code repository.
Data Reference
The dataset is accessible via STMO
.
Use the eng_workflow_hgpush_parquet_v1
table with the Athena
data source.
(The Presto
data source is also available, but much slower.)
Field Types and Descriptions
See the hgpush
ping schema
for a description of available fields.
Be careful to:
- Use the latest schema version. e.g.
v1
. Browse thehgpush
schema directory in the GitHub repo to be sure. - Change dataset field names from
camelCaseNames
tounder_score_names
in STMO. e.g.reviewSystemUsed
in the ping schema becomesreview_system_used
in STMO.
Example Queries
Select the number of commits with an 'unknown' review system in the last 7 days:
select
count(1)
from
eng_workflow_hgpush_parquet_v1
where
review_system_used = 'unknown'
and date_diff('day', from_unixtime(push_date), now()) < 7
Code Reference
The dataset is populated via the Commit Telemetry Service.
Obsolete Datasets
These datasets are no longer updated or maintained. Please reach out to the Data Platform team if you think your needs are best met by an obsolete dataset.
Heavy Users
As of 2018-05-18, this dataset has been deprecated and is no longer maintained. See Bug 1455314
Replacement
We've moved to assigning user's an active tag based on total_uri_count
, see
the Active DAU definition.
The activity of a user based on active_ticks
is available in clients_daily
in the active_hours_sum
field, which has the sum(active_ticks / 720)
.
To retrieve a client's 28-day active_hours
, use the following query:
SELECT submission_date_s3,
client_id,
SUM(active_hours_sum) OVER (PARTITION BY client_id
ORDER BY submission_date_s3 ASC
ROWS 27 PRECEDING) AS monthly_active_hours
FROM
clients_daily
Introduction
The heavy_users
table provides information about whether a given client_id
is
considered a "heavy user" on each day (using submission date).
Contents
The heavy_users
table contains one row per client-day, where day is
submission_date
. A client has a row for a specific submission_date
if
they were active at all in the 28 day window ending on that submission_date
.
A user is a "heavy user" as of day N if, for the 28 day period ending
on day N, the sum of their active_ticks
is in the 90th percentile (or
above) of all clients during that period. For more analysis on this,
and a discussion of new profiles, see
this link.
Background and Caveats
- Data starts at 20170801. There is technically data in the table before
this, but the
heavy_user
column isNULL
for those dates because it needed to bootstrap the first 28 day window. - Because it is top the 10% of clients for each 28 day period, more
than 10% of clients active on a given
submission_date
will be considered heavy users. If you join with another data source (main_summary
, for example), you may see a larger proportion of heavy users than expected. - Each day has a separate, but related, set of heavy users. Initial investigations show that approximately 97.5% of heavy users as of a certain day are still considered heavy users as of the next day.
- There is no "fixing" or weighting of new profiles - days before the
profile was created are counted as zero
active_ticks
. Analyses may need to use the includedprofile_creation_date
field to take this into account.
Accessing the Data
The data is available both via sql.t.m.o
and Spark.
In Spark:
spark.read.parquet("s3://telemetry-parquet/heavy_users/v1")
In SQL:
SELECT * FROM heavy_users LIMIT 3
Further Reading
The code responsible for generating this dataset is here
Data Reference
Example Queries
Example queries:
- Join
heavy_users
withmain_summary
to get distribution ofmax_concurrent_tab_count
for heavy vs. non-heavy users - Join
heavy_users
withlongitudinal
to get crash rates for heavy vs. non-heavy users
Scheduling
This dataset is updated daily via the telemetry-airflow infrastructure.
The job DAG runs every day after main_summary
is complete.
You can find the job definition
here.
Schema
As of 2017-10-05, the current version of the heavy_users
dataset is v1
, and has a schema as follows:
root
|-- client_id: string (nullable = true)
|-- sample_id: integer (nullable = true)
|-- profile_creation_date: long (nullable = true)
|-- active_ticks: long (nullable = true)
|-- active_ticks_period: long (nullable = true)
|-- heavy_user: boolean (nullable = true)
|-- prev_year_heavy_user: boolean (nullable = true)
|-- submission_date_s3: string (nullable = true)
Code Reference
This dataset is generated by telemetry-batch-view. Refer to this repository for information on how to run or augment the dataset.
Profile Creation - The technical part
What is a profile?
All of the changes a user makes in Firefox, like the home page, what toolbars you use, installed addons, saved passwords and your bookmarks, are all stored in a special folder, called a profile. Telemetry stores archived and pending pings in the profile directory as well as metadata like the client ID.
Every run of Firefox needs a profile. However a single installation can use multiple profiles for different runs. The profile folder is stored in a separate place from the Firefox program so that, if something ever goes wrong with Firefox, the profile information will still be there.
Firefox also comes with a Profile Manager, a different run mode to create, migrate and delete the profiles.
Profile Behaviors
In order to understand the behavior of users and base analysis on things like the profile creation date, it is essential to understand how a profile is created and identified by the browser. Also, it is important to understand how user actions with and within profiles affect our ability to reason about profiles from a data perspective. This includes resetting or deleting profiles or opting into or out of sending Telemetry data.
The different cases are described in more detail in the following sections.
Profile Creation
There are multiple ways a Firefox profile can be created. Some of these behave slightly differently.
Profiles can be created and managed by the Firefox Profile Manager:
- New profile on first launch
- New profile from Profile Manager
--createprofile
command line argument
Profiles can be created externally and not be managed by the Firefox Profile Manager:
--profile
command line argument
Managed: First use
When Firefox is opened for the first time after a fresh install, without any prior Firefox profile on disk visible to Firefox, it will create a new profile. Firefox uses "Default User" as the profile name, creates the profile's directory with a random suffix and marks the new profile as default for subsequent starts of Firefox. Read where Firefox stores your profile data.
Managed: Profile Manager creation
The user can create a new profile through the Profile Manager.
This can either be done on about:profiles
in a running Firefox or by starting Firefox with the --ProfileManager
flag.
The Profile Manager will ask for a name for the profile and picks a new directory for it.
The Profile Manager allows the user to create a new profile from an existing directory (in which case any files will be included) or from scratch (in which case the directory will be created).
The --createprofile
flag can be used from the command line and works the same as creating a profile through the Profile Manager.
Unmanaged: Command-line start
Firefox can be started on the command line with a path to a profile directory: firefox --profile path/to/directory
.
If the directory does not exist it will be created.
A profile created like this will not be picked up by the Profile Manager.
Its data will persist after Firefox is closed, but the Profile Manager will not know about it.
The profile will not turn up in about:profiles
.
Profile Reset
A user can reset the profile (see Refresh Firefox - reset addons and settings). This will copy over most user data to a new directory, keeping things like the history, bookmarks and cookies, but will remove extensions, modified preferences and added search engines.
A profile reset will not change the Telemetry clientID
.
The date of the most recent profile reset is saved and will be contained in Telemetry pings in the profile.resetDate
field.
Profile Deletion
A profile can be deleted through the Profile Manager, which will delete all stored data from disk. The profile can also be deleted by simply removing the profile's directory. We will never know about a deletion. We simply won't see that profile in new Telemetry data anymore.
Uninstalling the Firefox installation will not remove any profile data.
Note: Removing a profile's directory while it is in use is not recommended and will lead to a corrupt state.
Telemetry opt-out
The user can opt out of sending Telemetry data.
When the user opts out, Telemetry sends one "optout" ping, containing an empty payload.
The local clientID
is reset to a fixed value.
When a user opts into sending Telemetry data, a new clientID
is generated and used in subsequent pings.
The profile itself and the profile creation date are unaffected by this.
Profile Creation Date
The profile creation date is the assumed date of initial profile creation. However it proved to be not reliable for all cases. There are multiple ways this date is determined.
Managed: During Profile Creation
When a profile is created explicitly the profile directory is created and a times.json
containing a timestamp of the current time is stored inside that profile directory1.
It is read at later times when the profile creation date is used.
Relevant parts in the code: nsAppRunner::SelectProfile
calling nsToolkitProfileService::CreateProfile
.
Unmanaged: Empty profile directory
When --profile path/to/directory
is passed on the command line, the directory is created if it does not exist, but no times.json
is written2.
On the first access of the profile creation date (through ProfileAge.jsm
) the module will detect that the times.json
is missing.
It will then iterate through all files in the current profile's directory, reading file creation or modification timestamps.
The oldest of these timestamps is then assumed to be the profile creation date and written to times.json
.
Subsequent runs of Firefox will then use this date.
Relevant part in the code: nsAppRunner::SelectProfile
creating the directory.
Real World Usage
This page backs away from our profile-focused data view and examines what Firefox Desktop usage looks like in the real world. There are many components and layers that exist between a user acquiring and running Firefox, and this documentation will illuminate what those are and how they can affect the meaning of a profile.
Real Life Components of Firefox Desktop Usage
The above image illustrates all the layers that sit between a user acquiring and running Firefox Desktop and the Telemetry pings we receive.
- 1: The user
- A human being presumably.
- 2: The machine
- The physical hardware running Firefox.
- 3: The disk image / hard drive
- A single machine could have separate partitions running different OSes.
- Multiple machines could run copies of a single disk image
- Disk images are also used as backups to restore a machine.
- 4: OS user profile
- Most operating systems allow users to log into different user profiles with separate user directories (such as a "Guest" account).
- Usually, Firefox is installed into a system directory that all users profiles will share, but Firefox profiles are saved within the user directories, effectively segregating them.
- 5: Firefox binary / installer
- The downloaded binary package or stub installer which installs Firefox into the disk image. Users can get these from our website or one of our managed properties, but they can also acquire these from 3rd party sources as well.
- Our website is instrumented with Google Analytics to track download numbers, but other properties (FTP) and 3rd party sources are not. Google Analytics data is not directly connected to Telemetry data.
- A user can produce multiple installations from a single Firefox binary / installer. For example, if a user copies it to a USB stick or keeps it in cloud storage, they could install Firefox on multiple machines from a single binary / installer.
- 6: Firefox installation
- The installed Firefox program on a given disk image.
- Since Firefox is usually installed in a system directory, the single installation of Firefox will be shared by all the OS user profiles in the disk image.
- Stub installers are instrumented with pings to report new install counts, however, full binaries are not.
- 7: Firefox profile
- The profile Firefox uses during a user's session.
- A user can create multiple Firefox profiles using the Firefox Profile Manager, or by specifying a custom directory to use at startup. More details here.
- This is the entity that we see in Telemetry. Profiles send pings to Telemetry with a client ID as its identifier.
Desktop Browser Use Cases
Below are the rough categories of Firefox use cases that we know happen in the real world.
Note, these categories are rough approximations, and are not necessarily mutually exclusive.
Regular User
What we imagine a typical user to be. Someone who buys a computer, always uses a default OS user profile, downloads Firefox once, installs it, and continues using the default Firefox profile.
In Telemetry, this user would just show up as a single client ID.
Assuming they went through our normal funnel, they should show up once in Google Analytics as a download and once in stub installer pings as a new installation (if they used a stub installer).
Multi-Profile User
A more advanced user, who uses multiple Firefox profiles in their normal, everyday use, but otherwise is pretty 'normal' (uses the same OS user profile, etc.).
In Telemetry, this user would show up as 2 (or more) separate client IDs. We would have no way to know they came from the same computer and user without identifying that the subsessions are never overlapping and that large portions of the environment (CPU, GPU, Displays) are identical and that would be no guarantee.
Assuming they went through our normal funnel, they would show up once in Google Analytics as a download and once in stub installer pings as a new installation (if they used a stub installer).
However, any subsequent new Firefox profile creations would not have any corresponding downloads or installations. Since Firefox 55 however, any newly created profile will send a "new-profile" ping.
Shared Computer
A situation where there is a computer that is shared across multiple users and each user uses a different OS user profile. Since Firefox profiles live at the user directory level, each user would have a separate Firefox profile. Note, users logging in under a "Guest" account in most machines falls into this category.
In this case, every user who logged into this one computer with a different OS user profile would show up as a different client ID. We have no way of knowing they came from the same computer.
Furthermore, if the computer wiped the user directory after use, like Guest accounts and university computer labs often do, then they would show up as a new client ID every time they logged in, even if they have used the same computer multiple times. This use case could inflate new profile counts.
Similar to Multi-Profile Users, in this use case, there would be only one download event and install event (assuming normal funnel and stub installer), but multiple client ID's.
Cloned Machines
In this case, there are actually multiple users with computers that all share the same disk image at some point.
Think of the situation where the IT staff sets up the computer for a new hire at a company. Instead of going through to trouble of installing all the required programs and setting them up correctly for each computer, they'll do it once on one computer, save the disk image, and simply copy it over each time they need to issue a new machine.
Or think of the case where the IT staff of a library needs to set up 2 dozen machines at once.
In this case, depending on the state of the disk image when it was copied, we could see multiple client ID's for each user+machine, or we could see all the user+machines sharing the same client ID.
If the disk image was copied after a Firefox profile was created, then the old user+machine and new user+machine will share the same client ID, and be submitting pings to us concurrently.
If the disk image was copied after the Firefox installation but before an initial Firefox profile was created, then each user+machine will get their own Firefox profile and client ID when they run Firefox for the first time.
As with the Multi-Profile User and Shared Computer case, even though there could be multiple Firefox profiles in this use case, there will only be one download and install event.
Migrations
Type 1: Migrate Disk Image
A user has a backup of their disk image and when they switch to a new computer or their current computer crashes, they simply reboot from the old disk image.
In this case, the old machine and the new machine will just share the same client ID (assuming that the disk was copied after a Firefox profile was created). In fact, it will look exactly like the Cloned Machines case, except that instead of sending pings concurrently, they'll be sending us pings first from the old machine and then from the new machine.
Also, it should be noted that their Firefox profile will 'revert' back to the state that it was in when the disk image was copied, essentially starting over from the past, and any unsent pings on the image (if they exist) will be resent.
For instance, we will see another ping with the profile_subsession_count
(the count of how many subsessions a profile has seen in its history) we previously saw some time before.
Again, there will only be one download and install associated with this use case (assuming normal funnel and stub installer).
Type 2: Migrate OS User Directory
A user has a backup of their OS user directory and copies it to a new machine.
This is similar to Type 1 migration, but instead of copying the entire disk, the user only copies the OS user directory. Since the Firefox profile lives in the OS user directory, the old machine and new machine will share the same client ID.
The only difference is since the Firefox Installation lives in system directories, the client might have to re-download and re-install the browser. However, if they also copy the Firefox binary / installer, there will not be a download event, only an install event.
Type 3: Migrate Firefox Binary / Installer
A user has the Firefox binary or installer saved on their old machine and copies it over to a new machine to install Firefox.
In this case, there will not be a second download event, but there will be an install event and the new and old machines will have separate client ID's.
Type 4: Migrate Firefox Profile
A user copies their old Firefox profile from their old machine to a new computer, and runs Firefox using the copied Firefox profile.
In this case, since the Firefox profile is being copied over, both the new and the old machine will have profiles with the same client ID. Again, the profile on the new computer will revert back to the point in its history where it was copied. And since the profile contains any unsent Telemetry pings, we may receive duplicated submissions of pings from the same client ID.
If the Firefox binary / installer was downloaded, there will be a download and install event. If it was migrated via USB stick, it will only have an install event.
Profile History
A profile's history is simply the progression of that profile's subsessions over its lifetime. We can see this in our main pings by checking:
profile_subsession_counter
- A counter which starts at 1 on the very first run of a profile and increments for each subsession. This counter will be reset to 1 if a user resets / refreshes their profile.
subsession_start_date
- The date and time the subsession starts in, truncated to hours. This field is not always reliable due to local clock skew.
previous_subsession_id
- The ID of the previous subsession. Will be
null
for the very first subsession, or the first subsession after a user resets / refreshes their profile.
- The ID of the previous subsession. Will be
subsession_id
- The ID of the current subsession.
submission_date_s3
- The date we received the ping. This date is sourced from the server's time and reliable.
profile_reset_date
- The date the profile was reset. Will be
null
if the profile was not reset.
- The date the profile was reset. Will be
This is a nice clean example of profile history. It has a clear starting ping and it progresses linearly, with each subsession connecting to the next via subsession_id
. However, due to the fact that profiles can be shared across machines, and restored manually, etc. strange behaviors can arise (see Real World Usage).
Profile History Start Conditions
Under normal assumptions, we expect to see the starting ping in a profile's history in our telemetry data. The starting ping in the profile's history is the ping from their very first subsession. We expect this ping to have profile_subsession_counter = 1
and previous_subsession_id is null
and profile_reset_date is null
.
However, not all profiles appear in our data with a starting ping and instead appear to us mid-history.
History Has Beginning
As you can see, this profile starts with a ping where profile_subsession_counter = 1
and previous_subsession_id is null
.
History Has No Beginning
In this example, the profile simply appears in our data mid-history, with presumably the 25th subsession in it's history. Its previous history is a mystery.
Profile History Progression Events
After a profile appears, in 'normal' conditions, there should be a linear, straightforward progression with each subsession linking to the next.
However, the following abnormal events can occur.
History Gap
There is a gap in the profile history.
It's possible this behavior is due to dropped pings.
Here, we see a gap between the 30th ping and the 41st ping and the 44th ping.
History Splits
The history of a profile splits, and after a single subsession, there are two (or more) subsessions that link back to it.
This is probably due to cloned machines or disk image restores. Note, after the profile splits, the two branches might continue concurrently or one branch might die while the other continues. It is very hard to distinguish between the different branches of the same profile.
- Profile begins
- Profile splits: branch 1
- Profile splits: branch 2
In this example, the profile history starts normally, but on the 5th ping, the history splits into two branches that seem to progress concurrently.
History Restarts
The history of a profile suddenly starts over, with a brand new starting ping.
- Profile begins
- Profile restarts
Here, we see the profile start their history normally, but then they begin a new, totally unconnected branch with a starting ping that is not the same as the original starting ping (different subsession_id
s).
History Reruns
How to Order History
Contributing
Documentation is critical to making a usable data platform. When surveying our users, their most common complaint has been our lack of documentation. It's important that we improve our documentation as often as possible.
Bug reports
If you see an error in the documentation or want to extend a chapter, please file a bug.
Getting the Raw Documentation
The documentation is intended to be read as HTML at
docs.telemetry.mozilla.org
.
However, we store the documentation in raw text files in the
firefox-data-docs
repo.
To begin contributing to the docs, fork the firefox-data-docs
repo.
Building the Documentation
The documentation is rendered with mdBook
To build the documentation locally,
you'll need to install the mdbook-dtmo
wrapper.
Binary builds are provided at https://github.com/badboy/mdbook-dtmo/releases.
You can install a binary build directly:
curl -LSfs https://japaric.github.io/trust/install.sh | sh -s -- --git badboy/mdbook-dtmo
If you have Rust installed, you can build and install mdbook-dtmo
:
cargo install --git https://github.com/badboy/mdbook-dtmo
You can then serve the documentation locally with:
mdbook-dtmo serve
The complete documentation for the mdBook toolchain is at: https://rust-lang-nursery.github.io/mdBook/.
If you run into any technical limitations, let me (@harterrt
) know.
I'm happy to change the tooling to make it as much fun as possible to write.
Adding a new article
Be sure to link to your new article from SUMMARY.md
, or mdBook will not render the file.
The structure of the repository is outlined in this article.
This documentation is under active development, so we may already be working on the documentation you need. Take a look at this bug component to check.
Style Guide
Articles should be written in Markdown (not AsciiDoc). Markdown is usually powerful enough and is a more common technology than AsciiDoc.
Limit lines to 100 characters where possible. Try to split lines at the end of sentences, or use Semantic Line Breaks. This makes it easier to reorganize your thoughts later.
This documentation is meant to be read digitally. Keep in mind that people read digital content much differently than other media. Specifically, readers are going to skim your writing, so make it easy to identify important information.
Use visual markup like bold text, code blocks
, and section headers.
Avoid long paragraphs.
Short paragraphs that describe one concept each makes finding important information easier.
Spell checking
Articles should use proper spelling, and pull requests will be automatically checked for spelling errors.
Technical articles often contain words that are not recognized by common dictionaries, if this
happens you may either put specialized terms in code blocks
, or you may add an exception to
the .spelling
file in the code repository.
For things like dataset names or field names, code blocks
should be preferred. Things like
project names or common technical terms should be added to the .spelling
file.
To run the spell checker locally,
install the markdown-spellcheck
library,
then run the scripts/spell_check.sh
script from the root of the repository.
You may also remove the --report
parameter to begin an interactive fixing session. In this
case, it is highly recommended to also add the --no-suggestions
parameter, which greatly
speeds things up.
Link checking
Any web links should be valid. A dead link might not be your fault, but you will earn a lot of good karma by fixing a dead link!
To run the link checker locally, install the markdown-link-check
library, then run the scripts/link_check.sh
script from the root of the repository.
Supported Plugins
Mermaid
You may use mermaid.js
diagrams in code blocks:
graph LR
you -->|write|docs
docs --> profit!
Which will be rendered as:
Review
Once you're happy with your contribution, please open a PR and flag @harterrt
for review.
Please squash your changes into meaningful commits and follow these
commit message guidelines.
Publishing
The documentation is hosted on Github Pages.
Updates to the documentation are automatically published to
docs.telemetry.mozilla.org
when changes are merged.
To publish to your own fork of this repo, changes need to be pushed manually. Use the deploy script to publish new changes.
This script depends on
ghp-import
.
Keep in mind that this will deploy the docs to your origin
repo.
If you're working from a fork (which you should be),
deploy.sh
will update the docs hosted from your fork - not the production docs.
Colophon
This document's structure is heavily influenced by Django's Documentation Style Guide.
You can find more context for this document in this blog post.
Documentation Structure
The directory structure is meant to feel comfortable for those familiar with the data platform:
.
|--src
|--datasets - contains dataset level documentation
|--tools - contains tool level documentation
|--concepts - contains tutorials meant to introduce a new concept to the reader
|--cookbooks - focused code examples for reference
The prose documentation is meant to take the reader from beginner to expert. To this end, the rendered documentation has an order different from the directory structure:
- Getting Started: Get some simple analysis completed so the user understands the amount of work involved / what the product feels like
- Tutorials
- Data Tutorials: tutorials meant to give the reader a complete understanding of a specific dataset. Start with a high level overview, then move on to completely document the data including Data source, Sampling, Common Issues, and where the reader can find the code.
- Tools tutorials: Tutorials meant to introduce a single data tool or analysis best practice.
- Cookbooks
- Reference material - TBD