Choosing a Dataset
This document will help you find the best data source for a given analysis.
This guide focuses on descriptive datasets and does not cover experimentation. For example, this guide will help if you need to answer questions like:
- How many users do we have in Germany, and how many crashes do we see per day?
- How many users have a given addon installed?
If you're interested in figuring out whether there's a causal link between two events, take a look at our tools for experimentation.
Table of Contents
- Raw Pings
- Main Ping Derived Datasets
- Crash Ping Derived Datasets
- New-Profile Derived Datasets
- Update Derived Dataset
- Other Datasets
- Obsolete Datasets
- Appendix
Raw Pings
We receive data from our users via pings. There are several types of pings, each containing different measurements and sent for different purposes. To review a complete list of ping types and their schemata, see this section of the Mozilla Source Tree Docs.
Many pings are also described by a JSONSchema specification which can be found in this repository.
Background and Caveats
The large majority of analyses can be completed using only the main ping. This ping includes histograms, scalars, events, and other performance and diagnostic data.
Few analyses actually rely directly on the raw ping data. Instead, we provide derived datasets which are processed versions of these data, made to be:
- Easier and faster to query
- Organized to make the data easier to analyze
- Cleaned of erroneous or misleading data
Before analyzing raw ping data, check to make sure there isn't already a derived dataset made for your purpose. If you do need to work with raw ping data, be aware that loading the data can take a while. Try to limit the size of your data by controlling the date range, etc.
Accessing the Data
You can access raw ping data from an ATMO cluster using the Dataset API. Raw ping data are not available in re:dash.
Further Reading
You can find the complete ping documentation. To augment our data collection, see Collecting New Data and the Data Collection Policy.
Main Ping Derived Datasets
The main ping contains most of the measurements used to track performance and health of Firefox in the wild. This ping includes histograms, scalars, and events.
This section describes the derived datasets we provide to make analyzing this data easier.
longitudinal
The longitudinal dataset is a 1% sample of main ping data organized so that each row corresponds to a client_id.
If you're not sure which dataset to use for your analysis, this is probably what you want.
Contents
Each row in the longitudinal dataset represents one client_id, which is approximately a user.
Each column represents a field from the main ping.
Most fields contain arrays of values, with one value for each ping associated with a client_id.
Using arrays gives you access to the raw data from each ping, but can be difficult to work with from SQL.
Here's a query showing some sample data
to help illustrate.
Take a look at the longitudinal examples if you get stuck.
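To make the one-row-per-client layout concrete, here is a minimal sketch in plain Python. The ping records and field names are hypothetical, and the array ordering simply follows input order; it only illustrates the shape of the data, not how the real dataset is built.

```python
from collections import defaultdict

# Hypothetical per-ping records; the field names are illustrative only.
pings = [
    {"client_id": "a", "submission_date": "20170101", "os": "Windows"},
    {"client_id": "b", "submission_date": "20170101", "os": "Linux"},
    {"client_id": "a", "submission_date": "20170102", "os": "Windows"},
]


def to_longitudinal(pings):
    """Group pings by client_id; each field becomes an array with one value per ping."""
    rows = defaultdict(lambda: defaultdict(list))
    for ping in pings:
        for field, value in ping.items():
            if field != "client_id":
                rows[ping["client_id"]][field].append(value)
    return {client_id: dict(fields) for client_id, fields in rows.items()}


longitudinal = to_longitudinal(pings)
print(longitudinal["a"]["submission_date"])  # ['20170101', '20170102']
```

Client "a" ends up with a single row whose fields hold two values each, one per ping, which is why array functions are needed when querying this table from SQL.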
Background and Caveats
Think of the longitudinal table as wide and short.
The dataset contains more columns than main_summary and down-samples to 1% of all clients to reduce query computation time and save resources.
In summary, the longitudinal table differs from main_summary in two important ways:
- The longitudinal dataset groups all data so that one row represents a client_id
- The longitudinal dataset samples to 1% of all client_ids
Please note that this dataset only contains release (or opt-out) histograms and scalars.
Accessing the Data
The longitudinal dataset is available in re:dash, though it can be difficult to work with the array values in SQL.
Take a look at this example query.
The data is stored as a parquet table in S3 at the following address. See this cookbook to get started working with the data in Spark.
s3://telemetry-parquet/longitudinal/
main_summary
The main_summary table is the most direct representation of a main ping but can be difficult to work with due to its size.
Prefer the longitudinal dataset unless using the sampled data is prohibitive.
Contents
The main_summary table contains one row for each ping.
Each column represents one field from the main ping payload,
though only a subset of all main ping fields are included.
This dataset does not include histograms.
Background and Caveats
This table is massive, and due to its size, it can be difficult to work with.
You should avoid querying main_summary from re:dash.
Your queries will be slow to complete and can impact performance for other users, since re:dash is on a shared cluster.
Instead, we recommend using the longitudinal or clients_daily dataset where possible.
If these datasets do not suffice, consider using Spark on an ATMO cluster.
In the odd case where these queries are necessary, make use of the sample_id field and limit to a short submission date range.
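The advice above amounts to two filters applied as early as possible. The sketch below shows the idea on in-memory rows; the column name submission_date_s3, the YYYYMMDD string format, and the sample values are assumptions for illustration, so check the table schema before relying on them.

```python
# Hypothetical rows standing in for main_summary records.
rows = [
    {"client_id": "a", "sample_id": 42, "submission_date_s3": "20180103"},
    {"client_id": "b", "sample_id": 7, "submission_date_s3": "20180103"},
    {"client_id": "c", "sample_id": 42, "submission_date_s3": "20180215"},
]


def limit_rows(rows, sample_id, start_date, end_date):
    """Keep one sample bucket and a short, inclusive submission date range."""
    return [
        row
        for row in rows
        if row["sample_id"] == sample_id
        # YYYYMMDD strings sort chronologically, so string comparison works here.
        and start_date <= row["submission_date_s3"] <= end_date
    ]


limited = limit_rows(rows, sample_id=42, start_date="20180101", end_date="20180107")
print([row["client_id"] for row in limited])  # ['a']
```

Restricting to a single sample_id value keeps every ping for the selected clients, so per-client results stay complete while the volume of data scanned drops sharply.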
Accessing the Data
The data is stored as a parquet table in S3 at the following address. See this cookbook to get started working with the data in Spark.
s3://telemetry-parquet/main_summary/v4/
Though not recommended, main_summary is accessible through re:dash.
Here's an example query.
Your queries will be slow to complete and can impact performance for other users, since re:dash is on a shared cluster.
Further Reading
The technical documentation for main_summary is located in the telemetry-batch-view documentation.
The code responsible for generating this dataset is here.
first_shutdown_summary
The first_shutdown_summary table is a summary of the first-shutdown ping.
Contents
The first-shutdown ping contains first session usage data. The dataset has rows similar to those of telemetry_new_profile_parquet, but in the shape of main_summary.
Background and Caveats
Ping latency was reduced through the shutdown ping-sender mechanism in Firefox 55. To maintain consistent historical behavior, the first main ping is not sent until the second start up. In Firefox 57, a separate first-shutdown ping was created to evaluate first-shutdown behavior while maintaining backwards compatibility.
In many cases, the first-shutdown ping is a duplicate of the main ping. The first-shutdown summary can be used in conjunction with the main summary by taking the union and deduplicating on the document_id.
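The union-and-deduplicate step can be sketched as follows. The rows and the reason field are hypothetical; the only property relied on is that a duplicated ping carries the same document_id in both tables.

```python
# Hypothetical rows; a duplicated ping shares its document_id across tables.
main_summary_rows = [
    {"document_id": "doc-1", "client_id": "a", "reason": "shutdown"},
    {"document_id": "doc-2", "client_id": "b", "reason": "daily"},
]
first_shutdown_rows = [
    {"document_id": "doc-1", "client_id": "a", "reason": "shutdown"},
    {"document_id": "doc-3", "client_id": "c", "reason": "shutdown"},
]


def union_dedup(*tables):
    """Concatenate tables and keep the first row seen for each document_id."""
    seen = set()
    combined = []
    for table in tables:
        for row in table:
            if row["document_id"] not in seen:
                seen.add(row["document_id"])
                combined.append(row)
    return combined


combined = union_dedup(main_summary_rows, first_shutdown_rows)
print([row["document_id"] for row in combined])  # ['doc-1', 'doc-2', 'doc-3']
```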
Accessing the Data
The data can be accessed as first_shutdown_summary. It is currently stored in the following path.
s3://telemetry-parquet/first_shutdown_summary/v4/
The data is backfilled to 2017-09-22, the date of its first nightly appearance. This data should be available to all releases on and after Firefox 57.
client_count_daily
The client_count_daily dataset is useful for estimating user counts over a few pre-defined dimensions.
The client_count_daily dataset is similar to the deprecated client_count dataset, except that it is aggregated by submission date and not activity date.
Content
This dataset includes columns for a dozen factors and an HLL variable.
The hll column contains a HyperLogLog variable, which is an approximation to the exact count.
The factor columns include submission date and the dimensions listed here.
Each row represents one combination of the factor columns.
Background and Caveats
It's important to understand that the hll column is not a standard count.
The hll variable avoids double-counting users when aggregating over multiple days.
The HyperLogLog variable is a far more efficient way to count distinct elements of a set, but comes with some complexity.
To find the cardinality of an HLL, use cardinality(cast(hll AS HLL)).
To find the union of two HLLs over different dates, use merge(cast(hll AS HLL)).
The Firefox ER Reporting Query is a good example to review.
Finally, Roberto has a relevant write-up here.
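To see why merging matters, the sketch below uses exact Python sets as a stand-in for HLL sketches: len() plays the role of cardinality and set union plays the role of merge. The client IDs and dates are made up, and a real HLL returns an approximation rather than an exact count.

```python
# Exact sets stand in for HLL sketches; the data is hypothetical.
daily_clients = {
    "20180101": {"a", "b", "c"},
    "20180102": {"b", "c", "d"},
}

# Adding up per-day counts counts clients "b" and "c" twice.
sum_of_daily_counts = sum(len(clients) for clients in daily_clients.values())

# Merging first, then taking the cardinality, counts each client once.
merged = set().union(*daily_clients.values())
distinct_clients = len(merged)

print(sum_of_daily_counts)  # 6
print(distinct_clients)     # 4
```

The same reasoning is why a two-day user count should be computed as the cardinality of the merged sketches, not as the sum of two per-day cardinalities.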
Accessing the Data
The data is available in Re:dash. Take a look at this example query.
We don't recommend accessing this data from ATMO.
Further Reading
churn
The churn dataset tracks the 7-day churn rate of telemetry profiles. This dataset is generally used for analyzing cohort churn across segments and time.
Content
Churn is the rate of attrition defined by (clients seen in week N)/(clients seen in week 0) for groups of clients with some shared attributes. A group of clients with shared attributes is called a cohort. The cohorts in this dataset are created every week and can be tracked over time using the acquisition_date and the weeks since acquisition or current_week.
The following example demonstrates the current logic for generating this dataset. Each column represents the days since some arbitrary starting date.
client | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 | 12 | 13 | 14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A | X |  |  |  |  |  |  | X |  |  |  |  |  |  |  |
B | X | X | X | X | X | X |  |  |  |  |  |  |  |  |  |
C | X |  |  |  |  |  |  |  |  |  |  |  |  |  | X |
All three clients are part of the same cohort. Client A is retained for weeks 0 and 1 since there is activity in both periods. A client only needs to show up once in the period to be counted as retained. Client B is acquired in week 0 and is active frequently but does not appear in following weeks. Client B is considered churned on week 1. However, a client that is churned can become retained again. Client C is considered churned on week 1 but retained on week 2.
The table below summarizes the daily activity above into a weekly view, where each column represents the current week since the acquisition date.
client | 0 | 1 | 2 |
---|---|---|---|
A | X | X |  |
B | X |  |  |
C | X |  | X |
The clients are then grouped into cohorts by attributes. An attribute describes a property about the cohort such as the country of origin or the binary distribution channel. Each group also contains descriptive aggregates of engagement. Each metric describes the activity of a cohort such as size and overall usage at a given time instance.
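The weekly bucketing and the churn ratio can be sketched in a few lines. The activity days below are hypothetical values chosen to match the narrative for clients A, B, and C, and the functions are an illustration of the definition, not the job that produces the dataset.

```python
# Hypothetical activity: days since the cohort start on which each client was seen.
activity = {
    "A": [0, 7],
    "B": [0, 1, 2, 3, 4, 5],
    "C": [0, 14],
}


def weeks_active(days):
    """Map daily activity to the set of weeks (0-indexed) with at least one active day."""
    return {day // 7 for day in days}


def retention_by_week(activity, n_weeks):
    """Return (clients seen in week N) / (clients seen in week 0) for each week."""
    cohort = [client for client, days in activity.items() if 0 in weeks_active(days)]
    if not cohort:
        return []
    rates = []
    for week in range(n_weeks):
        seen = sum(1 for client in cohort if week in weeks_active(activity[client]))
        rates.append(seen / len(cohort))
    return rates


rates = retention_by_week(activity, n_weeks=3)
print(sorted(weeks_active(activity["C"])))  # [0, 2]
```

Here all three clients are seen in week 0, while only A is seen in week 1 and only C in week 2, so the ratio is one third for both later weeks.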
Background and Caveats
- Each row in this dataset describes a unique segment of users
- The number of rows is exponential with the number of dimensions
- New fields should be added sparingly to account for data-set size
- The dataset lags by 10 days in order to account for submission latency
  - This value was determined to be the time for 99% of main pings to arrive at the server. With the shutdown-ping sender, this has been reduced to 4 days. However, churn_v3 still tracks releases older than Firefox 55.
- The start of the period is fixed to Sundays. Once it has been aggregated, the period cannot be shifted due to the way clients are counted.
  - A supplementary 1-day retention dataset using HyperLogLog for client counts is available for counting over arbitrary retention periods and date offsets. Additionally, calculating churn or retention over specific cohorts is tractable in STMO with the main_summary or clients_daily datasets.
Accessing the Data
churn is available in Re:dash under Athena and Presto. The data is also available in parquet for consumption by columnar data engines at s3://telemetry-parquet/churn/v3.
retention
The retention table provides client counts relevant to client retention at a 1-day granularity. The project is tracked in Bug 1381840.
Contents
The retention table contains a set of attribute columns used to specify a cohort of users and a set of metric columns to describe cohort activity. Each row contains a permutation of attributes, an approximate set of clients in a cohort, and the aggregate engagement metrics.
This table uses the HyperLogLog (HLL) sketch to create an approximate set of clients in a cohort. HLL allows counting across overlapping cohorts in a single pass while avoiding the problem of double counting. This data structure has the benefit of being compact and performant in the context of retention analysis, at the expense of precision. For example, a 7-day retention figure can be obtained by aggregating over a week of retention data using the union operation. With SQL primitives alone, this requires recalculating COUNT DISTINCT over the client_ids in the 7-day window.
Background and Caveats
- The data starts at 2017-03-06, the merge date where Nightly started to track Firefox 55 in Mozilla-Central. However, there was not a consistent view into the behavior of first session profiles until the new_profile ping. This means much of the data is inaccurate before 2017-06-26.
- This dataset uses a 4-day reporting latency to aggregate at least 99% of the data in a given submission date. This figure is derived from the telemetry-health measurements on submission latency, with the discussion in Bug 1407410. This latency metric was reduced in Firefox 55 with the introduction of the shutdown ping-sender mechanism.
- Caution should be taken before adding new columns. Additional attribute columns will grow the number of rows exponentially.
- The number of HLL bits chosen for this dataset is 13. This means the default size of the HLL object is 2^13 bits or 1KiB. This maintains about a 1% error on average. See this table from Algebird's HLL implementation for more details.
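For intuition about how a 13-bit HLL behaves, here is a toy sketch in plain Python. It is not the Algebird or presto-hyperloglog implementation: the SHA-1 based hash, the list of 2^13 registers, and the bias constants are assumptions taken from the textbook algorithm, and the memory layout differs from the production libraries.

```python
import hashlib
import math

P = 13       # index bits, matching the figure quoted above
M = 1 << P   # number of registers in this toy sketch


def _hash64(value):
    """Derive a 64-bit integer from a string (assumed hash, for illustration)."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def add(sketch, value):
    h = _hash64(value)
    index = h >> (64 - P)                     # top P bits select a register
    rest = h & ((1 << (64 - P)) - 1)          # remaining 51 bits
    rank = (64 - P) - rest.bit_length() + 1   # position of the leftmost 1-bit
    sketch[index] = max(sketch[index], rank)


def build(values):
    sketch = [0] * M
    for value in values:
        add(sketch, value)
    return sketch


def merge(a, b):
    """Set union of two sketches: element-wise maximum of the registers."""
    return [max(x, y) for x, y in zip(a, b)]


def cardinality(sketch):
    alpha = 0.7213 / (1 + 1.079 / M)
    estimate = alpha * M * M / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    if estimate <= 2.5 * M and zeros:
        estimate = M * math.log(M / zeros)    # small-range correction
    return estimate


# Two overlapping days of hypothetical clients: 10,000 distinct clients in total.
day1 = build(f"client-{i}" for i in range(0, 6000))
day2 = build(f"client-{i}" for i in range(4000, 10000))
estimate = cardinality(merge(day1, day2))
print(round(estimate))
```

Merging the two daily sketches gives exactly the sketch that would have been built from the combined set of clients, which is why overlapping days are not double counted; the estimate itself should land within a few percent of 10,000.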
Accessing the Data
The data is primarily available through Re:dash on STMO via the Presto source. This service has been configured to use predefined HLL functions.
The column should first be cast to the HLL type. The scalar cardinality(<hll_column>) function will approximate the number of unique items per HLL object. The aggregate merge(<hll_column>) function will perform the set union between all objects in a column.
Example: Cast the count column into the appropriate type.
SELECT cast(hll as HLL) as n_profiles_hll FROM retention
Count the number of clients seen over all attribute combinations.
SELECT cardinality(cast(hll as HLL)) FROM retention
Group-by and aggregate client counts over different release channels.
SELECT channel, cardinality(merge(cast(hll AS HLL)))
FROM retention
GROUP BY channel
The HyperLogLog library wrappers are available for use outside of the configured STMO environment: spark-hyperloglog and presto-hyperloglog.
Also see the client_count_daily dataset.
clients_daily
The clients_daily table is intended as the first stop for asking questions about how people use Firefox. It should be easy to answer simple questions.
Each row in the table represents a (client_id, submission_date) pair and contains a number of aggregates about that day's activity.
Contents
Many questions about Firefox take the form "What did clients with characteristics X, Y, and Z do during the period S to E?" The clients_daily table is aimed at answering those questions.
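The client-day layout can be illustrated by rolling per-ping records up to one row per (client_id, submission_date). The field and aggregate names below are hypothetical; the real table defines its own set of aggregates.

```python
# Hypothetical per-ping records; field names are illustrative only.
pings = [
    {"client_id": "a", "submission_date": "20180101", "country": "DE", "subsession_hours": 1.5},
    {"client_id": "a", "submission_date": "20180101", "country": "DE", "subsession_hours": 0.5},
    {"client_id": "a", "submission_date": "20180102", "country": "DE", "subsession_hours": 2.0},
    {"client_id": "b", "submission_date": "20180101", "country": "US", "subsession_hours": 3.0},
]


def to_clients_daily(pings):
    """Aggregate pings into one row per (client_id, submission_date)."""
    daily = {}
    for ping in pings:
        key = (ping["client_id"], ping["submission_date"])
        row = daily.setdefault(
            key, {"country": ping["country"], "pings": 0, "subsession_hours_sum": 0.0}
        )
        row["pings"] += 1
        row["subsession_hours_sum"] += ping["subsession_hours"]
    return daily


clients_daily = to_clients_daily(pings)

# "What did clients in DE do between 20180101 and 20180102?"
hours_de = sum(
    row["subsession_hours_sum"]
    for (client_id, date), row in clients_daily.items()
    if row["country"] == "DE" and "20180101" <= date <= "20180102"
)
print(hours_de)  # 4.0
```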
Accessing the Data
The data is stored as a parquet table in S3 at the following address.
s3://telemetry-parquet/clients_daily/v6/
The clients_daily table is accessible through re:dash using the Athena data source. It is also available via the Presto data source, though Athena should be preferred for performance and stability reasons.
Here's an example query.
Crash Ping Derived Datasets
The crash ping is captured after the main Firefox process crashes or after a content process crashes, whether or not the crash report is submitted to crash-stats.mozilla.org.
It includes non-identifying metadata about the crash.
This section describes the derived datasets we provide to make analyzing this data easier.
crash_aggregates
The crash_aggregates dataset compiles crash statistics over various dimensions for each day.
Rows and Columns
There's one column for each of the stratifying dimensions and the crash statistics. Each row is a distinct set of dimensions, along with their associated crash stats. Example stratifying dimensions include channel and country; example statistics include usage hours and plugin crashes. See the complete documentation for all available dimensions and statistics.
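Because each row is already an aggregate, rates should be computed by summing the statistics across rows first and dividing afterwards. The sketch below does this for a crash rate per 1,000 usage hours; the column names usage_hours and main_crashes and the numbers are hypothetical.

```python
from collections import defaultdict

# Hypothetical aggregate rows: one per distinct combination of dimensions.
rows = [
    {"channel": "release", "country": "DE", "usage_hours": 2000.0, "main_crashes": 4},
    {"channel": "release", "country": "US", "usage_hours": 3000.0, "main_crashes": 1},
    {"channel": "nightly", "country": "DE", "usage_hours": 500.0, "main_crashes": 3},
]


def crashes_per_1000_hours(rows, dimension):
    """Sum hours and crashes per dimension value, then divide once at the end."""
    totals = defaultdict(lambda: {"hours": 0.0, "crashes": 0})
    for row in rows:
        totals[row[dimension]]["hours"] += row["usage_hours"]
        totals[row[dimension]]["crashes"] += row["main_crashes"]
    return {
        value: 1000.0 * total["crashes"] / total["hours"]
        for value, total in totals.items()
        if total["hours"] > 0
    }


rates_by_channel = crashes_per_1000_hours(rows, "channel")
print(rates_by_channel)  # {'release': 1.0, 'nightly': 6.0}
```

Averaging per-row rates instead would weight a small country the same as a large one, which is usually not what is wanted.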
Accessing the Data
This dataset is accessible via re:dash.
The data is stored as a parquet table in S3 at the following address. See this cookbook to get started working with the data in Spark.
s3://telemetry-parquet/crash_aggregates/v1/
Further Reading
The technical documentation for this dataset can be found in the telemetry-batch-view documentation.
crash_summary
The crash_summary table is the most direct representation of a crash ping.
Contents
The crash_summary table contains one row for each crash ping.
Each column represents one field from the crash ping payload,
though only a subset of all crash ping fields are included.
Accessing the Data
The data is stored as a parquet table in S3 at the following address. See this cookbook to get started working with the data in Spark.
s3://telemetry-parquet/crash_summary/v1/
crash_summary is accessible through re:dash.
Here's an example query.
Further Reading
The technical documentation for crash_summary is located in the telemetry-batch-view documentation.
The code responsible for generating this dataset is here.
New-Profile Derived Datasets
The new-profile ping is sent from Firefox Desktop on the first session of a newly created profile and contains the initial information about the user environment.
This data is available in the telemetry_new_profile_parquet dataset.
The telemetry_new_profile_parquet table is the most direct representation of a new-profile ping.
Contents
The table contains one row for each ping. Each column represents one field from the new-profile ping payload, though only a subset of all fields are included.
Accessing the Data
The data is stored as a parquet table in S3 at the following address. See this cookbook to get started working with the data in Spark.
s3://net-mozaws-prod-us-west-2-pipeline-data/telemetry-new-profile-parquet/v2/
The telemetry_new_profile_parquet table is accessible through re:dash.
Here's an example query.
Further Reading
This dataset is generated automatically using direct to parquet. The configuration responsible for generating this dataset was introduced in bug 1360256.
Update Derived Dataset
The update ping is sent from Firefox Desktop when a browser update is ready to be applied and after it was correctly applied.
It contains the build information and the update blob information, in addition to some information about the user environment.
The telemetry_update_parquet table is the most direct representation of an update ping.
Contents
The table contains one row for each ping. Each column represents one field from the update ping payload, though only a subset of all fields are included.
Accessing the Data
The data is stored as a parquet table in S3 at the following address. See this cookbook to get started working with the data in Spark.
s3://net-mozaws-prod-us-west-2-pipeline-data/telemetry-update-parquet/v1/
The telemetry_update_parquet table is accessible through re:dash.
Here's an example query.
Further Reading
This dataset is generated automatically using direct to parquet. The configuration responsible for generating this dataset was introduced in bug 1384861.
Other Datasets
Public crash statistics for Firefox are available through the Data Platform in a socorro_crash dataset.
The crash data in Socorro is sanitized and made available to ATMO and STMO.
A nightly import job converts batches of JSON documents into a columnar format using the associated JSON Schema.
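As a rough picture of what a schema-driven conversion to columns involves, the sketch below turns a batch of JSON documents into one list per schema property. The schema, field names, and documents are hypothetical, and the real import job writes parquet rather than Python lists.

```python
import json

# Hypothetical schema and documents; real Socorro fields will differ.
schema = {"properties": {"product": {"type": "string"}, "uptime": {"type": "integer"}}}
batch = [
    '{"product": "Firefox", "uptime": 120}',
    '{"product": "Firefox"}',
]


def to_columns(schema, json_docs):
    """Build one column per schema property; missing fields become None."""
    docs = [json.loads(doc) for doc in json_docs]
    return {name: [doc.get(name) for doc in docs] for name in schema["properties"]}


columns = to_columns(schema, batch)
print(columns["uptime"])  # [120, None]
```

Driving the conversion from the schema rather than from each document keeps the set of columns stable even when individual documents omit fields.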
Contents
Accessing the Data
The dataset is available in parquet at s3://telemetry-parquet/socorro_crash/v2.
It is also indexed with Athena and Presto with the table name socorro_crash.
Obsolete Datasets
heavy_users
The heavy_users table provides information about whether a given client_id is considered a "heavy user" on each day (using submission date).
Contents
The heavy_users table contains one row per client-day, where day is submission_date. A client has a row for a specific submission_date if they were active at all in the 28 day window ending on that submission_date.
A user is a "heavy user" as of day N if, for the 28 day period ending on day N, the sum of their active_ticks is in the 90th percentile (or above) of all clients during that period. For more analysis on this, and a discussion of new profiles, see this link.
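The definition above can be sketched as a windowed sum followed by a percentile cutoff. The integer day indexes, the sample data, and in particular the way the 90th percentile cutoff is picked are assumptions made for illustration; the production job may compute the percentile differently.

```python
# Hypothetical active_ticks per client, keyed by integer day index.
ticks_by_client = {f"c{i}": {27: (i + 1) * 10} for i in range(10)}
ticks_by_client["old"] = {-10: 9999}  # only active before the window


def heavy_users(ticks_by_client, day, window=28, percent=90):
    """Return clients whose windowed active_ticks sum is at or above the cutoff."""
    totals = {}
    for client, ticks_by_day in ticks_by_client.items():
        in_window = [t for d, t in ticks_by_day.items() if day - window < d <= day]
        if in_window:  # a client needs any activity in the window to get a row
            totals[client] = sum(in_window)
    if not totals:
        return set()
    ranked = sorted(totals.values())
    # Assumed percentile rule: the value at rank floor(percent * n / 100).
    cutoff = ranked[(percent * len(ranked)) // 100]
    return {client for client, total in totals.items() if total >= cutoff}


heavy = heavy_users(ticks_by_client, day=27)
print(heavy)  # {'c9'}
```

With ten clients active in the window, only the client with the largest 28-day sum clears the cutoff, and the client whose activity falls outside the window is not considered at all.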
Background and Caveats
- Data starts at 20170801. There is technically data in the table before this, but the heavy_user column is NULL for those dates because it needed to bootstrap the first 28 day window.
- Because it is the top 10% of clients for each 28 day period, more than 10% of clients active on a given submission_date will be considered heavy users. If you join with another data source (main_summary, for example), you may see a larger proportion of heavy users than expected.
- Each day has a separate, but related, set of heavy users. Initial investigations show that approximately 97.5% of heavy users as of a certain day are still considered heavy users as of the next day.
- There is no "fixing" or weighting of new profiles - days before the profile was created are counted as zero active_ticks. Analyses may need to use the included profile_creation_date field to take this into account.
Accessing the Data
The data is available both via sql.t.m.o and Spark.
In Spark:
spark.read.parquet("s3://telemetry-parquet/heavy_users/v1")
In SQL:
SELECT * FROM heavy_users LIMIT 3
Further Reading
The code responsible for generating this dataset is here.
Appendix
Mobile Metrics
There are several tables owned by the mobile team documented here:
android_events
android_clients
android_addons
mobile_clients