Choosing a dataset
This document is a quick overview of the datasets provided by the data pipeline team. After reading this document, you should have a good idea of what dataset is the most useful for your analysis.
Before making final decisions from your findings, consider getting your query or analysis reviewed by a member of the data pipeline team.
Accessing a dataset
Playing with the data is the best way to get comfortable. All of these datasets can be queried from STMO (AKA: re:dash). STMO will give you an interactive SQL interface, which should be sufficient for getting comfortable with the data. If you need to complete more complicated analyses or want to productionize your code, see Analyzing Telemetry Data.
Overview
Unless noted otherwise, these datasets summarize the "main" Telemetry ping type. The main ping payload contains most of the measurements used to track performance and health of Firefox in the wild. It can be useful to refer back to the source ping documentation to understand how the source data was generated.
Datasets
Table of Contents
Main Summary
The main_summary
table is the most direct representation of a main ping but
can be difficult to work with due to its size.
Rows and Columns
The main_summary
table contains one row for each ping. Each column
represents one field from the main ping payload, though only a subset of all
main ping fields are included.
Background and Caveats
This table is massive, and due to it's size, it can be difficult to work with.
You should avoid querying main_summary
from
re:dash. Your queries will be slow to
complete and can impact performance for other users, since re:dash on a
shared cluster.
Instead, we recommend using the longitudinal
or cross_sectional
dataset
where possible. If these datasets do not suffice, consider using Spark on an
ATMO cluster. In the odd case where these
queries are necessary, make use of the sample_id
field and limit to a short
submission date range.
Longitudinal
The longitudinal
dataset is a 1% sample of main ping data organized to
roughly represent a user. If you're not sure which dataset to use for your
analysis, this is probably what you want.
Rows and Columns
Each row in the longitudinal
dataset represents one client_id
, which is a
rough approximation of a user. Each column represents a field from the main
ping. Most fields contain arrays of values, with one value for each ping
associated with a client_id. Using arrays give you access to the raw data from
each ping, but can be difficult to work with from SQL. Take a look at the
longitudinal examples if you get stuck.
Background and Caveats
Think of the longitudinal table as wide and short. The dataset contains more
columns than main_summary
and down-samples to 1% of all clients to reduce
query computation time and save resources.
In summary, the longitudinal table differs from main_summary
in two important
ways:
- The longitudinal dataset groups all data so that one row represents a client_id
- The longitudinal dataset samples to 1% of all client_ids
Cross Sectional
The cross_sectional
dataset provides descriptive statistics for each
client_id in a 1% sample of main ping data. This dataset simplifies the
longitudinal table by replacing the longitudinal arrays with summary
statistics. This is the most useful dataset for describing our user base.
Rows and Columns
Each row in the cross_sectional
dataset represents one client_id
, which is
a rough approximation to a user. Each column is a summary statistic describing
the client_id.
For example, the longitudinal table has a row called geo_country
which
contains an array of country codes. For the same client_id the
cross_sectional
table has columns called geo_country_mode
and
geo_country_configs
containing single summary statistics for the modal country and the
number of distinct countries in the array.
client_id |
geo_country |
geo_country_mode |
geo_country_configs |
---|---|---|---|
1 | array<"US"> | "US" | 1 |
2 | array<"DE", "DE" "US"> | "DE" | 2 |
Background and Caveats
This table is much easier to work with than the longitudinal dataset because you don't need to work with arrays. Note that this dataset is a summary of the longitudinal dataset, so it is also a 1% sample of all client_ids.
Client Count
The client_count
dataset is simply a count of clients in a time period,
stratified over a set of dimensions.
Rows and Columns
This dataset includes columns for a dozen factors and an HLL variable. The
hll
column contains a
HyperLogLog variable, which is an
approximation to the exact count. The factor columns include activity date and
the dimensions listed
here.
Each row represents one combinations of the factor columns.
Background and Caveats
It's important to understand that the hll
column is not a standard count.
The hll
variable avoids double-counting users when aggregating over
multiple days. The HyperLogLog variable is a far more efficient way to count
distinct elements of a set, but comes with some complexity. To find the
cardinality of an HLL use cardinality(cast(hll AS HLL))
. To find the union
of two HLL's over different dates, use merge(cast(hll AS HLL))
. The Firefox
ER Reporting Query is
a good example to review. Finally, Roberto has a relevant writeup
here.
Crash Aggregates
The Crash Aggregates dataset compiles crash statistics over various dimensions for each day.
Rows and Columns
There's one column for each of the stratifying dimensions and the crash statistics. Each row is a distinct set of dimensions, along with their associated crash stats. Example stratifying dimensions include channel and country, example statistics include usage hours and plugin crashes. See the complete documentation for all available dimensions and statistics.
Appendix
Mobile Metrics
There are several tables owned by the mobile team documented here:
- android_events
- android_clients
- android_addons
- mobile_clients
Wiki For reference, a previous version of this document was stored
here.
Who to Contact
Feel free to open an issue or contact Ryan Harter (@harterrt) to identify errors or confusing parts of this documentation. I will be reviewing this documentation quarterly for accuracy.