# Longitudinal Reference

## Introduction

The `longitudinal` dataset is a 1% sample of main ping data organized so that each row corresponds to a `client_id`. If you're not sure which dataset to use for your analysis, this is probably what you want.
### Contents

Each row in the `longitudinal` dataset represents one `client_id`, which is approximately a user. Each column represents a field from the main ping. Most fields contain arrays of values, with one value for each ping associated with a `client_id`. Using arrays gives you access to the raw data from each ping, but they can be difficult to work with from SQL. Here's a query showing some sample data to help illustrate.
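The snippet below is a sketch rather than a tested query: it assumes the table is loaded as a PySpark DataFrame named `longitudinal` (see Accessing the Data below), and the column names `os` and `submission_date` are assumptions based on main ping fields, not a confirmed schema.

```python
# A minimal sketch, assuming `longitudinal` is a DataFrame holding this
# dataset (see "Accessing the Data" below). The column names `os` and
# `submission_date` are assumed, not taken from a confirmed schema.
longitudinal.select("client_id", "os", "submission_date").show(3, truncate=False)

# Expect one row per client_id, with arrays holding one value per ping, e.g.:
#   client_id | os                       | submission_date
#   <uuid>    | [Windows_NT, Windows_NT] | [2017-06-02, 2017-06-01]
```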
Take a look at the longitudinal examples if you get stuck.
### Background and Caveats

Think of the longitudinal table as wide and short. The dataset contains more columns than `main_summary` and down-samples to 1% of all clients to reduce query computation time and save resources.

In summary, the longitudinal table differs from `main_summary` in two important ways:

- The longitudinal dataset groups all data so that one row represents a `client_id`
- The longitudinal dataset samples to 1% of all `client_id`s
Please note that this dataset only contains release (or opt-out) histograms and scalars.
### Accessing the Data

The `longitudinal` dataset is available in re:dash, though it can be difficult to work with the array values in SQL. Take a look at this example query.

The data is stored as a parquet table in S3 at the following address; see this cookbook to get started working with the data in Spark.

```
s3://telemetry-parquet/longitudinal/
```
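As a starting point, here is a minimal, untested sketch for loading the table in PySpark. The S3 path is the one above; the `explode` step shows one way to flatten an array-valued column, and the column name `os` is an assumption.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

# A minimal sketch: load the longitudinal parquet table from S3.
spark = SparkSession.builder.appName("longitudinal-example").getOrCreate()
longitudinal = spark.read.parquet("s3://telemetry-parquet/longitudinal/")

# One way to tame the array values: flatten an array column so each
# ping gets its own row (the column name `os` is an assumption).
per_ping = longitudinal.select("client_id", explode("os").alias("os"))
per_ping.show(5)
```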
## Data Reference

### Example Queries

Take a look at the Longitudinal Examples Cookbook.
### Sampling

#### Pings Within Last 6 Months

The `longitudinal` dataset filters to `main` pings from within the last 6 months.
#### 1% Sample

The longitudinal dataset samples down to 1% of all clients in the above sample. The sample is generated by the following process (a sketch in code follows the list):

- hash the `client_id` for each ping from the last 6 months
- project that hash onto an integer from 1:100, inclusive
- filter to pings with `client_id`s matching a 'magic number' (in this case, 42)
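In code, the selection looks roughly like the sketch below. This is conceptual only: CRC32 is an assumption standing in for whatever hash the pipeline actually uses, and the production pipeline exposes the resulting bucket as `main_summary.sample_id` (see below).

```python
from zlib import crc32

MAGIC_NUMBER = 42  # the 'magic number' from the list above


def in_longitudinal_sample(client_id: str) -> bool:
    # Conceptual sketch only: CRC32 is an assumed stand-in for the
    # pipeline's actual hash function. Project the hash onto 100
    # buckets and keep clients that land on the magic number.
    bucket = crc32(client_id.encode("utf-8")) % 100
    return bucket == MAGIC_NUMBER
```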
This process has a couple of nice properties:

- The sample is consistent over time. The `longitudinal` dataset is regenerated weekly, and with this process the clients included in each run are very similar. The only changes come from never-before-seen clients, or clients without a ping in the last 180 days.
- We don't need to adjust the sample as new clients enter or exit our pool.
More practically, the sample is created by filtering to pings with `main_summary.sample_id == 42`. If you're working with `main_summary`, you can recreate this sample by applying this filter manually.
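For example, a minimal PySpark sketch, assuming `main_summary` is already loaded as a DataFrame (whether `sample_id` is stored as a string or an integer should be checked against the schema):

```python
# A sketch: recreate the 1% longitudinal sample from main_summary.
# `main_summary` is assumed to be an already-loaded DataFrame;
# sample_id is compared as a string here, so verify the actual type.
longitudinal_sample = main_summary.where(main_summary.sample_id == "42")
```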
### Scheduling

The `longitudinal` job is run weekly, early on Sunday morning UTC. The job is scheduled on Airflow; the DAG is here.
### Schema

TODO(harter): https://bugzilla.mozilla.org/show_bug.cgi?id=1361862
## Code Reference

This dataset is generated by [telemetry-batch-view](https://github.com/mozilla/telemetry-batch-view). Refer to this repository for information on how to run or augment the dataset.