Churn
Introduction
The churn dataset tracks the 7-day churn rate of telemetry profiles. This dataset is generally used for analyzing cohort churn across segments and time.
Content
Churn is the rate of attrition defined by 1 - (clients seen in week N) / (clients seen in week 0)
for groups of clients with some shared attributes. A group of clients with
shared attributes is called a cohort. The cohorts in this dataset are created
every week and can be tracked over time using the `acquisition_date` and the
weeks since acquisition, or `current_week`.
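For example (with illustrative numbers only): if a cohort of 1,000 clients is acquired in week 0 and 150 of those clients are seen again during week 3, then week-3 retention is 150 / 1000 = 0.15 and week-3 churn is 1 - 0.15 = 0.85.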
The following example demonstrates the current logic for generating this dataset. Each column represents the days since some arbitrary starting date.
client | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 | 12 | 13 | 14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A | X | X |   |   |   |   |   | X |   |   |   |   |   |   |   |
B | X | X | X | X | X | X |   |   |   |   |   |   |   |   |   |
C | X |   |   |   |   |   |   |   |   |   |   |   |   |   | X |
All three clients are part of the same cohort. Client A is retained for weeks 0 and 1 since there is activity in both periods; a client only needs to show up once in a period to be counted as retained. Client B is acquired in week 0 and is active frequently, but does not appear in the following weeks, so client B is considered churned on week 1. However, a client that has churned can become retained again: client C is considered churned on week 1 but retained on week 2.
The following table summarizes the daily activity above into a weekly view, where each column represents the number of weeks since the acquisition date.
client | 0 | 1 | 2 |
---|---|---|---|
A | X | X | |
B | X |   |   |
C | X |   | X |
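The weekly roll-up can be sketched in SQL. The sketch below is a simplified illustration only, not the production logic: it assumes a hypothetical `daily_activity(client_id, activity_date)` table that is not part of this dataset, and it ignores the Sunday alignment of periods noted in the caveats below.

-- Simplified sketch: bucket each activity day into the number of weeks elapsed
-- since the client's first appearance. `daily_activity` is a hypothetical table.
WITH acquisition AS (
    SELECT client_id,
           min(activity_date) AS acquisition_date
    FROM daily_activity
    GROUP BY 1
)
SELECT a.client_id,
       date_diff('day', acq.acquisition_date, a.activity_date) / 7 AS current_week
FROM daily_activity a
JOIN acquisition acq ON a.client_id = acq.client_id
GROUP BY 1, 2

A client counts as retained in a given week as long as it appears at least once in that bucket, which is how the tables above were derived.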
The clients are then grouped into cohorts by attributes. An attribute describes a property of the cohort, such as the country of origin or the binary distribution channel. Each group also contains descriptive aggregates of engagement; each metric describes the activity of the cohort, such as its size and overall usage, at a given point in time.
Background and Caveats
- Each row in this dataset describes a unique segment of users
- The number of rows grows exponentially with the number of dimensions
- New fields should be added sparingly to account for dataset size
- The dataset lags by 10 days in order to account for submission latency
  - This value was determined to be the time needed for 99% of main pings to
    arrive at the server. With the shutdown-ping sender, this has been reduced
    to 4 days. However, `churn_v3` still tracks releases older than Firefox 55.
- The start of the period is fixed to Sundays. Once it has been aggregated, the
period cannot be shifted due to the way clients are counted.
- A supplementary 1-day `retention` dataset, which uses HyperLogLog for client
  counts, is available for counting over arbitrary retention periods and date
  offsets. Additionally, calculating churn or retention over specific cohorts
  is tractable in STMO with the `main_summary` or `clients_daily` datasets
  (see the sketch after this list).
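A rough version of that cohort calculation over `clients_daily` is sketched below. This is an assumption-laden sketch rather than a maintained query: it presumes that `clients_daily` exposes `client_id` and a `submission_date_s3` string partition in yyyyMMdd format, and the cohort and retention windows are arbitrary.

-- Sketch: of the clients seen during the week starting 2017-08-06, what fraction
-- is seen again the following week? Dates and column names are illustrative.
WITH cohort AS (
    SELECT DISTINCT client_id
    FROM clients_daily
    WHERE submission_date_s3 BETWEEN '20170806' AND '20170812'
),
next_week AS (
    SELECT DISTINCT client_id
    FROM clients_daily
    WHERE submission_date_s3 BETWEEN '20170813' AND '20170819'
)
SELECT count(n.client_id) * 1.0 / count(c.client_id) AS week_1_retention
FROM cohort c
LEFT JOIN next_week n ON c.client_id = n.client_id

The corresponding churn is simply 1 - week_1_retention.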
Accessing the Data
`churn` is available in Re:dash under Athena and Presto. The data is also
available in parquet for consumption by columnar data engines at
`s3://telemetry-parquet/churn/v3`.
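For example, a quick exploratory query in Re:dash (illustrative only) confirms access and shows which cohort weeks are currently available:

-- List the most recent cohort weeks present in the dataset.
SELECT acquisition_period,
       sum(n_profiles) AS n_profiles
FROM churn
GROUP BY 1
ORDER BY acquisition_period DESC
LIMIT 5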
Data Reference
Example Queries
This section walks through a typical query to generate data suitable for visualization. The query uses the following fields:
field | type | description |
---|---|---|
cohort_date | common, attribute | The start date bucket of the cohort. This is the week the client was acquired. |
elapsed_periods | common, attribute | The number of periods that have elapsed since the cohort date. In this dataset, the retention period is 7 days. |
channel | attribute | Part of the release train model. An attribute that distinguishes cohorts. |
geo | filter attribute | Country code. Used to filter out all countries other than the 'US' |
n_profiles | metric | Count of users in a cohort. Use sum to aggregate. |
First the fields are extracted and aliased for consistency. `cohort_date` and
`elapsed_periods` are common to most retention queries and are useful concepts
for building on other datasets.
WITH extracted AS (
    -- alias the churn columns to the generic cohort terminology used below
    SELECT acquisition_period AS cohort_date,
           current_week AS elapsed_periods,
           n_profiles,
           channel,
           geo
    FROM churn
),
The extracted table is filtered down to the attributes of interest. The cohorts
of interest originate in the US and are in the release or beta channels. Note
that `channel` here is the concatenation of the normalized channel and the
funnelcake id. Only cohorts appearing after August 6, 2017 are chosen to be in
this population.
population AS (
    SELECT channel,
           cohort_date,
           elapsed_periods,
           n_profiles
    FROM extracted
    WHERE geo = 'US'
      AND channel IN ('release', 'beta')
      AND cohort_date > '20170806'
      -- filter out noise from clients with incorrect dates
      AND elapsed_periods >= 0
      AND elapsed_periods < 12
),
The number of profiles is aggregated by the cohort dimensions. The cohort acquisition date and elapsed periods since acquisition are fundamental to cohort analysis.
cohorts AS (
    SELECT channel,
           cohort_date,
           elapsed_periods,
           sum(n_profiles) AS n_profiles
    FROM population
    GROUP BY 1, 2, 3
),
The resulting table has the following structure; for demonstration, it is sorted by the first three columns.
channel | cohort_date | elapsed_periods | n_profiles |
---|---|---|---|
release | 20170101 | 0 | 100 |
release | 20170101 | 1 | 90 |
release | 20170101 | 2 | 80 |
... | ... | ... | ... |
beta | 20170128 | 10 | 25 |
Finally, retention is calculated as the number of profiles at a given
`elapsed_periods` value relative to the initial period. This data can be imported
into a pivot table for further analysis.
results AS (
    SELECT c.*,
           iv.n_profiles AS total_n_profiles,
           (0.0 + c.n_profiles) * 100 / iv.n_profiles AS percentage_n_profiles
    FROM cohorts c
    JOIN (
        -- iv: the initial value of each cohort at elapsed_periods = 0
        SELECT *
        FROM cohorts
        WHERE elapsed_periods = 0
    ) iv ON (
        c.cohort_date = iv.cohort_date
        AND c.channel = iv.channel
    )
)
channel | cohort_date | elapsed_periods | n_profiles | total_n_profiles | percentage_n_profiles |
---|---|---|---|---|---|
release | 20170101 | 0 | 100 | 100 | 100.0 |
release | 20170101 | 1 | 90 | 100 | 90.0 |
release | 20170101 | 2 | 80 | 100 | 80.0 |
... | ... | ... | ... | ... | ... |
beta | 20170128 | 10 | 25 | 50 | 50.0 |
Obtain the results.
SELECT *
FROM results
You may consider visualizing the results using cohort graphs, line charts, or pivot tables. See Firefox Telemetry Retention: Dataset Example Usage for more examples.
Scheduling
The aggregated churn data is updated weekly on Wednesday.
Schema
As of 2017-10-15, the current version of `churn` is `v3` and has the following schema:
root
|-- channel: string (nullable = true)
|-- geo: string (nullable = true)
|-- is_funnelcake: string (nullable = true)
|-- acquisition_period: string (nullable = true)
|-- start_version: string (nullable = true)
|-- sync_usage: string (nullable = true)
|-- current_version: string (nullable = true)
|-- current_week: long (nullable = true)
|-- source: string (nullable = true)
|-- medium: string (nullable = true)
|-- campaign: string (nullable = true)
|-- content: string (nullable = true)
|-- distribution_id: string (nullable = true)
|-- default_search_engine: string (nullable = true)
|-- locale: string (nullable = true)
|-- is_active: string (nullable = true)
|-- n_profiles: long (nullable = true)
|-- usage_hours: double (nullable = true)
|-- sum_squared_usage_hours: double (nullable = true)
|-- total_uri_count: long (nullable = true)
|-- unique_domains_count_per_profile: double (nullable = true)
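As an illustration of how the aggregate columns combine, per-profile engagement for a single cohort week can be computed directly from this schema. The `acquisition_period` value below is hypothetical.

-- Average usage hours per profile for one cohort week, by channel.
-- '20170806' is an illustrative acquisition_period value.
SELECT channel,
       sum(n_profiles) AS n_profiles,
       sum(usage_hours) / sum(n_profiles) AS usage_hours_per_profile
FROM churn
WHERE acquisition_period = '20170806'
GROUP BY 1
ORDER BY n_profiles DESC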
Code Reference
The script for generating `churn` currently lives in mozilla/python_mozetl. The job can be found in `mozetl/engagement/churn`.