Churn

Introduction

The churn dataset tracks the 7-day churn rate of telemetry profiles. This dataset is generally used for analyzing cohort churn across segments and time.

Content

Churn is the rate of attrition defined by (clients seen in week N)/(clients seen in week 0) for groups of clients with some shared attributes. A group of clients with shared attributes is called a cohort. The cohorts in this dataset are created every week and can be tracked over time using the acquisition_date and the weeks since acquisition or current_week.

The following example demonstrates the current logic for generating this dataset. Each column represents the days since some arbitrary starting date.

client 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14
A X X
B X X X X X X
C X X

All three clients are part of the same cohort. Client A is retained for weeks 0 and 1 since there is activity in both periods. A client only needs to show up once in the period to be counted as retained. Client B is acquired in week 0 and is active frequently but does not appear in following weeks. Client B is considered churned on week 1. However, a client that is churned can become retained again. Client C is considered churned on week 1 but retained on week 3.

The following table summarizes the above daily activity into the following view where every column represents the current week since acquisition date..

client 0 1 2
A X X
B X
C X X

The clients are then grouped into cohorts by attributes. An attribute describes a property about the cohort such as the country of origin or the binary distribution channel. Each group also contains descriptive aggregates of engagement. Each metric describes the activity of a cohort such as size and overall usage at a given time instance.

Background and Caveats

  • Each row in this dataset describes a unique segment of users
    • The number of rows is exponential with the number of dimensions
    • New fields should be added sparing to account for data-set size
  • The dataset lags by 10 days in order account for submission latency
    • This value was determined to be time for 99% of main pings to arrive at the server. With the shutdown-ping sender, this has been reduced to 4 days. However, churn_v3 still tracks releases older than Firefox 55.
  • The start of the period is fixed to Sundays. Once it has been aggregated, the period cannot be shifted due to the way clients are counted.
    • A supplementary 1-day retention dataset using HyperLogLog for client counts is available for counting over arbitrary retention periods and date offsets. Additionally, calculating churn or retention over specific cohorts is tractable in STMO with main_summary or clients_daily datasets.

Accessing the Data

churn is available in Re:dash under Athena and Presto. The data is also available in parquet for consumption by columnar data engines at s3://telemetry-parquet/churn/v3.

Data Reference

Example Queries

This section walks through a typical query to generate data suitable for visualization.

field type description
cohort_date common, attribute The start date bucket of the cohort. This is week the client was acquired.
elapsed_periods common, attribute The number of periods that have elapsed since the cohort date. In this dataset, the retention period is 7 days.
channel attribute Part of the release train model. An attribute that distinguishes cohorts.
geo filter attribute Country code. Used to filter out all countries other than the 'US'
n_profiles metric Count of users in a cohort. Use sum to aggregate.

First the fields are extracted and aliased for consistency. cohort_date and elapsed_periods are common to most retention queries and are useful concepts for building on other datasets.

WITH extracted AS (
    SELECT acquisition_period AS cohort_date,
           current_week AS elapsed_periods,
           n_profiles,
           channel,
           geo
    FROM churn
),

The extracted table is filtered down to the attributes of interest. The cohorts of interest originate in the US and are in the release or beta channels. Note that channel here is the concatenation of the normalized channel and the funnelcake id. Only cohorts appearing after August 6, 2017 are chosen to be in this population.

 population AS (
    SELECT channel,
           cohort_date,
           elapsed_periods,
           n_profiles
    FROM extracted
    WHERE geo = 'US'
      AND channel IN ('release', 'beta')
      AND cohort_date > '20170806'
      -- filter out noise from clients with incorrect dates
      AND elapsed_periods >= 0
      AND elapsed_periods < 12
),

The number of profiles is aggregated by the cohort dimensions. The cohort acquisition date and elapsed periods since acquisition are fundamental to cohort analysis.

 cohorts AS (
     SELECT channel,
            cohort_date,
            elapsed_periods,
            sum(n_profiles) AS n_profiles
     FROM population
     GROUP BY 1, 2, 3
),

The table will have the following structure. The table is sorted by the first three columns for demonstration.

channel cohort_date elapsed_periods n_profiles
release 20170101 0 100
release 20170101 1 90
release 20170101 2 80
... ... ... ...
beta 20170128 10 25

Finally, retention is calculated through the number of profiles at the time of the elapsed_period relative to the initial period. This data can be imported into a pivot table for further analysis.

results AS (
    SELECT c.*,
           iv.n_profiles AS total_n_profiles,
           (0.0+c.n_profiles)*100/iv.n_profiles AS percentage_n_profiles
    FROM cohorts c
    JOIN (
        SELECT *
        FROM cohorts
        WHERE elapsed_periods = 0
    ) iv ON (
        c.cohort_date = iv.cohort_date
        AND c.channel = iv.channel
    )
)
channel cohort_date elapsed_periods n_profiles total_n_profiles percentage_n_profiles
release 20170101 0 100 100 1.0
release 20170101 1 90 100 0.9
release 20170101 2 80 100 0.8
... .... ... ... ... ...
beta 20170128 10 25 50 0.5

Obtain the results.

SELECT *
FROM results

You may consider visualizing using cohort graphs, line charts, or a pivot tables. See Firefox Telemetry Retention: Dataset Example Usage for more examples.

Scheduling

The aggregated churn data is updated weekly on Wednesday.

Schema

As of 2017-10-15, the current version of churn is v3 and has a schema as follows:

root
 |-- channel: string (nullable = true)
 |-- geo: string (nullable = true)
 |-- is_funnelcake: string (nullable = true)
 |-- acquisition_period: string (nullable = true)
 |-- start_version: string (nullable = true)
 |-- sync_usage: string (nullable = true)
 |-- current_version: string (nullable = true)
 |-- current_week: long (nullable = true)
 |-- source: string (nullable = true)
 |-- medium: string (nullable = true)
 |-- campaign: string (nullable = true)
 |-- content: string (nullable = true)
 |-- distribution_id: string (nullable = true)
 |-- default_search_engine: string (nullable = true)
 |-- locale: string (nullable = true)
 |-- is_active: string (nullable = true)
 |-- n_profiles: long (nullable = true)
 |-- usage_hours: double (nullable = true)
 |-- sum_squared_usage_hours: double (nullable = true)
 |-- total_uri_count: long (nullable = true)
 |-- unique_domains_count_per_profile: double (nullable = true)

Code Reference

The script for generating churn currently lives in mozilla/python_mozetl. The job can be found in mozetl/engagement/churn.