1 Day Retention

Introduction

The retention table provides client counts relevant to client retention at a 1-day granularity. The project is tracked in Bug 1381840

Contents

The retention table contains a set of attribute columns used to specify a cohort of users and a set of metric columns to describe cohort activity. Each row contains a permutation of attributes, an approximate set of clients in a cohort, and the aggregate engagement metrics.

This table uses the HyperLogLog (HLL) sketch to create an approximate set of clients in a cohort. HLL allows counting across overlapping cohorts in a single pass while avoiding the problem of double counting. This data-structure has the benefit of being compact and performant in the context of retention analysis, at the expense of precision. For example, calculating a 7-day retention period can be obtained by aggregating over a week of retention data using the union operation. With SQL primitive, this requires a recalculation of COUNT DISTINCT over client_id's in the 7-day window.

Background and Caveats

  1. The data starts at 2017-03-06, the merge date where Nightly started to track Firefox 55 in Mozilla-Central. However, there was not a consistent view into the behavior of first session profiles until the new_profile ping. This means much of the data is inaccurate before 2017-06-26.
  2. This dataset uses 4 day reporting latency to aggregate at least 99% of the data in a given submission date. This figure is derived from the telemetry-health measurements on submission latency, with the discussion in Bug 1407410. This latency metric was reduced Firefox 55 with the introduction of the shutdown ping-sender mechanism.
  3. Caution should be taken before adding new columns. Additional attribute columns will grow the number of rows exponentially.
  4. The number of HLL bits chosen for this dataset is 13. This means the default size of the HLL object is 2^13 bits or 1KiB. This maintains about a 1% error on average. See this table from Algebird's HLL implementation for more details.

Accessing the Data

The data is primarily available through Re:dash on STMO via the Presto source. This service has been configured to use predefined HLL functions.

The column should first be cast to the HLL type. The scalar cardinality(<hll_column>) function will approximate the number of unique items per HLL object. The aggregate merge(<hll_column>) function will perform the set union between all objects in a column.

Example: Cast the count column into the appropriate type.

SELECT cast(hll as HLL) as n_profiles_hll FROM retention

Count the number of clients seen over all attribute combinations.

SELECT cardinality(cast(hll as HLL)) FROM retention

Group-by and aggregate client counts over different release channels.

SELECT channel, cardinality(merge(cast(hll AS HLL))
FROM retention
GROUP BY channel

The HyperLogLog library wrappers are available for use outside of the configured STMO environment, spark-hyperloglog and presto-hyperloglog.

Also see the client_count_daily dataset.

Data Reference

Example Queries

See the Example Usage Dashboard for more usages of datasets of the same shape.

Scheduling

The job is scheduled on Airflow on a daily basis after main_summary is run for the day. This job requires both mozetl and telemetry-batch-view as dependencies.

Schema

As of 2017-10-10, the current version of retention is v1 and has a schema as follows:

root
 |-- subsession_start: string (nullable = true)
 |-- profile_creation: string (nullable = true)
 |-- days_since_creation: long (nullable = true)
 |-- channel: string (nullable = true)
 |-- app_version: string (nullable = true)
 |-- geo: string (nullable = true)
 |-- distribution_id: string (nullable = true)
 |-- is_funnelcake: boolean (nullable = true)
 |-- source: string (nullable = true)
 |-- medium: string (nullable = true)
 |-- content: string (nullable = true)
 |-- sync_usage: string (nullable = true)
 |-- is_active: boolean (nullable = true)
 |-- hll: binary (nullable = true)
 |-- usage_hours: double (nullable = true)
 |-- sum_squared_usage_hours: double (nullable = true)
 |-- total_uri_count: long (nullable = true)
 |-- unique_domains_count: double (nullable = true)

Code Reference

The ETL script for processing the data before aggregation is found in mozetl.engagement.retention. The aggregate job is found in telemetry-batch-view as the RetentionView.

The runner script performs all the necessary setup to run on EMR. This script can be used to perform backfill.