1 Day Retention
Introduction
The retention table provides client counts relevant to client retention at a
1-day granularity. The project is tracked in Bug 1381840.
Contents
The retention table contains a set of attribute columns used to specify a
cohort of users and a set of metric columns to describe cohort activity. Each
row contains one combination of attribute values, an approximate set of the
clients in that cohort, and the aggregate engagement metrics.
This table uses the HyperLogLog (HLL) sketch to create an approximate set of
clients in a cohort. HLL allows counting across overlapping cohorts in a single
pass while avoiding the problem of double counting. This data structure has the
benefit of being compact and performant in the context of retention analysis,
at the expense of precision. For example, a 7-day retention figure can be
computed by aggregating a week of retention data with the union operation. With
plain SQL primitives, the same result requires recalculating COUNT DISTINCT
over the client_id values in the 7-day window.
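The union property described above can be illustrated with a toy sketch in Python. This is a minimal, assumption-laden HyperLogLog written for this page, not the implementation used by the dataset: the `ToyHLL` class, the SHA-1 based hash, and the synthetic `client-N` identifiers are all hypothetical. It builds one sketch per day for seven overlapping days, merges them, and compares the merged estimate with the naive sum of the daily counts.

```python
import hashlib
import math


class ToyHLL:
    """Minimal HyperLogLog sketch, for illustration only."""

    def __init__(self, bits=13):
        self.bits = bits
        self.m = 1 << bits              # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        # Derive a 64-bit hash from the item (hash choice is an assumption).
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        width = 64 - self.bits
        idx = h >> width                # top `bits` bits select a register
        rest = h & ((1 << width) - 1)   # remaining bits
        # Rank: position of the leftmost 1-bit in the remaining bits.
        rank = width - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        """Set union: take the register-wise maximum."""
        out = ToyHLL(self.bits)
        out.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return out

    def cardinality(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


# Seven daily sketches; each day's clients overlap heavily with the next day's.
days = []
for d in range(7):
    sketch = ToyHLL()
    for client in range(d * 1000, d * 1000 + 5000):
        sketch.add(f"client-{client}")
    days.append(sketch)

# Weekly cohort: union of the daily sketches, no raw client ids needed.
weekly = days[0]
for sketch in days[1:]:
    weekly = weekly.merge(sketch)

exact = 11000                            # distinct ids in range(0, 11000)
estimate = weekly.cardinality()
naive_sum = sum(s.cardinality() for s in days)

print(f"exact={exact} merged estimate={estimate:.0f} naive sum={naive_sum:.0f}")
```

Summing the daily estimates counts returning clients once per day, while the merged sketch counts each client once. The merge only touches the fixed-size registers, which is why a weekly figure does not require rescanning client_id values.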
Background and Caveats
- The data starts at 2017-03-06, the merge date on which Nightly started to track Firefox 55 in Mozilla-Central. However, there was no consistent view into the behavior of first-session profiles until the new_profile ping was introduced. This means much of the data before 2017-06-26 is inaccurate.
- This dataset uses a 4-day reporting latency to aggregate at least 99% of the data for a given submission date. This figure is derived from the telemetry-health measurements on submission latency, with the discussion in Bug 1407410. This latency was reduced in Firefox 55 with the introduction of the shutdown ping-sender mechanism.
- Caution should be taken before adding new columns. Additional attribute columns will grow the number of rows exponentially.
- The number of HLL bits chosen for this dataset is 13. This means the default size of the HLL object is 2^13 bits or 1KiB. This maintains about a 1% error on average. See this table from Algebird's HLL implementation for more details.
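As a rough check on the last item above, the figures can be reproduced with a few lines of Python. This sketch assumes the standard HyperLogLog relative-error estimate of 1.04 / sqrt(m) for m registers, and takes the size statement in the text (2^13 bits) at face value; the actual memory footprint depends on the register width used by a given implementation. The helper name `hll_std_error` is hypothetical.

```python
import math


def hll_std_error(bits):
    """Approximate relative standard error for an HLL with 2**bits registers."""
    return 1.04 / math.sqrt(2 ** bits)


bits = 13
size_kib = (2 ** bits) / 8 / 1024   # 2^13 bits expressed in KiB, as stated in the text
error = hll_std_error(bits)

print(f"bits={bits} size={size_kib} KiB relative error={error:.4f}")

# Each additional bit doubles the sketch size and shrinks the error by sqrt(2).
for b in (11, 12, 13, 14):
    print(b, f"{hll_std_error(b):.4f}")
```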
Accessing the Data
The data is primarily available through Re:dash on STMO via the Presto source. This service has been configured to use predefined HLL functions.
The column should first be cast to the HLL type. The scalar
cardinality(<hll_column>) function will approximate the number of unique
items per HLL object. The aggregate merge(<hll_column>) function will perform
the set union between all objects in a column.
Example: Cast the count column into the appropriate type.
SELECT cast(hll as HLL) as n_profiles_hll FROM retention
Count the number of clients seen in each attribute combination (one estimate per row).
SELECT cardinality(cast(hll as HLL)) FROM retention
Group-by and aggregate client counts over different release channels.
SELECT channel, cardinality(merge(cast(hll AS HLL)))
FROM retention
GROUP BY channel
The HyperLogLog library wrappers are available for use outside of the
configured STMO environment: spark-hyperloglog and presto-hyperloglog.
Also see the client_count_daily dataset.
Data Reference
Example Queries
See the Example Usage Dashboard for more usages of datasets of the same shape.
Scheduling
The job is scheduled on Airflow on a daily basis after main_summary is run
for the day. This job requires both mozetl and telemetry-batch-view as
dependencies.
Schema
As of 2017-10-10, the current version of retention is v1 and has a schema
as follows:
root
|-- subsession_start: string (nullable = true)
|-- profile_creation: string (nullable = true)
|-- days_since_creation: long (nullable = true)
|-- channel: string (nullable = true)
|-- app_version: string (nullable = true)
|-- geo: string (nullable = true)
|-- distribution_id: string (nullable = true)
|-- is_funnelcake: boolean (nullable = true)
|-- source: string (nullable = true)
|-- medium: string (nullable = true)
|-- content: string (nullable = true)
|-- sync_usage: string (nullable = true)
|-- is_active: boolean (nullable = true)
|-- hll: binary (nullable = true)
|-- usage_hours: double (nullable = true)
|-- sum_squared_usage_hours: double (nullable = true)
|-- total_uri_count: long (nullable = true)
|-- unique_domains_count: double (nullable = true)
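The paired usage_hours and sum_squared_usage_hours columns suggest that a mean and standard deviation can be derived from the aggregates alone. The Python sketch below shows that arithmetic under loudly stated assumptions that the source does not confirm: usage_hours is taken to be the sum of per-client hours in a row, sum_squared_usage_hours the sum of their squares, and n_clients (a hypothetical name) the client count obtained from the HLL column. The `usage_stats` helper is also hypothetical.

```python
import math


def usage_stats(usage_hours, sum_squared_usage_hours, n_clients):
    """Mean and population standard deviation from summed aggregates.

    Assumes usage_hours = sum(x) and sum_squared_usage_hours = sum(x**2)
    over the clients in the cohort (an assumption, not confirmed by the source).
    """
    mean = usage_hours / n_clients
    # Var(x) = E[x^2] - E[x]^2; clamp at zero to absorb rounding error.
    variance = max(sum_squared_usage_hours / n_clients - mean ** 2, 0.0)
    return mean, math.sqrt(variance)


# Two made-up rows: clients with 1h and 2h, and clients with 3h and 4h.
row_a = {"usage_hours": 3.0, "sum_squared_usage_hours": 5.0, "n_clients": 2}
row_b = {"usage_hours": 7.0, "sum_squared_usage_hours": 25.0, "n_clients": 2}

# Sums are additive across rows. Adding n_clients is only valid here because
# the toy cohorts are disjoint; on real data the count would come from
# cardinality(merge(...)) over the hll column instead.
combined = {key: row_a[key] + row_b[key] for key in row_a}
mean, std = usage_stats(**combined)

print(f"mean={mean:.3f} std={std:.3f}")
```

Because the client count is itself an HLL estimate, any statistic derived this way inherits the sketch's approximation error.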
Code Reference
The ETL script for processing the data before aggregation is found in
mozetl.engagement.retention. The aggregate job is found in
telemetry-batch-view as the RetentionView.
The runner script performs all the necessary setup to run on EMR. This script can be used to perform backfill.