Client Count Daily Reference
This document is a work in progress. The work is being tracked here.
Introduction
The client_count_daily
dataset is useful for estimating user counts over a few
pre-defined dimensions.
The client_count_daily
dataset is similar to the deprecated
client_count
dataset
except that is aggregated by submission date and not activity date.
Content
This dataset includes columns for a dozen factors and an HLL variable.
The hll
column contains a
HyperLogLog
variable, which is an approximation to the exact count.
The factor columns include submission date and the dimensions listed
here.
Each row represents one combinations of the factor columns.
Background and Caveats
It's important to understand that the hll
column is not a standard count.
The hll
variable avoids double-counting users when aggregating over multiple days.
The HyperLogLog variable is a far more efficient way to count distinct elements of a set,
but comes with some complexity.
To find the cardinality of an HLL use cardinality(cast(hll AS HLL))
.
To find the union of two HLL's over different dates, use merge(cast(hll AS HLL))
.
The Firefox ER Reporting Query
is a good example to review.
Finally, Roberto has a relevant write-up
here.
Accessing the Data
The data is available in Re:dash. Take a look at this example query.
I don't recommend accessing this data from ATMO.
Further Reading
Data Reference
Example Queries
Compute DAU for non-windows clients for the last week
WITH sample AS (
SELECT
os,
submission_date,
cardinality(merge(cast(hll AS HLL))) AS count
FROM client_count_daily
WHERE submission_date >= DATE_FORMAT(CURRENT_DATE - INTERVAL '7' DAY, '%Y%m%d')
GROUP BY
submission_date,
os
)
SELECT
os,
-- formatting date as late as possible improves performance dramatically
date_parse(submission_date, '%Y%m%d') AS submission_date,
count
FROM sample
WHERE
count > 10 -- remove outliers
AND lower(os) NOT LIKE '%windows%'
ORDER BY
os,
submission_date DESC
Compute WAU by Channel for the last week
WITH dau AS (
SELECT
normalized_channel,
submission_date,
merge(cast(hll AS HLL)) AS hll
FROM client_count_daily
-- 2 days of lag, 7 days of results, and 6 days preceding for WAU
WHERE submission_date > DATE_FORMAT(CURRENT_DATE - INTERVAL '15' DAY, '%Y%m%d')
GROUP BY
submission_date,
normalized_channel
),
wau AS (
SELECT
normalized_channel,
submission_date,
cardinality(merge(hll) OVER (
PARTITION BY normalized_channel
ORDER BY submission_date
ROWS BETWEEN 6 PRECEDING AND 0 FOLLOWING
)) AS count
FROM dau
)
SELECT
normalized_channel,
-- formatting date as late as possible improves performance dramatically
date_parse(submission_date, '%Y%m%d') AS submission_date,
count
FROM wau
WHERE
count > 10 -- remove outliers
AND submission_date > DATE_FORMAT(CURRENT_DATE - INTERVAL '9' DAY, '%Y%m%d') -- only days that have a full WAU
Caveats
The hll
column does not product an exact count. hll
stands for
HyperLogLog, a sophisticated
algorithm that allows for the counting of extremely high numbers of items,
sacrificing a small amount of accuracy in exchange for using much less memory
than a simple counting structure.
When count is calculated over a column that may change over time, such as
total_uri_count_threshold
, then a client would be counted in every group
where they appear. Over longer windows, like MAU, this is more likely to occur.
Scheduling
This dataset is updated daily via the
telemetry-airflow infrastructure.
The job runs as part of the main_summary
DAG.
Schema
The data is partitioned by submission_date
which is formatted as %Y%m%d
,
like 20180130
.
As of 2018-03-15, the current version of the client_count_daily
dataset
is v2
, and has a schema as follows:
root
|-- app_name: string (nullable = true)
|-- app_version: string (nullable = true)
|-- country: string (nullable = true)
|-- devtools_toolbox_opened: boolean (nullable = true)
|-- e10s_enabled: boolean (nullable = true)
|-- hll: binary (nullable = true)
|-- locale: string (nullable = true)
|-- normalized_channel: string (nullable = true)
|-- os: string (nullable = true)
|-- os_version: string (nullable = true)
|-- top_distribution_id: string (nullable = true)
|-- total_uri_count_threshold: integer (nullable = true)