Heavy Users
As of 2018-05-18, this dataset has been deprecated and is no longer maintained. See Bug 1455314.
Replacement
We've moved to assigning users an active tag based on total_uri_count; see the Active DAU definition.
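As a rough sketch of that replacement tag (the total_uri_count_sum column name and the >= 5 threshold are assumptions here; the linked Active DAU definition is authoritative):

SELECT client_id,
       submission_date_s3,
       -- Assumed column name and threshold; see the Active DAU definition
       scalar_parent_browser_engagement_total_uri_count_sum >= 5 AS active
FROM clients_daily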
The activity of a user based on active_ticks is available in clients_daily in the active_hours_sum field, which is computed as sum(active_ticks / 720) (each active tick represents a 5-second interval of activity, so 720 ticks equal one hour).
To retrieve a client's 28-day active_hours, use the following query:
SELECT submission_date_s3,
       client_id,
       SUM(active_hours_sum) OVER (PARTITION BY client_id
                                   ORDER BY submission_date_s3 ASC
                                   ROWS 27 PRECEDING) AS monthly_active_hours
FROM clients_daily
Introduction
The heavy_users table provides information about whether a given client_id is considered a "heavy user" on each day (using submission date).
Contents
The heavy_users table contains one row per client-day, where day is submission_date. A client has a row for a specific submission_date if they were active at all in the 28-day window ending on that submission_date.
A user is a "heavy user" as of day N if, for the 28-day period ending on day N, the sum of their active_ticks is in the 90th percentile (or above) of all clients during that period. For more analysis on this, and a discussion of new profiles, see this link.
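That cutoff can be approximated from the table itself. Below is a minimal sketch (not the production job) that recomputes the 90th-percentile threshold for a single day, assuming active_ticks_period holds each client's 28-day active_ticks sum (an inference from the schema below):

-- Sketch: recompute the heavy-user cutoff for one submission date.
-- Assumes active_ticks_period is the client's 28-day active_ticks sum.
WITH cutoff AS (
  SELECT approx_percentile(active_ticks_period, 0.9) AS p90
  FROM heavy_users
  WHERE submission_date_s3 = '20171001'
)
SELECT h.client_id,
       h.active_ticks_period >= c.p90 AS heavy_user
FROM heavy_users h
CROSS JOIN cutoff c
WHERE h.submission_date_s3 = '20171001'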
Background and Caveats
- Data starts at 20170801. There is technically data in the table before this, but the heavy_user column is NULL for those dates because it needed to bootstrap the first 28-day window.
- Because it is the top 10% of clients for each 28-day period, more than 10% of clients active on a given submission_date will be considered heavy users (heavy users are, by definition, more likely than average to be active on any given day). If you join with another data source (main_summary, for example), you may see a larger proportion of heavy users than expected.
- Each day has a separate, but related, set of heavy users. Initial investigations show that approximately 97.5% of heavy users as of a certain day are still considered heavy users as of the next day.
- There is no "fixing" or weighting of new profiles - days before the profile was created are counted as zero active_ticks. Analyses may need to use the included profile_creation_date field to take this into account; see the sketch after this list.
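For example, a minimal sketch of handling the last caveat by excluding profiles too new to have a full 28-day window (assuming profile_creation_date counts days since the Unix epoch, as is common in telemetry):

-- Sketch: keep only rows where the profile existed for the full
-- 28-day window ending on submission_date_s3.
-- Assumes profile_creation_date is days since 1970-01-01.
SELECT *
FROM heavy_users
WHERE date_parse(submission_date_s3, '%Y%m%d')
      >= date_add('day', profile_creation_date + 27, timestamp '1970-01-01')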
Accessing the Data
The data is available both via sql.t.m.o and Spark.
In Spark:
spark.read.parquet("s3://telemetry-parquet/heavy_users/v1")
In SQL:
SELECT * FROM heavy_users LIMIT 3
Further Reading
The code responsible for generating this dataset is here.
Data Reference
Example Queries
- Join heavy_users with main_summary to get the distribution of max_concurrent_tab_count for heavy vs. non-heavy users (see the sketch after this list)
- Join heavy_users with longitudinal to get crash rates for heavy vs. non-heavy users
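A minimal sketch of the first example (the max_concurrent_tab_count column name on main_summary is an assumption; check its schema for the exact name):

-- Sketch: median max concurrent tab count for heavy vs. non-heavy
-- users on one day. Note main_summary has one row per main ping,
-- so the median here is over pings, not clients.
SELECT h.heavy_user,
       approx_percentile(m.max_concurrent_tab_count, 0.5) AS median_max_tabs
FROM heavy_users h
JOIN main_summary m
  ON h.client_id = m.client_id
 AND h.submission_date_s3 = m.submission_date_s3
WHERE h.submission_date_s3 = '20171001'
GROUP BY h.heavy_user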
Scheduling
This dataset is updated daily via the telemetry-airflow infrastructure. The job DAG runs every day after main_summary is complete. You can find the job definition here.
Schema
As of 2017-10-05, the current version of the heavy_users dataset is v1, and has a schema as follows:
root
|-- client_id: string (nullable = true)
|-- sample_id: integer (nullable = true)
|-- profile_creation_date: long (nullable = true)
|-- active_ticks: long (nullable = true)
|-- active_ticks_period: long (nullable = true)
|-- heavy_user: boolean (nullable = true)
|-- prev_year_heavy_user: boolean (nullable = true)
|-- submission_date_s3: string (nullable = true)
Code Reference
This dataset is generated by telemetry-batch-view. Refer to this repository for information on how to run or augment the dataset.