Heavy Users

As of 2018-05-18, this dataset has been deprecated and is no longer maintained. See Bug 1455314.

Replacement

We've moved to assigning users an active tag based on total_uri_count; see the Active DAU definition.

A user's activity based on active_ticks is available in clients_daily in the active_hours_sum field, which is computed as sum(active_ticks / 720). Each active tick represents 5 seconds of activity, so 720 ticks correspond to one hour.

To retrieve a client's 28-day active_hours, use the following query:

SELECT submission_date_s3,
       client_id,
       SUM(active_hours_sum) OVER (PARTITION BY client_id
                                   ORDER BY submission_date_s3 ASC
                                   ROWS 27 PRECEDING) AS monthly_active_hours
FROM
    clients_daily
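
Note that ROWS 27 PRECEDING counts a client's previous 27 rows rather than calendar days, so the window spans exactly 28 days only for clients with a row on every day. To pull the rolling total for a single day, one option is to wrap the window in a subquery. A minimal sketch, with an illustrative date:

SELECT client_id,
       monthly_active_hours
FROM
    (SELECT submission_date_s3,
            client_id,
            SUM(active_hours_sum) OVER (PARTITION BY client_id
                                        ORDER BY submission_date_s3 ASC
                                        ROWS 27 PRECEDING) AS monthly_active_hours
     FROM clients_daily) AS t
WHERE submission_date_s3 = '20180401'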

Introduction

The heavy_users table provides information about whether a given client_id is considered a "heavy user" on each day (using submission date).

Contents

The heavy_users table contains one row per client-day, where day is the submission_date. A client has a row for a specific submission_date if they were active at all in the 28-day window ending on that submission_date.

A user is a "heavy user" as of day N if, for the 28-day period ending on day N, the sum of their active_ticks is at or above the 90th percentile of all clients active during that period. For more analysis, and a discussion of new profiles, see this link.
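
The production implementation lives in telemetry-batch-view, but the definition can be sketched against the table itself. Assuming active_ticks_period holds a client's 28-day active_ticks sum (see the schema below), the cutoff for a given day can be approximated as follows (approx_percentile is Presto syntax; the date is illustrative):

-- Approximate the active_ticks cutoff that separates heavy users
-- for the 28-day window ending on the given day
SELECT approx_percentile(active_ticks_period, 0.9) AS heavy_user_cutoff
FROM heavy_users
WHERE submission_date_s3 = '20171001'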

Background and Caveats

  1. Data starts at 20170801. There is technically data in the table before this, but the heavy_user column is NULL for those dates because the first 28-day window was still being bootstrapped.
  2. Because it is the top 10% of clients for each 28-day period, more than 10% of the clients active on any given submission_date will be heavy users: heavy users are, by definition, more likely than average to be active on any single day. If you join with another data source (main_summary, for example), you may see a larger proportion of heavy users than expected.
  3. Each day has a separate, but related, set of heavy users. Initial investigations show that approximately 97.5% of heavy users as of a certain day are still considered heavy users as of the next day.
  4. There is no "fixing" or weighting of new profiles: days before the profile was created are counted as zero active_ticks. Analyses may need to use the included profile_creation_date field to take this into account; see the sketch after this list.
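
For example, to keep only profiles old enough to have a full window, one approach is to filter on profile age. A minimal sketch, assuming profile_creation_date is stored as days since the Unix epoch (as in other telemetry datasets) and using illustrative dates:

SELECT client_id,
       heavy_user
FROM heavy_users
WHERE submission_date_s3 = '20171001'
  -- keep only profiles created before the start of the 28-day
  -- window ending on 20171001
  AND DATE_ADD('day', profile_creation_date, DATE '1970-01-01') < DATE '2017-09-04'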

Accessing the Data

The data is available both via sql.telemetry.mozilla.org (sql.t.m.o) and Spark.

In Spark:

spark.read.parquet("s3://telemetry-parquet/heavy_users/v1")
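
For example, a sketch of typical usage with an illustrative date:

heavy = spark.read.parquet("s3://telemetry-parquet/heavy_users/v1")
# Count the clients flagged as heavy users on one day
heavy.where("submission_date_s3 = '20171001' AND heavy_user").count()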

In SQL:

SELECT * FROM heavy_users LIMIT 3

Further Reading

The code responsible for generating this dataset is in the telemetry-batch-view repository.

Data Reference

Example Queries

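For example, a minimal query counting the number of heavy users on each day:

SELECT submission_date_s3,
       COUNT(*) AS num_heavy_users
FROM heavy_users
WHERE heavy_user
GROUP BY submission_date_s3
ORDER BY submission_date_s3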

Scheduling

This dataset is updated daily via the telemetry-airflow infrastructure. The job DAG runs every day after main_summary is complete. You can find the job definition here.

Schema

As of 2017-10-05, the current version of the heavy_users dataset is v1, and has a schema as follows:

root
 |-- client_id: string (nullable = true)
 |-- sample_id: integer (nullable = true)
 |-- profile_creation_date: long (nullable = true)
 |-- active_ticks: long (nullable = true)
 |-- active_ticks_period: long (nullable = true)
 |-- heavy_user: boolean (nullable = true)
 |-- prev_year_heavy_user: boolean (nullable = true)
 |-- submission_date_s3: string (nullable = true)
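
As a sketch of how these fields combine, the day-over-day overlap mentioned in Background and Caveats can be estimated with a self-join (the dates are illustrative):

-- Fraction of one day's heavy users still flagged as heavy the next day
SELECT AVG(IF(COALESCE(b.heavy_user, FALSE), 1.0, 0.0)) AS still_heavy_fraction
FROM heavy_users a
LEFT JOIN heavy_users b
  ON a.client_id = b.client_id
 AND b.submission_date_s3 = '20171002'
WHERE a.submission_date_s3 = '20171001'
  AND a.heavy_user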

Code Reference

This dataset is generated by telemetry-batch-view. Refer to this repository for information on how to run or augment the dataset.