SSL Ratios

Introduction

The public SSL dataset publishes the percentage of page loads Firefox users have performed that were conducted over SSL. This dataset is used to produce graphs like Let's Encrypt's to determine SSL adoption on the Web over time.

Content

The public SSL dataset is a table where each row is a distinct set of dimensions, with their associated SSL statistics. The dimensions are submission_date, os, and country. The statistics are reporting_ratio, normalized_pageloads, and ratio.

Background and Caveats

  • We're using normalized values in normalized_pageloads to obscure absolute page load counts.
  • This is across the entirety of release, not per-version, because we're looking at Web health, not Firefox user health.
  • Any dimension tuple (any given combination of submission_date, os, and country) with fewer than 5000 page loads is omitted from the dataset.
  • This is hopefully just a temporary dataset to stopgap release aggregates going away until we can come up with a better way to publicly publish datasets.

Accessing the Data

For details on accessing the data, please look at bug 1414839.

Data Reference

Combining Rows

This is a dataset of ratios. You can't combine ratios if they have different bases. For example, if 50% of 10 loads (5 loads) were SSL and 5% of 20 loads (1 load) were SSL, you cannot calculate that 20% (6 loads) of the total loads (30 loads) were SSL unless you know that the 50% was for 10 and the 5% was for 20.

If you're reluctant, for product reasons, to share the numbers 10 and 20, this gets tricky.

So what we've done is normalize the whole batch of 30 down to 1. That means we tell you that 50% of one-third of the loads (0.333...) was SSL and 5% of the other two-thirds of the loads (0.666...) was SSL. Then you can figure out the overall 20% figure by this calculation:

0.5 * 0.333 + 0.05 * 0.666 = 0.2

For this dataset the same rule applies. To combine rows' ratios (to, for example, see what the SSL ratio was across all os and country for a given submission_date), you must first multiply them by the rows' normalized_pageviews values.

Or, in JavaScript:

let rows = query_result.data.rows;
let ratioForDateInQuestion = rows
  .filter(row => row.submission_date == dateInQuestion)
  .reduce((row, acc) => acc + row.normalized_pageloads * row.ratio, 0);

Schema

The data is output in re:dash API format:

"query_result": {
  "retrieved_at": <timestamp>,
  "query_hash": <hash>,
  "query": <SQL>,
  "runtime": <number of seconds>,
  "id": <an id>,
  "data_source_id": 26, // Athena
  "data_scanned": <some really large number, as a string>,
  "data": {
    "data_scanned": <some really large number, as a number>,
    "columns": [
      {"friendly_name": "submission_date", "type": "datetime", "name": "submission_date"},
      {"friendly_name": "os", "type": "string", "name": "os"},
      {"friendly_name": "country", "type": "string", "name": "country"},
      {"friendly_name": "reporting_ratio", "type": "float", "name": "reporting_ratio"},
      {"friendly_name": "normalized_pageloads", "type": "float", "name": "normalized_pageloads"},
      {"friendly_name": "ratio", "type": "float", "name": "ratio"}
    ],
    "rows": [
      {
        "submission_date": "2017-10-24T00:00:00", // date string, day resolution
        "os": "Windows_NT", // operating system family of the clients reporting the pageloads. One of "Windows_NT", "Linux", or "Darwin".
        "country": "CZ", // ISO 639 two-character country code, or "??" if we have no idea. Determined by performing a geo-IP lookup of the clients that submitted the pings.
        "reporting_ratio": 0.006825266611977031, // the ratio of pings that reported any pageloads at all. A number between 0 and 1. See [bug 1413258](https://bugzilla.mozilla.org/show_bug.cgi?id=1413258).
        "normalized_pageloads": 0.00001759145263985348, // the proportion of total pageloads in the dataset that are represented by this row. Provided to allow combining rows. A number between 0 and 1.
        "ratio": 0.6916961976822144 // the ratio of the pageloads that were performed over SSL. A number between 0 and 1.
      }, ...
    ]
  }
}

Scheduling

The dataset updates every 24 hours.

Code Reference

You can find the query that generates the SSL dataset here.