Sending a Custom Ping
Got some new data you want to send to us? How in the world do you send a new ping? Follow this guide to find out.
Write Your Questions
Do not try to implement new pings unless you know specifically what questions you're trying to answer. General questions such as "How do users use our product?" won't cut it - these need to be specific, concrete asks that can be translated to data points. Being specific will also make things easier down the line, when you start data review.
More detail on how to design and implement new pings for Firefox Desktop can be found here.
Choose a Namespace and DocType
For new telemetry pings, the namespace is simply `telemetry`. For non-Telemetry pings, choose a namespace that uniquely identifies the product that will be generating the data.
The DocType is used to differentiate pings within a namespace. It can be as simple as `event`, but should generally be descriptive of the data being collected.
Both the namespace and DocType are limited to the pattern `[a-zA-Z-]`; in other words, hyphens and letters from the ISO basic Latin alphabet.
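As a quick sanity check (this is just a sketch, not an official tool), you can verify that a proposed namespace or DocType matches that pattern:

```python
import re

# Hyphens and ISO basic Latin letters only, per the [a-zA-Z-] pattern above.
ALLOWED = re.compile(r"[a-zA-Z-]+")

for name in ("my-product", "my-product-mobile", "my_product"):
    ok = ALLOWED.fullmatch(name) is not None
    print(name, "ok" if ok else "invalid")  # "my_product" fails: underscores are not allowed
```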
Create a Schema
Use JSON Schema to start with. See the "Adding a new schema" documentation and examples schemas in the Mozilla Pipeline Schemas repo. This schema is just used to validate the incoming data; any ping that doesn't match the schema will be removed. Validate your JSON Schema using a validation tool.
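For example, here is a minimal sketch of validating an example ping locally with the third-party `jsonschema` package; the schema and ping below are made up for illustration, not taken from the repo:

```python
import jsonschema

# A hypothetical draft-4 schema describing a simple event ping.
schema = {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "object",
    "properties": {
        "client_id": {"type": "string"},
        "event": {"type": "string"},
        "ts": {"type": "integer"},
    },
    "required": ["client_id", "event"],
}

example_ping = {"client_id": "test-client-1", "event": "page_load", "ts": 1517270400}

# Raises jsonschema.exceptions.ValidationError if the ping does not match.
jsonschema.validate(example_ping, schema)
```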
We already have automatic deduplication based on the ping's `docId`, which catches about 90% of duplicates and removes them from the dataset.
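Conceptually, that deduplication keeps only the first ping seen for each `docId`; a toy sketch (not actual pipeline code):

```python
def deduplicate(pings):
    """Keep the first ping seen for each docId and drop the rest."""
    seen = set()
    unique = []
    for ping in pings:
        if ping["docId"] not in seen:
            seen.add(ping["docId"])
            unique.append(ping)
    return unique

pings = [{"docId": "a", "v": 1}, {"docId": "a", "v": 1}, {"docId": "b", "v": 2}]
assert len(deduplicate(pings)) == 2
```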
Start a Data Review
Data review for new pings is more complicated than when adding new probes. See Data Review for Focus-Event Ping as an example. Consider where the data falls in the Data Collection Categories.
Submit Schema to mozilla-services/mozilla-pipeline-schemas
The first schema added should be the JSON Schema created in the "Create a Schema" step above. Add at least one example ping against which the data can be validated. These test pings will be validated automatically during the build.
Additionally, a Parquet output schema should be added. This adds a new dataset, available in Re:dash. The best documentation we have for the Parquet schema is the set of examples in mozilla-pipeline-schemas.
Parquet output also has a `metadata` section. These are fields added to the ping at ingestion time; they might come from the URL submitted to the edge server, or the IP address used to make the request. This document lists the available metadata fields for all pings. The stream you're interested in is probably `telemetry`.
For example, look at `system-addon-deployment-diagnostics` immediately under the `telemetry` top-level field. The `schema` element has top-level fields (e.g. `Timestamp`, `Type`), as well as more fields under the `Fields` element. Any of these can be used in the `metadata` section of your Parquet schema, except for `submission`.
Some common ones for Telemetry data might be:
- `Date`
- `submissionDate`
- `geoCountry`
- `geoCity`
- `geoSubdivision1`
- `geoSubdivision2`
- `normalizedChannel`
- `appVersion`
- `appBuildId`
And for non-Telemetry data:
- `geoCountry`
- `geoCity`
- `geoSubdivision1`
- `geoSubdivision2`
- `documentId`
Important Note: Schema evolution of nested structs is currently broken, so you will not be able to add any fields to your `metadata` section in the future. We recommend adding any that may seem useful.
Testing The Schema
For new data, use the edge validator to test your schema.
If your data is already being sent and you want to test the schema you're writing against the data that is currently being ingested, you can test your Parquet output in Hindsight by using an output plugin. See the Core ping output plugin for an example, where the Parquet schema is specified as `parquet_schema`. If no errors arise, the schema should be correct. Do not use the "Deploy" button to actually deploy; that will be done by operations in the next step.
(Telemetry-specific) Deploy the Plugin
File a bug to deploy the new schema.
Real-time analysis will be key to ensuring your data is being processed and parsed correctly. It should follow the format specified in the MozTelemetry `docType` monitor. This allows you to check validation errors, size changes, duplicates, and more. Once you have the numbers set, file a bug to let ops deploy it.
Start Sending Data
If you're using Telemetry, use the built-in APIs: the Gecko Telemetry APIs, the Android Telemetry APIs, or the iOS Telemetry APIs.
For non-Telemetry data, see our HTTP edge server specification, and specifically the non-Telemetry example, for the expected format. The edge server endpoint is https://incoming.telemetry.mozilla.org.
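For illustration, here is a minimal sketch of submitting a non-Telemetry ping with Python's `requests`, assuming the `/submit/<namespace>/<doctype>/<docversion>/<docid>` URI layout described in the edge server specification; the namespace, DocType, and payload below are placeholders:

```python
import json
import uuid

import requests

namespace = "my-namespace"  # placeholder namespace
doctype = "event"           # placeholder DocType
docversion = 1
docid = str(uuid.uuid4())   # unique per ping; used for deduplication downstream

url = (
    "https://incoming.telemetry.mozilla.org"
    f"/submit/{namespace}/{doctype}/{docversion}/{docid}"
)
ping = {"client_id": "test-client-1", "event": "page_load", "ts": 1517270400}

resp = requests.post(url, data=json.dumps(ping),
                     headers={"Content-Type": "application/json"})
resp.raise_for_status()  # a 2xx response means the edge server accepted the ping
```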
(Non-Telemetry) Access Your Data
First confirm with the reviewers of your schema pull request that your schemas have been deployed.
In the following links, replace `<namespace>`, `<doctype>`, and `<docversion>` with the appropriate values. Also replace `-` with `_` in `<namespace>` if your namespace contains `-` characters.
CEP
Once you've sent some pings, refer to the following real-time analysis plugins to verify that your data is being processed:
https://pipeline-cep.prod.mozaws.net/dashboard_output/graphs/analysis.moz_generic_error_monitor.<namespace>.html
https://pipeline-cep.prod.mozaws.net/dashboard_output/analysis.moz_generic_<namespace>_<doctype>_<docversion>.submissions.json
https://pipeline-cep.prod.mozaws.net/dashboard_output/analysis.moz_generic_<namespace>_<doctype>_<docversion>.errors.txt
If this first graph shows ingestion errors, you can view the corresponding error messages in the third link. Otherwise, you should be able to view the last ten processed submissions via the second link. You can also write your own custom real-time analysis plugins using this same infrastructure if you desire; use the above plugins as examples and see here for a more detailed explanation.
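As an illustration, a small sketch that fills in those URL templates for a hypothetical ping, including the hyphen-to-underscore substitution noted above:

```python
namespace = "my-namespace".replace("-", "_")  # hyphens become underscores in the URLs
doctype = "event"
docversion = 1

base = "https://pipeline-cep.prod.mozaws.net/dashboard_output"
print(f"{base}/graphs/analysis.moz_generic_error_monitor.{namespace}.html")
print(f"{base}/analysis.moz_generic_{namespace}_{doctype}_{docversion}.submissions.json")
print(f"{base}/analysis.moz_generic_{namespace}_{doctype}_{docversion}.errors.txt")
```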
If you encounter schema validation errors, you can fix your data or submit another pull request to amend your schemas. Backwards-incompatible schema changes should generally be accompanied by an increment to `docversion`.
Once you've established that your pings are flowing through the real-time system, verify that you can access the data from the downstream systems.
STMO
In the Athena data source, a new table `<namespace>_<doctype>_parquet_<docversion>` will be created for your data. A convenience pointer, `<namespace>_<doctype>_parquet`, will also refer to the latest available `docversion` of the ping. The data is partitioned by `submission_date_s3`, which is formatted as `%Y%m%d` (for example, `20180130`), and is generally updated hourly. Refer to the STMO documentation for general information about using Re:dash.
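For example (a sketch with placeholder values; the table itself is created by the pipeline, not by you), the table name and a partition filter value line up like this, assuming hyphens in the namespace become underscores as in the URLs above:

```python
from datetime import date

namespace = "my-namespace".replace("-", "_")
doctype = "event"
docversion = 1

table = f"{namespace}_{doctype}_parquet_{docversion}"  # my_namespace_event_parquet_1
partition = date(2018, 1, 30).strftime("%Y%m%d")       # submission_date_s3 value: '20180130'
print(table, partition)
```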
This table may take up to a day to appear in the Athena source; if you still don't see a table for your new ping after 24 hours, contact Data Operations so that they can investigate. Once the table is available, it should contain all the pings sent during that first day, regardless of how long it takes for the table to appear.
ATMO
The data should be available in S3 at:
s3://net-mozaws-prod-us-west-2-pipeline-data/<namespace>-<doctype>-parquet/v<docversion>/
Note: here, `<namespace>` should not be escaped.
Refer to the Spark FAQ for details on accessing this table via ATMO.
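For instance, a minimal PySpark sketch (e.g. from an ATMO cluster notebook) for reading that location; the namespace, DocType, and version in the path are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

path = "s3://net-mozaws-prod-us-west-2-pipeline-data/my-namespace-event-parquet/v1/"
df = spark.read.parquet(path)

df.printSchema()
print(df.where(df.submission_date_s3 == "20180130").count())
```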
Write ETL Jobs
We have some basic generalized ETL jobs you can use to transform your data on a batch basis - for example, to build a Longitudinal or client-count-daily-like dataset. Otherwise, you'll have to write your own.
You can schedule the job on Airflow, or you can run it as a job in ATMO. If the output is Parquet, you can add it to the Hive metastore to have it available in Re:dash. Check the docs on creating your own datasets.
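As a rough sketch of what such a job might look like in PySpark (the column names, the `metadata.geoCountry` field, and the output location are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

pings = spark.read.parquet(
    "s3://net-mozaws-prod-us-west-2-pipeline-data/my-namespace-event-parquet/v1/"
)

# Aggregate daily ping counts per country (assumes a metadata.geoCountry field).
daily_counts = (
    pings.groupBy("submission_date_s3", F.col("metadata.geoCountry").alias("country"))
         .agg(F.count("*").alias("ping_count"))
)

# Write Parquet partitioned by day so it can be registered in the Hive metastore.
daily_counts.write.mode("overwrite").partitionBy("submission_date_s3").parquet(
    "s3://my-output-bucket/my-namespace-event-daily-counts/v1/"
)
```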
Build Dashboards Using ATMO or STMO
Last steps! What are you using this data for anyway?