Working with Parquet
This guide gives a quick introduction to working with Parquet files at Mozilla. You can also refer to Spark's own documentation on Parquet.
Most of our derived datasets,
like the longitudinal
or main_summary
tables,
are stored in Parquet files.
You can query these datasets in re:dash,
but you may want to access the data from an
ATMO cluster
if SQL isn't powerful enough for your analysis
or if a sample of the data will not suffice.
Reading Parquet Tables
Spark provides native support for reading Parquet files.
The result of loading a Parquet file is a
DataFrame.
For example, you can load main_summary
with the following snippet:
# Parquet files are self-describing so the schema is preserved.
main_summary = spark.read.parquet('s3://telemetry-parquet/main_summary/v1/')
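Once loaded, the DataFrame can be inspected and queried like any other.
As a minimal sketch, you can register it as a temporary view and use Spark SQL
(normalized_channel is an assumed column name, used only for illustration):
# Inspect the schema preserved in the Parquet metadata.
main_summary.printSchema()
# Register the DataFrame as a temporary view so it can be queried with Spark SQL.
main_summary.createOrReplaceTempView('main_summary')
# Count rows per channel; normalized_channel is an assumed column name.
spark.sql("SELECT normalized_channel, COUNT(*) AS n FROM main_summary GROUP BY normalized_channel").show()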
You can find the S3 path for common datasets in Choosing a Dataset or in the reference documentation.
Writing Parquet Tables
Saving a table to Parquet is a great way to share an intermediate dataset.
Where to save data
You can save data to a subdirectory of the following bucket:
s3://net-mozaws-prod-us-west-2-pipeline-analysis/<username>/
Use your username for the subdirectory name.
This bucket is available to all ATMO clusters and Airflow.
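For example, with a hypothetical username jdoe (used here only for illustration),
an intermediate table could be written under a path like:
# Hypothetical path; substitute your own username and table name.
output_path = 's3://net-mozaws-prod-us-west-2-pipeline-analysis/jdoe/my_intermediate_table'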
When your analysis is production ready,
open a PR against python_mozetl.
How to save data
You can save the DataFrame test_dataframe
to the telemetry-test-bucket
with the following command:
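# Fail if the output path already exists ('error' is Spark's default save mode).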
test_dataframe.write.mode('error') \
.parquet('s3://telemetry-test-bucket/my_subdir/table_name')
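If you re-run an analysis regularly, failing on existing output may not be what you want.
A sketch of an alternative, which overwrites previous output
and partitions the written files by a column
(submission_date_s3 is an assumed column name, used only for illustration):
# Overwrite any existing output and partition the files by an assumed column.
test_dataframe.write.mode('overwrite') \
    .partitionBy('submission_date_s3') \
    .parquet('s3://telemetry-test-bucket/my_subdir/table_name')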
Note: data saved to s3://telemetry-test-bucket
will automatically be deleted
after 30 days.