Working with Parquet

This guide is a quick introduction to working with Parquet files at Mozilla. You can also refer to Spark's documentation on reading and writing Parquet files.

Most of our derived datasets, like the longitudinal or main_summary tables, are stored in Parquet files. You can access these datasets in re:dash, but you may want to access the data from an ATMO cluster if SQL isn't powerful enough for your analysis or if a sample of the data will not suffice.

Reading Parquet Tables

Spark provides native support for reading Parquet files. The result of loading a Parquet file is a DataFrame. For example, you can load main_summary with the following snippet:

# Parquet files are self-describing so the schema is preserved.
main_summary = spark.read.parquet('s3://telemetry-parquet/main_summary/v1/')
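Once loaded, the result behaves like any other Spark DataFrame. As a minimal sketch of exploring it (the column names used below, such as normalized_channel and client_id, are assumptions about the schema and may differ):

# Inspect the schema preserved in the Parquet files.
main_summary.printSchema()

# Hypothetical example: the column names are assumptions about the schema.
release_clients = main_summary \
    .where(main_summary.normalized_channel == 'release') \
    .select('client_id', 'submission_date_s3')
release_clients.show(5)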

You can find the S3 path for common datasets in Choosing a Dataset or in the reference documentation.

Writing Parquet Tables

Saving a table to Parquet is a great way to share an intermediate dataset.

Where to save data

You can save data to a subdirectory of the following bucket, using your username as the subdirectory name: s3://net-mozaws-prod-us-west-2-pipeline-analysis/<username>/. This bucket is available to all ATMO clusters and to Airflow.
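For example, a minimal sketch of saving an intermediate DataFrame to this bucket might look like the following, where jdoe and my_dataframe are placeholders for your username and your DataFrame:

# Hypothetical placeholders: replace 'jdoe' with your username and
# my_dataframe with the DataFrame you want to share.
output_path = 's3://net-mozaws-prod-us-west-2-pipeline-analysis/jdoe/my_dataset/v1'
my_dataframe.write.mode('overwrite').parquet(output_path)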

When your analysis is production ready, open a PR against python_mozetl.

How to save data

You can save the DataFrame test_dataframe to the telemetry-test-bucket with the following command:

# mode('error') fails the write if the destination path already exists.
test_dataframe.write.mode('error') \
    .parquet('s3://telemetry-test-bucket/my_subdir/table_name')

Note: data saved to s3://telemetry-test-bucket will automatically be deleted after 30 days.
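If the dataset is large, you may also want to partition the output so that downstream reads can skip irrelevant data. A minimal sketch, assuming test_dataframe has a submission_date_s3 column (an assumption about your schema):

# 'submission_date_s3' is an assumed column name; partitionBy writes one
# subdirectory per distinct value, which Spark can prune on read.
test_dataframe.write.mode('overwrite') \
    .partitionBy('submission_date_s3') \
    .parquet('s3://telemetry-test-bucket/my_subdir/partitioned_table')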

Accessing Parquet Tables from Re:dash

See Creating a custom re:dash dataset.