1. Create a Spark notebook that does the transformations you need, either on
   raw data (using the Dataset API) or on Parquet data (see the first sketch
   after this list).
2. Output the results to an S3 location, usually
   `telemetry-parquet/user/$YOUR_DATASET/v$VERSION_NUMBER/submission_date=$YESTERDAY/`.
   This partitions the data by `submission_date`, so each daily run writes its
   output to a new location in S3. Do NOT also put `submission_date` as a
   column inside the Parquet files! A column cannot share its name with a
   partition. Partitioning is optional, but datasets should always have a
   version in the path. The second sketch after this list shows a write that
   follows this convention.
3. Using this template, open a bug to publish the dataset (making it available
   in Spark and Re:dash) with the following attributes:
   - Add whiteboard tag [DataOps]
   - Title: "Publish dataset"
   - Content: Location of the dataset in S3 (from step 2 above) and the
     desired table name
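
As a first sketch, here is roughly what the notebook transformation in step 1
might look like in PySpark. It reads an existing Parquet dataset rather than
raw pings via the Dataset API, and the input path, column names, and
aggregation are placeholders, not a real schema.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# In a notebook the SparkSession usually already exists as `spark`;
# creating it here keeps the sketch self-contained.
spark = SparkSession.builder.appName("my_dataset_etl").getOrCreate()

# Hypothetical input: an existing Parquet dataset. The path and the
# column names below are placeholders.
main_summary = spark.read.parquet("s3://telemetry-parquet/main_summary/v4/")

# Example transformation: one row per client for a single day,
# with a simple aggregated metric.
daily = (
    main_summary
    .where(F.col("submission_date_s3") == "20180101")
    .groupBy("client_id")
    .agg(F.sum("subsession_length").alias("total_subsession_length"))
)
```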
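
The second sketch shows a write along the lines of step 2: the results for one
day go directly into a `submission_date=...` partition directory, and the
`submission_date` column is dropped so the partition value lives only in the
path. The bucket, dataset name, version, and the small stand-in DataFrame are
placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("my_dataset_write").getOrCreate()

# Stand-in results for one day; a real job would use the DataFrame
# produced by the notebook's transformations.
daily = spark.createDataFrame(
    [("client-a", 1200), ("client-b", 300)],
    ["client_id", "total_subsession_length"],
)

# Output path following the convention from step 2 (placeholder
# dataset name and version).
output_path = (
    "s3://telemetry-parquet/user/my_dataset/v1/"
    "submission_date=20180101/"
)

# The partition value belongs in the path only -- do not also keep a
# submission_date column inside the Parquet files.
if "submission_date" in daily.columns:
    daily = daily.drop("submission_date")

daily.write.mode("overwrite").parquet(output_path)
```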