Creating Your Own Dataset to Query in re:dash

  1. Create a Spark notebook that performs the transformations you need, either on the raw data (using the Dataset API) or on existing parquet data
  2. Output the results to an S3 location, usually telemetry-parquet/user/$YOUR_DATASET/v$VERSION_NUMBER/submission_date=$YESTERDAY/. This partitions the dataset by submission_date, so each daily run is written to a new location in S3. Do NOT also include submission_date as a column inside the parquet files themselves: a column name cannot also be the name of a partition. Partitioning is optional, but the path should always include a dataset version.
  3. Using this template, open a bug to publish the dataset (making it available in Spark and Re:dash), with the following attributes:
    • Add whiteboard tag [DataOps]
    • Title: "Publish dataset"
    • Content: the location of the dataset in S3 (from step 2 above) and the desired table name