Skip to content

Latest commit

 

History

History
30 lines (18 loc) · 1002 Bytes

File metadata and controls

30 lines (18 loc) · 1002 Bytes

SparkPipeline

Project Structure

Read data from S3

  • Song data: s3://udacity-dend/song_data
  • Log data: s3://udacity-dend/log_data The script reads song_data and load_data from S3.

Process data using spark

Transforms them to create five different tables listed below :

Fact Table

*songplays - records in log data associated with song plays i.e. records with page NextSong

songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent

Dimension Tables

  • users - users in the app Fields - user_id, first_name, last_name, gender, level

  • songs - songs in music database Fields - song_id, title, artist_id, year, duration

  • artists - artists in music database Fields - artist_id, name, location, lattitude, longitude

  • time - timestamps of records in songplays broken down into specific units Fields - start_time, hour, day, week, month, year, weekday

Load it back to S3

Writes them to partitioned parquet files in table directories on S3.