Cloud Data Warehouse
Project by Berk Hakbilen
A music streaming startup, Sparkify, has grown its user base and song database and wants to move its processes and data onto the cloud. Its data resides in S3, in a directory of JSON logs of user activity in the app and a directory of JSON metadata on the songs in the app.
The task is to build an ETL pipeline that extracts the data from S3, stages it in Redshift, and transforms it into a set of dimensional tables so the analytics team can continue finding insights into what songs users are listening to.
Overview
In this project, I’ll apply what I’ve learned about data warehouses and AWS to build an ETL pipeline for a database hosted on Redshift. To complete the project, I need to load data from S3 into staging tables on Redshift and execute SQL statements that create the analytics tables from these staging tables.
Datasets
We’ll be working with two datasets that reside in S3. Here are the S3 links for each:
- Song data: s3://udacity-dend/song_data
- Log data: s3://udacity-dend/log_data
- Log data json path: s3://udacity-dend/log_json_path.json
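As a rough sketch of how these three S3 locations could feed Redshift's COPY command, the snippet below builds the two staging loads as SQL strings. The helper `build_copy_sql`, the table names `staging_events` and `staging_songs`, and the role ARN are placeholders assumed for illustration; they are not taken from the project files. It also assumes the song staging columns are named after the JSON keys, so that the `'auto'` option can match them.

```python
def build_copy_sql(table, s3_path, iam_role_arn, json_option="auto", region=None):
    """Return a Redshift COPY statement that loads JSON files from S3.

    json_option is either 'auto' (match JSON keys to column names) or the
    S3 path of a JSONPaths file that maps fields to columns.
    """
    lines = [
        f"COPY {table}",
        f"FROM '{s3_path}'",
        f"IAM_ROLE '{iam_role_arn}'",
        f"FORMAT AS JSON '{json_option}'",
    ]
    if region:
        # Only needed when the bucket is in a different region than the cluster.
        lines.append(f"REGION '{region}'")
    return "\n".join(lines) + ";"


# Placeholder ARN (assumption); the real value would come from dwh.cfg.
ROLE_ARN = "arn:aws:iam::123456789012:role/myRedshiftRole"

# Log data uses the JSONPaths file listed above.
staging_events_copy = build_copy_sql(
    "staging_events",
    "s3://udacity-dend/log_data",
    ROLE_ARN,
    json_option="s3://udacity-dend/log_json_path.json",
)

# Song data relies on 'auto' key-to-column matching (assumption).
staging_songs_copy = build_copy_sql(
    "staging_songs", "s3://udacity-dend/song_data", ROLE_ARN
)

print(staging_events_copy)
print(staging_songs_copy)
```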
Song Dataset
The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song’s track ID. For example, here are filepaths to two files in this dataset.
song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
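The partitioning can be sketched in a few lines of Python. Judging by the two example paths, the three directory levels come from the letters that follow the leading "TR" of the track ID; that reading is inferred from the examples only, and `song_file_path` is a hypothetical helper, not part of the project.

```python
def song_file_path(track_id, root="song_data"):
    """Build the partitioned path for a track ID such as 'TRABCEI128F424C983'."""
    # Characters 3 to 5 (right after the "TR" prefix) name the directory levels.
    first, second, third = track_id[2:5]
    return f"{root}/{first}/{second}/{third}/{track_id}.json"


print(song_file_path("TRABCEI128F424C983"))  # song_data/A/B/C/TRABCEI128F424C983.json
print(song_file_path("TRAABJL12903CDCF1A"))  # song_data/A/A/B/TRAABJL12903CDCF1A.json
```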
Content of a sample file:
{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
Log Dataset
The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate activity logs from an imaginary music streaming app, based on configuration settings.
The log files in the dataset we’ll be working with are partitioned by year and month. For example, here are filepaths to two files in this dataset.
log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
Content of a sample file:
{"artist": null, "auth": "Logged In", "firstName": "Walter", "gender": "M", "itemInSession": 0, "lastName": "Frye", "length": null, "level": "free", "location": "San Francisco-Oakland-Hayward, CA", "method": "GET", "page": "Home", "registration": 1540919166796.0, "sessionId": 38, "song": null, "status": 200, "ts": 1541105830796, "userAgent": "\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"", "userId": "39"}
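The ts field in the sample record is a Unix timestamp in milliseconds. As a small illustration of how such a value breaks down into calendar units like the ones the time table needs, here is a sketch in plain Python. The `time_row` helper is hypothetical, and the choices of UTC, ISO week numbers, and Monday=0 weekdays are assumptions made for the example, not something the project prescribes.

```python
from datetime import datetime, timedelta, timezone


def time_row(ts_ms):
    """Split an epoch-milliseconds timestamp into calendar units (UTC)."""
    # Integer arithmetic avoids floating-point rounding of the milliseconds.
    dt = datetime.fromtimestamp(ts_ms // 1000, tz=timezone.utc)
    dt += timedelta(milliseconds=ts_ms % 1000)
    return {
        "start_time": dt,
        "hour": dt.hour,
        "day": dt.day,
        "week": dt.isocalendar()[1],  # ISO week number
        "month": dt.month,
        "year": dt.year,
        "weekday": dt.weekday(),      # Monday = 0 ... Sunday = 6
    }


row = time_row(1541105830796)  # ts from the sample log record
print(row)
```

For the sample value this gives 1 November 2018, 20:57:10 UTC, which falls in ISO week 44. If the same breakdown is done in SQL on Redshift, note that its day-of-week extraction counts from Sunday = 0, so the weekday numbers would differ from this sketch.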
Schema for Song Play Analysis
Using the song and log datasets, we’ll create a star schema optimized for queries on song play analysis. This includes the following tables.
Fact Table
songplays - records in log data associated with song plays i.e. records with page NextSong
songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
Dimension Tables
users - users in the app
user_id, first_name, last_name, gender, level
songs - songs in music database
song_id, title, artist_id, year, duration
artists - artists in music database
artist_id, name, location, latitude, longitude
time - timestamps of records in songplays broken down into specific units
start_time, hour, day, week, month, year, weekday
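To make the fact-table rule concrete (only records with page NextSong become song plays), here is a minimal sketch that runs locally with Python's built-in sqlite3 as a stand-in for Redshift. The staging table names, the reduced column sets, the join on song title and artist name, and the second event row are all assumptions made up for the illustration; only the first event and the song row echo the sample files above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Simplified stand-ins for the staging tables and the fact table.
cur.executescript("""
CREATE TABLE staging_events (page TEXT, ts INTEGER, userId TEXT, level TEXT,
                             song TEXT, artist TEXT, sessionId INTEGER,
                             location TEXT, userAgent TEXT);
CREATE TABLE staging_songs (song_id TEXT, title TEXT, artist_id TEXT,
                            artist_name TEXT);
CREATE TABLE songplays (songplay_id INTEGER PRIMARY KEY AUTOINCREMENT,
                        start_time INTEGER, user_id TEXT, level TEXT,
                        song_id TEXT, artist_id TEXT, session_id INTEGER,
                        location TEXT, user_agent TEXT);
""")

cur.execute(
    "INSERT INTO staging_songs VALUES (?, ?, ?, ?)",
    ("SOUPIRU12A6D4FA1E1", "Der Kleine Dompfaff", "ARJIE2Y1187B994AB7", "Line Renaud"),
)
cur.executemany(
    "INSERT INTO staging_events VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        # Home-page visit from the sample log: must NOT become a song play.
        ("Home", 1541105830796, "39", "free", None, None, 38,
         "San Francisco-Oakland-Hayward, CA", "Mozilla/5.0"),
        # Made-up NextSong event for illustration only.
        ("NextSong", 1541105900000, "39", "free", "Der Kleine Dompfaff",
         "Line Renaud", 38, "San Francisco-Oakland-Hayward, CA", "Mozilla/5.0"),
    ],
)

# Keep only NextSong events and look up song/artist IDs from the song data.
cur.execute("""
INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
                       session_id, location, user_agent)
SELECT e.ts, e.userId, e.level, s.song_id, s.artist_id,
       e.sessionId, e.location, e.userAgent
FROM staging_events AS e
LEFT JOIN staging_songs AS s
       ON e.song = s.title AND e.artist = s.artist_name
WHERE e.page = 'NextSong'
""")
conn.commit()

songplays = cur.execute(
    "SELECT user_id, song_id, artist_id FROM songplays"
).fetchall()
print(songplays)
```

On Redshift the data types and the surrogate key definition would differ, and the raw ts value would need converting to a timestamp for start_time; the filter and join logic is the part this sketch is meant to show.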
Project Files
sql_queries.py -> is where we’ll define our SQL statements, which will be imported into the two other files below.
create_tables.py -> is where we’ll create our fact and dimension tables for the star schema in Redshift.
etl.py -> is where we’ll load data from S3 into staging tables on Redshift and then process that data into our analytics tables on Redshift.
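The two-phase flow of etl.py (stage first, then transform) could be organised roughly as below. This is only a sketch: the function names, the query lists, and the `DryRunConnection` stand-in are hypothetical, and the SQL strings are placeholders for statements that would be imported from sql_queries.py. The stand-in records calls instead of talking to a cluster so that the example runs anywhere.

```python
def load_staging_tables(cur, conn, copy_queries):
    """Run each COPY statement so the raw S3 data lands in the staging tables."""
    for query in copy_queries:
        cur.execute(query)
        conn.commit()


def insert_tables(cur, conn, insert_queries):
    """Run each INSERT ... SELECT that fills an analytics table from staging."""
    for query in insert_queries:
        cur.execute(query)
        conn.commit()


class DryRunConnection:
    """Stand-in for a database connection; it only records what it is asked to run."""

    def __init__(self):
        self.executed = []
        self.commits = 0

    def cursor(self):
        return self  # the same object plays the cursor role

    def execute(self, query):
        self.executed.append(query)

    def commit(self):
        self.commits += 1


# Placeholder statements (assumption); real ones would come from sql_queries.py.
copy_queries = ["COPY staging_events ...", "COPY staging_songs ..."]
insert_queries = ["INSERT INTO songplays ...", "INSERT INTO users ..."]

conn = DryRunConnection()
cur = conn.cursor()
load_staging_tables(cur, conn, copy_queries)   # phase 1: S3 -> staging
insert_tables(cur, conn, insert_queries)       # phase 2: staging -> analytics
print(conn.executed)
```

Committing after every statement keeps each load independent; a real run would pass a connection to the Redshift cluster in place of the stand-in.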
Environment requirements
- Python 3.6
How to run
First, create a Redshift cluster on AWS. Then set up the configuration file ‘dwh.cfg’ with the necessary information: host, db_name, db_user, db_password, db_port, and ARN (IAM role).
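One possible layout for that file, and a way to read it with Python's standard configparser, is sketched below. The section names `CLUSTER` and `IAM_ROLE` and every value shown are placeholders assumed for illustration; use whatever layout the scripts in this repository actually expect.

```python
import configparser

# Example contents of dwh.cfg; all values are placeholders (assumption).
SAMPLE_CFG = """
[CLUSTER]
HOST = example-cluster.abc123.us-west-2.redshift.amazonaws.com
DB_NAME = dev
DB_USER = awsuser
DB_PASSWORD = change-me
DB_PORT = 5439

[IAM_ROLE]
ARN = arn:aws:iam::123456789012:role/myRedshiftRole
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE_CFG)  # with a real file: config.read("dwh.cfg")

cluster = config["CLUSTER"]
# Connection string in the keyword/value form that PostgreSQL drivers accept.
dsn = "host={} dbname={} user={} password={} port={}".format(
    cluster["HOST"],
    cluster["DB_NAME"],
    cluster["DB_USER"],
    cluster["DB_PASSWORD"],
    cluster["DB_PORT"],
)
role_arn = config["IAM_ROLE"]["ARN"]

print(dsn)
print(role_arn)
```

If the scripts connect with a PostgreSQL driver such as psycopg2 (an assumption; the driver is not named here), a string like `dsn` can be passed to its connect function. Keep the real dwh.cfg out of version control, since it holds the database password.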
To create (or reset) the tables:
python create_tables.py
To run the ETL:
python etl.py