Data Modeling with Postgres
Project by Berk Hakbilen
Overview
In this project, we apply data modeling with Postgres and build an ETL pipeline using Python. A startup wants to analyze the data it has been collecting on songs and user activity in its new music streaming app. The data is currently collected in JSON format, and the analytics team is particularly interested in understanding which songs users are listening to.
Song Dataset
The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song.
Content of a sample file: {"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
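A minimal sketch of how one such record could be split into rows for the songs and artists tables described later in this document. The helper name `split_song_record` is hypothetical, and the column order is an assumption taken from the table listings in the schema section.

```python
import json

# Sample record copied from the song dataset description above.
raw = (
    '{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", '
    '"artist_latitude": null, "artist_longitude": null, '
    '"artist_location": "", "artist_name": "Line Renaud", '
    '"song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", '
    '"duration": 152.92036, "year": 0}'
)


def split_song_record(record):
    """Hypothetical helper: return (song_row, artist_row) for one song record."""
    # Column order assumed: song_id, title, artist_id, year, duration
    song_row = (
        record["song_id"],
        record["title"],
        record["artist_id"],
        record["year"],
        record["duration"],
    )
    # Column order assumed: artist_id, name, location, latitude, longitude
    artist_row = (
        record["artist_id"],
        record["artist_name"],
        record["artist_location"],
        record["artist_latitude"],
        record["artist_longitude"],
    )
    return song_row, artist_row


song_row, artist_row = split_song_record(json.loads(raw))
print(song_row)
print(artist_row)
```

JSON `null` values arrive as `None` in Python, which a Postgres driver would typically store as SQL NULL.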
Log Dataset
The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations.
Content of a sample file: {"artist": null, "auth": "Logged In", "firstName": "Walter", "gender": "M", "itemInSession": 0, "lastName": "Frye", "length": null, "level": "free", "location": "San Francisco-Oakland-Hayward, CA", "method": "GET", "page": "Home", "registration": 1540919166796.0, "sessionId": 38, "song": null, "status": 200, "ts": 1541105830796, "userAgent": "\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"", "userId": "39"}
Schema for Song Play Analysis
Using the song and log datasets, we’ll create a star schema optimized for queries on song play analysis. This includes the following tables.
Fact Table
songplays - records in log data associated with song plays, i.e. records with page NextSong
songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
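The NextSong filter can be sketched as follows. This is an illustration only: the function name `is_song_play` is hypothetical, the first event is abridged from the sample log record, and the second event is made up to show a record that passes the filter.

```python
def is_song_play(event):
    """Return True when a log event represents a song play (page == "NextSong")."""
    return event.get("page") == "NextSong"


events = [
    # Abridged from the sample log record (a "Home" page view, not a song play).
    {"page": "Home", "userId": "39", "sessionId": 38, "song": None},
    # Made-up event for illustration only; not taken from the dataset.
    {"page": "NextSong", "userId": "39", "sessionId": 38, "song": "Example Song"},
]

song_plays = [event for event in events if is_song_play(event)]
print(len(song_plays))
```

Only the events kept by this filter would become rows in songplays; page views such as Home are dropped.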
Dimension Tables
users - users in the app
user_id, first_name, last_name, gender, level
songs - songs in music database
song_id, title, artist_id, year, duration
artists - artists in music database
artist_id, name, location, latitude, longitude
time - timestamps of records in songplays broken down into specific units
start_time, hour, day, week, month, year, weekday
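A rough sketch of how a row for the time table could be derived from the `ts` field of a log record. It assumes that `ts` is in milliseconds since the Unix epoch, that times are kept in UTC, that `week` means the ISO week number, and that `weekday` counts from Monday = 0. None of these choices are stated in the project description, and the function name `time_row` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def time_row(ts_ms):
    """Break a millisecond epoch timestamp into the columns of the time table."""
    # Integer millisecond arithmetic avoids float rounding on large timestamps.
    start_time = EPOCH + timedelta(milliseconds=ts_ms)
    return (
        start_time,
        start_time.hour,
        start_time.day,
        start_time.isocalendar()[1],  # ISO week number (assumption)
        start_time.month,
        start_time.year,
        start_time.weekday(),  # Monday = 0 (assumption)
    )


# "ts" value taken from the sample log record.
row = time_row(1541105830796)
print(row[0].isoformat())
print(row[1:])  # (20, 1, 44, 11, 2018, 3)
```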
Project Files
sql_queries.py -> contains all SQL queries for dropping and creating the fact and dimension tables, as well as a select query.
create_tables.py -> drops and creates the tables. Run it to reset the tables before running the etl.py script again.
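The drop-then-create reset pattern might look roughly like the sketch below. This is not the project's actual code: the query lists, the column types, and the function name `reset_tables` are assumptions, and an in-memory SQLite database stands in for Postgres so that the example runs without a server. With Postgres, the cursor would come from a database driver connection instead.

```python
import sqlite3

# Assumed DDL for one table; the column types are not given in this document.
drop_table_queries = ["DROP TABLE IF EXISTS users"]
create_table_queries = [
    """
    CREATE TABLE IF NOT EXISTS users (
        user_id int PRIMARY KEY,
        first_name varchar,
        last_name varchar,
        gender varchar,
        level varchar
    )
    """
]


def reset_tables(cur):
    """Drop every table first, then recreate it, using any DB-API cursor."""
    for query in drop_table_queries + create_table_queries:
        cur.execute(query)


# SQLite in memory is a stand-in for Postgres here (assumption for runnability).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (stale_column int)")  # simulate a leftover table
reset_tables(cur)
conn.commit()

# Inspect the recreated table; the stale column should be gone.
columns = [info[1] for info in cur.execute("PRAGMA table_info(users)")]
print(columns)
conn.close()
```

Dropping before creating is what makes the script safe to rerun: any rows loaded by a previous ETL run are discarded along with the old table.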
etl.ipynb -> reads and processes a single file from song_data and log_data and loads the data into the tables. This notebook contains detailed information on the ETL process for each of the tables.
etl.py -> reads and processes files from song_data and log_data and loads them into your tables.
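Processing every file first requires finding them. A minimal sketch of that step follows, assuming the JSON files may sit in nested subdirectories under song_data and log_data (the directory layout is not described here). The function name `get_json_files` is hypothetical, and the usage example builds a throwaway directory rather than touching real data.

```python
import os
import tempfile


def get_json_files(root):
    """Return the sorted absolute paths of all .json files under root."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".json"):
                found.append(os.path.abspath(os.path.join(dirpath, name)))
    return sorted(found)


# Usage with a temporary directory tree (illustrative file names only).
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "A", "B"))
    for parts in (("A", "one.json"), ("A", "B", "two.json"), ("A", "notes.txt")):
        open(os.path.join(root, *parts), "w").close()
    files = get_json_files(root)
    names = sorted(os.path.basename(path) for path in files)
    print(names)
```

Each returned path could then be passed to a per-file processing function, one for song files and one for log files.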
test.ipynb -> displays the first few rows of each table to let you check your database.
Environment requirements
- Python 3.6
- psql (PostgreSQL) 9.5.23
How to run
To run the main program:
python main.py
To create (or reset) the tables:
python create_tables.py
To run the ETL:
python etl.py