Data Modeling with Postgres

Project by Berk Hakbilen

Overview

In this project, we apply data modeling with Postgres and build an ETL pipeline using Python. A startup wants to analyze the data it has been collecting on songs and user activity in its new music streaming app. The data is currently collected in JSON format, and the analytics team is particularly interested in understanding which songs users are listening to.

Song Dataset

The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song.

Content of a sample file: {"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
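As a minimal sketch of how such a record can be read in Python, the snippet below parses the sample above with the standard json module. The variable names are illustrative and are not taken from the project's code.

```python
import json

# The sample song record shown above, as a JSON string.
raw = (
    '{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", '
    '"artist_latitude": null, "artist_longitude": null, '
    '"artist_location": "", "artist_name": "Line Renaud", '
    '"song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", '
    '"duration": 152.92036, "year": 0}'
)

song = json.loads(raw)

# JSON null is parsed as Python None, so missing coordinates are easy to detect.
print(song["title"], song["duration"])
print(song["artist_latitude"] is None)
```

Note that a missing release year appears as 0 in this sample rather than as null.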

Log Dataset

The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These files simulate activity logs from a music streaming app based on specified configurations.

Content of a sample file: {"artist": null, "auth": "Logged In", "firstName": "Walter", "gender": "M", "itemInSession": 0, "lastName": "Frye", "length": null, "level": "free", "location": "San Francisco-Oakland-Hayward, CA", "method": "GET", "page": "Home", "registration": 1540919166796.0, "sessionId": 38, "song": null, "status": 200, "ts": 1541105830796, "userAgent": "\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"", "userId": "39"}

Schema for Song Play Analysis

Using the song and log datasets, we’ll create a star schema optimized for queries on song play analysis. This includes the following tables.

Fact Table

songplays - records in log data associated with song plays, i.e. records with page NextSong

songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
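The fact table is built only from log records whose page is NextSong. A minimal sketch of that filter follows. It assumes each log file holds one JSON object per line, which the text above does not state. The second sample record and the function name song_play_events are hypothetical.

```python
import json

# Two log lines reduced to the fields used here. The first comes from the
# sample record above; the second is a made-up NextSong event.
lines = [
    '{"page": "Home", "userId": "39", "song": null, "ts": 1541105830796}',
    '{"page": "NextSong", "userId": "8", "song": "Some Song", "ts": 1541106106796}',
]


def song_play_events(raw_lines):
    """Return only the records whose page is NextSong (assumed one JSON object per line)."""
    records = (json.loads(line) for line in raw_lines)
    return [record for record in records if record.get("page") == "NextSong"]


plays = song_play_events(lines)
print(len(plays))
```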

Dimension Tables

users - users in the app

user_id, first_name, last_name, gender, level

songs - songs in music database

song_id, title, artist_id, year, duration

artists - artists in music database

artist_id, name, location, latitude, longitude

time - timestamps of records in songplays broken down into specific units

start_time, hour, day, week, month, year, weekday
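The breakdown of a timestamp into the time table's columns could look like the sketch below. It assumes the ts field in the logs is milliseconds since the Unix epoch (the sample value is consistent with this), that times are interpreted in UTC, that week means the ISO week number, and that weekday counts from Monday as 0. The function name time_row is hypothetical.

```python
from datetime import datetime, timezone


def time_row(ts_ms):
    """Break a millisecond epoch timestamp into the columns of the time table."""
    start_time = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return (
        start_time,
        start_time.hour,
        start_time.day,
        start_time.isocalendar()[1],  # ISO week number (assumption)
        start_time.month,
        start_time.year,
        start_time.weekday(),  # Monday is 0 (assumption)
    )


# ts value taken from the sample log record.
row = time_row(1541105830796)
print(row[1:])  # (20, 1, 44, 11, 2018, 3)
```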

Project Files

sql_queries.py -> contains all SQL queries for dropping and creating the fact and dimension tables, as well as a select query.
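For illustration, a file of this kind might define its queries as plain Python strings, as in the sketch below. The column names come from the schema above, but the variable names, column types, and constraints are assumptions and may differ from the actual sql_queries.py.

```python
# Hypothetical excerpt in the style of sql_queries.py.
# Column types and constraints are assumptions, not the project's definitions.

table_names = ["songplays", "users", "songs", "artists", "time"]

# One DROP statement per table, so the tables can be reset safely.
drop_table_queries = [f"DROP TABLE IF EXISTS {name};" for name in table_names]

songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""

user_table_create = """
CREATE TABLE IF NOT EXISTS users (
    user_id    INT PRIMARY KEY,
    first_name VARCHAR,
    last_name  VARCHAR,
    gender     VARCHAR,
    level      VARCHAR
);
"""

# The remaining dimension tables (songs, artists, time) would follow the same pattern.
create_table_queries = [songplay_table_create, user_table_create]
```

Keeping the statements in lists lets a script such as create_tables.py loop over them instead of naming each query individually.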

create_tables.py -> drops and creates the tables. Should be used to reset the tables before running the etl.py script again.

etl.ipynb -> reads and processes a single file from song_data and log_data and loads the data into the tables. This notebook contains detailed information on the ETL process for each of the tables.

etl.py -> reads and processes files from song_data and log_data and loads them into your tables.
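Processing every file in a dataset first requires collecting the file paths. The sketch below shows one way to do that with the standard library. The function name get_files and the directory path in the usage line are assumptions, not the project's actual code.

```python
import glob
import os


def get_files(root):
    """Return the absolute paths of all .json files under root, sorted."""
    pattern = os.path.join(root, "**", "*.json")
    # recursive=True lets "**" match any depth of subdirectories, including none.
    return sorted(os.path.abspath(path) for path in glob.glob(pattern, recursive=True))


# Hypothetical usage; the directory layout is an assumption.
song_files = get_files("data/song_data")
print(len(song_files), "files found")
```

Each returned path could then be passed to a per-file processing function that parses the JSON and inserts rows into the tables.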

test.ipynb -> displays the first few rows of each table to let you check your database.

Environment requirements

Python 3.6
psql (PostgreSQL) 9.5.23

How to run

To run the main program: python main.py

To create (or reset) the tables: python create_tables.py

To run the ETL: python etl.py