Airflow Data Pipeline for AWS S3 -> Redshift
Project by Berk Hakbilen
Introduction
A project to create high-grade data pipelines that are dynamic, built from reusable tasks, can be monitored, and allow easy backfills.
The source data resides in S3 and needs to be processed into Sparkify's data warehouse in Amazon Redshift. The source datasets consist of JSON logs describing user activity in the application and JSON metadata about the songs the users listen to.
Project Directory Structure
airflow-data-pipeline
│   README.md                 # Readme
│   docker-compose.yml        # Docker container configuration
│   requirements.txt          # Requirements file listing the Python dependencies
│
└───airflow                   # Main Airflow directory
    ├───dags
    │       main_dag.py       # Python file to create and configure the DAG
    │       create_tables.sql # SQL queries to create the staging, fact, and dimension tables on Redshift
    │
    └───plugins
        ├───helpers
        │       sql_queries.py     # SQL queries to insert data into the tables
        │
        └───operators
                data_quality.py    # DataQualityOperator
                load_dimension.py  # LoadDimensionOperator
                load_fact.py       # LoadFactOperator
                stage_redshift.py  # StageToRedshiftOperator
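To show how these pieces fit together, here is a minimal sketch of the kind of wiring main_dag.py performs with the custom operators. All task names, operator parameters, S3 paths, and table names below are illustrative assumptions, not the exact values used in the project; see main_dag.py for the real configuration.

# Illustrative sketch only; the actual parameters live in main_dag.py.
# Connection IDs, bucket, keys, and table names here are assumptions.
from datetime import datetime, timedelta

from airflow import DAG
# Import path depends on how the plugin registers the custom operators
from operators import (StageToRedshiftOperator, LoadFactOperator,
                       LoadDimensionOperator, DataQualityOperator)

default_args = {
    "owner": "sparkify",
    "start_date": datetime(2019, 1, 12),
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "depends_on_past": False,
}

dag = DAG(
    "main_dag",
    default_args=default_args,
    description="Load and transform data in Redshift with Airflow",
    schedule_interval="@hourly",
)

stage_events = StageToRedshiftOperator(
    task_id="Stage_events",
    dag=dag,
    redshift_conn_id="redshift",           # Airflow connection configured below
    aws_credentials_id="aws_credentials",  # Airflow connection configured below
    table="staging_events",
    s3_bucket="my-source-bucket",          # hypothetical bucket name
    s3_key="log_data",
)

load_songplays = LoadFactOperator(
    task_id="Load_songplays_fact_table",
    dag=dag,
    redshift_conn_id="redshift",
    table="songplays",
)

load_users = LoadDimensionOperator(
    task_id="Load_users_dim_table",
    dag=dag,
    redshift_conn_id="redshift",
    table="users",
)

run_quality_checks = DataQualityOperator(
    task_id="Run_data_quality_checks",
    dag=dag,
    redshift_conn_id="redshift",
    tables=["songplays", "users"],
)

stage_events >> load_songplays >> load_users >> run_quality_checks

The dependency chain at the end defines the order Airflow follows: stage the raw JSON from S3, load the fact table, load the dimension tables, then run the data quality checks.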
Requirements
- Python3
- Docker
- Docker Compose
- AWS account and Redshift cluster
Quick Start
Clone the repository to your local machine
Clone the repository into a local directory of your choice.
Create a Python virtual environment
python3 -m venv venv # create virtualenv
source venv/bin/activate # activate virtualenv
pip install -r requirements.txt # install requirements
Start the Docker containers
If you are satisfied with the default configuration, you can start the containers directly:
docker-compose up
Check the Airflow UI
Go to http://localhost:8080 and log in with the default credentials:
Username: user
Password: password
Configure Airflow connections for AWS and Redshift
- Click on the Admin tab and select Connections.
- Under Connections, select Create.
- On the create connection page, enter the following values:
- Conn Id: Enter aws_credentials.
- Conn Type: Enter Amazon Web Services.
- Login: Enter your Access key ID from the IAM User credentials.
- Password: Enter your Secret access key from the IAM User credentials.
Once you’ve entered these values, select Save and Add Another.
- On the next create connection page, enter the following values:
- Conn Id: Enter redshift.
- Conn Type: Enter Postgres.
- Host: Enter the endpoint of your Redshift cluster without the port at the end.
- Schema: Enter the Redshift database you want to connect to.
- Password: Enter the password you created when launching your Redshift cluster.
- Port: Enter 5439.
Once you’ve entered these values, select Save.
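Behind the scenes, the custom operators look these connections up through Airflow hooks. Below is a minimal sketch of that pattern, assuming Airflow 1.10-style import paths (newer Airflow versions move these hooks into provider packages); the bucket and table names are hypothetical.

# Sketch of how the operators can consume the two connections defined above.
# Import paths assume an Airflow 1.10-style installation.
from airflow.contrib.hooks.aws_hook import AwsHook
from airflow.hooks.postgres_hook import PostgresHook

# "aws_credentials" resolves to the Amazon Web Services connection created above
aws_hook = AwsHook("aws_credentials")
credentials = aws_hook.get_credentials()  # yields the access key / secret key pair

# "redshift" resolves to the Postgres-type connection pointing at the cluster
redshift = PostgresHook(postgres_conn_id="redshift")

# Example: copy staged JSON from S3 into Redshift (bucket/table are hypothetical)
copy_sql = """
    COPY staging_events
    FROM 's3://my-source-bucket/log_data'
    ACCESS_KEY_ID '{}'
    SECRET_ACCESS_KEY '{}'
    FORMAT AS JSON 'auto'
""".format(credentials.access_key, credentials.secret_key)
redshift.run(copy_sql)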
Start the DAG
Start the DAG by switching its state from OFF to ON.
Refresh the page and click on main_dag to view its current state.
Check for any errors that may come up in the pipeline.
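If a run fails, the task logs in the Airflow UI show which step broke. The final DataQualityOperator task is there to catch silent failures; below is a rough sketch of the kind of row-count check data_quality.py might perform. The helper is hypothetical and reuses the redshift connection from the previous section.

# Hypothetical helper illustrating a basic data quality check;
# the real checks live in plugins/operators/data_quality.py.
from airflow.hooks.postgres_hook import PostgresHook

def check_table_has_rows(table, redshift_conn_id="redshift"):
    """Raise if the given table is empty, so the DAG run is marked as failed."""
    redshift = PostgresHook(postgres_conn_id=redshift_conn_id)
    records = redshift.get_records("SELECT COUNT(*) FROM {}".format(table))
    if not records or not records[0] or records[0][0] < 1:
        raise ValueError("Data quality check failed: {} returned no rows".format(table))
    return records[0][0]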