A full-stack data pipeline to extract, transform, and load Reddit data into Amazon Redshift using Apache Airflow, Celery, PostgreSQL, S3, AWS Glue, and Athena.
This end-to-end data pipeline is built to automate the ETL (Extract, Transform, Load) process for Reddit data. The flow involves pulling data from Reddit’s API, staging it in S3, transforming it using AWS Glue and Athena, and loading the final dataset into Amazon Redshift for analytics.
The pipeline includes the following components:
- Reddit API – Acts as the primary data source.
- Apache Airflow + Celery – Handles orchestration and distributed task execution.
- PostgreSQL – Stores intermediate data and Airflow metadata.
- Amazon S3 – Stores raw JSON files pulled from Reddit.
- AWS Glue – Manages the schema and performs transformation tasks.
- Amazon Athena – Runs SQL-based transformations directly on S3.
- Amazon Redshift – Final destination for analytics-ready structured data.
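To make the transform stage concrete, a raw Reddit API post (a JSON object) is typically flattened into a fixed set of columns before it is staged in S3 and queried with Athena. The sketch below is illustrative only; the field list (`id`, `title`, `score`, etc.) is an assumption, not this repository's actual schema.

```python
import json
from datetime import datetime, timezone

# Hypothetical subset of fields to keep; the real pipeline's schema may differ.
FIELDS = ["id", "title", "score", "num_comments", "author", "created_utc"]

def flatten_post(raw_post: dict) -> dict:
    """Flatten one Reddit API post dict into a row with a fixed column set."""
    row = {field: raw_post.get(field) for field in FIELDS}
    # Convert the epoch timestamp to an ISO-8601 string for easier querying.
    if row["created_utc"] is not None:
        row["created_utc"] = datetime.fromtimestamp(
            row["created_utc"], tz=timezone.utc
        ).isoformat()
    return row

# Example: a trimmed-down post as the Reddit API might return it.
raw = {
    "id": "abc123",
    "title": "Example post",
    "score": 42,
    "num_comments": 7,
    "author": "someuser",
    "created_utc": 1700000000,
    "selftext": "dropped field",
}
print(json.dumps(flatten_post(raw)))
```

Fields outside the column set are dropped, so every row written to S3 has the same shape regardless of what extra keys the API returns.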
Ensure you have the following:
- An AWS account with access to S3, Glue, Athena, and Redshift
- Reddit API credentials
- Python 3.9+ installed
- Docker and Docker Compose installed
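Since the raw JSON lands in S3 before Glue and Athena pick it up, a common convention is to partition object keys by date so Athena can prune scans. The helper below is a sketch of that idea; the prefix and key layout are placeholders, not the repository's actual S3 structure.

```python
from datetime import date

def raw_s3_key(run_date: date, subreddit: str, prefix: str = "raw") -> str:
    """Build a Hive-style partitioned S3 key for a day's raw Reddit dump.

    Hive-style partitions (year=/month=/day=) let a Glue crawler register
    the partitions automatically and let Athena skip irrelevant data.
    """
    return (
        f"{prefix}/subreddit={subreddit}/"
        f"year={run_date.year}/month={run_date.month:02d}/day={run_date.day:02d}/"
        f"reddit_{run_date.isoformat()}.json"
    )

print(raw_s3_key(date(2024, 3, 5), "dataengineering"))
```

With this layout, an Athena query filtered on `year`, `month`, and `day` only reads the matching S3 prefixes instead of scanning the whole bucket.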
Clone the repository:
git clone https://github.com/airscholar/RedditDataEngineering.git
cd RedditDataEngineering

Create and activate a virtual environment:
python3 -m venv venv
source venv/bin/activate

Install Python dependencies:

pip install -r requirements.txt

Configure credentials:
mv config/config.conf.example config/config.conf
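For orientation, a config file for this kind of pipeline typically groups Reddit and AWS settings into INI sections. The sections and keys below are illustrative guesses, not the repository's actual schema; follow the keys present in config.conf.example.

```ini
[reddit]
client_id = YOUR_REDDIT_CLIENT_ID
client_secret = YOUR_REDDIT_CLIENT_SECRET
user_agent = RedditDataEngineering/0.1

[aws]
aws_access_key_id = YOUR_AWS_ACCESS_KEY
aws_secret_access_key = YOUR_AWS_SECRET_KEY
region = us-east-1
bucket = your-s3-bucket-name
```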
# Edit config.conf with your Reddit and AWS credentials

Start the services using Docker Compose:
docker-compose up -d

Access the Airflow UI:
http://localhost:8080

Use the Airflow UI to trigger and monitor the ETL DAG, which runs the Reddit data flow from extraction through loading into Redshift.
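For a sense of the Athena transformation stage, the SQL that reshapes the staged S3 data might resemble the query below, shown here as a Python string the pipeline could pass to Athena. The database, table, and column names are invented for illustration; the real ones live in the repository's Glue and Athena setup.

```python
# Hypothetical Athena query for the transformation step.
ATHENA_QUERY = """
SELECT
    id,
    title,
    score,
    num_comments,
    author,
    from_unixtime(created_utc) AS created_at
FROM reddit_db.raw_posts
WHERE score >= {min_score}
"""

def build_query(min_score: int = 0) -> str:
    """Fill in the score threshold; Athena receives the finished SQL text."""
    return ATHENA_QUERY.format(min_score=min_score)

print(build_query(10))
```

Because Athena queries the JSON files in place on S3, the transformed result can be unloaded back to S3 and then copied into Redshift as the final load step.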
