Reddit Data Engineering Pipeline 🚀

A full-stack data pipeline to extract, transform, and load Reddit data into Amazon Redshift using Apache Airflow, Celery, PostgreSQL, S3, AWS Glue, and Athena.


📌 Table of Contents

  • 📄 Project Summary
  • 🏗 Architecture
  • ✅ Prerequisites
  • ⚙️ Environment Setup
  • 🚀 Running the Pipeline

📄 Project Summary

This end-to-end data pipeline is built to automate the ETL (Extract, Transform, Load) process for Reddit data. The flow involves pulling data from Reddit’s API, staging it in S3, transforming it using AWS Glue and Athena, and loading the final dataset into Amazon Redshift for analytics.
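As a rough sketch of the transform step, the raw Reddit API JSON can be flattened into a handful of analytics columns before loading. The field names below are illustrative assumptions, not the project's actual schema:

```python
# Illustrative sketch: flatten raw Reddit post JSON into tabular rows.
# The selected fields are assumptions; the real pipeline defines its own schema.

def flatten_posts(raw_posts):
    """Select and flatten the fields worth loading into Redshift."""
    rows = []
    for post in raw_posts:
        data = post.get("data", post)  # Reddit listings nest payloads under "data"
        rows.append({
            "id": data.get("id"),
            "title": data.get("title"),
            "author": data.get("author"),
            "score": data.get("score", 0),
            "num_comments": data.get("num_comments", 0),
            "created_utc": data.get("created_utc"),
            "subreddit": data.get("subreddit"),
        })
    return rows

if __name__ == "__main__":
    sample = [{"data": {"id": "abc1", "title": "Hello", "author": "u1",
                        "score": 42, "num_comments": 3,
                        "created_utc": 1719150000, "subreddit": "dataengineering"}}]
    print(flatten_posts(sample)[0]["score"])  # → 42
```

Keeping this projection explicit makes the downstream Redshift table schema predictable even if Reddit adds fields to its API responses.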


🏗 Architecture

The pipeline includes the following components:

  • Reddit API – Acts as the primary data source.
  • Apache Airflow + Celery – Handles orchestration and distributed task execution.
  • PostgreSQL – Stores intermediate data and Airflow metadata.
  • Amazon S3 – Stores raw JSON files pulled from Reddit.
  • AWS Glue – Manages the schema and performs transformation tasks.
  • Amazon Athena – Runs SQL-based transformations directly on S3.
  • Amazon Redshift – Final destination for analytics-ready structured data.

Architecture diagram (see the image in the repository).

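One way the S3 staging layout could be organized is with Hive-style date-partitioned keys, which lets Glue and Athena prune partitions instead of scanning the whole bucket. The prefix scheme below is an assumption, not necessarily the repo's actual layout:

```python
# Sketch of a date-partitioned S3 key scheme for the raw extracts.
# The "raw/reddit" prefix and file name are hypothetical.
from datetime import date

def raw_s3_key(run_date: date, subreddit: str) -> str:
    """Build a Hive-style partitioned key so Athena/Glue can prune by date."""
    return (
        f"raw/reddit/subreddit={subreddit}/"
        f"year={run_date.year}/month={run_date.month:02d}/day={run_date.day:02d}/"
        f"posts.json"
    )

if __name__ == "__main__":
    print(raw_s3_key(date(2025, 6, 23), "dataengineering"))
    # → raw/reddit/subreddit=dataengineering/year=2025/month=06/day=23/posts.json
```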

✅ Prerequisites

Ensure you have the following:

  • An AWS account with access to S3, Glue, Athena, and Redshift
  • Reddit API credentials
  • Python 3.9+ installed
  • Docker and Docker Compose installed

⚙️ Environment Setup

Clone the repository:

git clone https://github.com/cntejaswini/dataengineering-reddit-etl-pipeline.git
cd dataengineering-reddit-etl-pipeline

Create and activate a virtual environment:

python3 -m venv venv
source venv/bin/activate

Install Python dependencies:

pip install -r requirements.txt

Configure credentials:

cp config/config.conf.example config/config.conf
# Edit config.conf with your Reddit and AWS credentials
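The exact keys depend on what config.conf.example ships with, but a typical layout pairs a Reddit section with an AWS section. Every key and value below is illustrative only:

```ini
; Illustrative only - match the keys in config.conf.example
[reddit]
client_id = YOUR_CLIENT_ID
client_secret = YOUR_CLIENT_SECRET
user_agent = reddit-etl-pipeline

[aws]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
region = us-east-1
s3_bucket = your-raw-data-bucket
```

Keep this file out of version control, since it holds live credentials.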

🚀 Running the Pipeline

Start the services using Docker Compose:

docker-compose up -d

Access the Airflow UI:

http://localhost:8080

Use Airflow to trigger and monitor the ETL DAG that processes Reddit data from extraction to loading in Redshift.
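The DAG's extract-to-load flow can be pictured as a chain of task functions, each handing its output to the next. The stubs below stand in for the real Airflow tasks and AWS calls; all names are illustrative:

```python
# Minimal in-memory sketch of the DAG's stage handoff.
# Real tasks would call the Reddit API, boto3, Glue, Athena, and Redshift.
import json

def extract():
    # Stand-in for the Reddit API call.
    return [{"id": "p1", "title": "post one", "score": 7}]

def stage_to_s3(posts):
    # Stand-in for uploading raw JSON to S3; returns the staged payload.
    return json.dumps(posts)

def transform(staged_json):
    # Stand-in for the Glue/Athena transformation: keep analytics columns.
    return [{"id": p["id"], "score": p["score"]} for p in json.loads(staged_json)]

def load_to_redshift(rows):
    # Stand-in for the COPY into Redshift; returns the number of rows loaded.
    return len(rows)

def run_pipeline():
    return load_to_redshift(transform(stage_to_s3(extract())))

if __name__ == "__main__":
    print(run_pipeline())  # → 1
```

In the actual DAG, each stage is a separate Airflow task so failures can be retried independently rather than rerunning the whole pipeline.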
