A full-stack data pipeline to extract, transform, and load Reddit data into Amazon Redshift using Apache Airflow, Celery, PostgreSQL, S3, AWS Glue, and Athena.
This end-to-end data pipeline is built to automate the ETL (Extract, Transform, Load) process for Reddit data. The flow involves pulling data from Reddit’s API, staging it in S3, transforming it using AWS Glue and Athena, and loading the final dataset into Amazon Redshift for analytics.
The pipeline includes the following components:
- Reddit API – Acts as the primary data source.
- Apache Airflow + Celery – Handles orchestration and distributed task execution.
- PostgreSQL – Stores intermediate data and Airflow metadata.
- Amazon S3 – Stores raw JSON files pulled from Reddit.
- AWS Glue – Manages the schema and performs transformation tasks.
- Amazon Athena – Runs SQL-based transformations directly on S3.
- Amazon Redshift – Final destination for analytics-ready structured data.
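To make the transform stage concrete, a raw Reddit API post (a JSON object) is typically flattened into a fixed set of columns before it is staged in S3 and queried with Athena. The sketch below is illustrative only; the field list (`id`, `title`, `score`, etc.) is an assumption, not this repository's actual schema.

```python
import json
from datetime import datetime, timezone

# Hypothetical subset of fields to keep; the real pipeline's schema may differ.
FIELDS = ["id", "title", "score", "num_comments", "author", "created_utc"]

def flatten_post(raw_post: dict) -> dict:
    """Flatten one Reddit API post dict into a row with a fixed column set."""
    row = {field: raw_post.get(field) for field in FIELDS}
    # Convert the epoch timestamp to an ISO-8601 string for easier querying.
    if row["created_utc"] is not None:
        row["created_utc"] = datetime.fromtimestamp(
            row["created_utc"], tz=timezone.utc
        ).isoformat()
    return row

# Example: a trimmed-down post as the Reddit API might return it.
raw = {
    "id": "abc123",
    "title": "Example post",
    "score": 42,
    "num_comments": 7,
    "author": "someuser",
    "created_utc": 1700000000,
    "selftext": "dropped field",
}
print(json.dumps(flatten_post(raw)))
```

Fields outside the column set are dropped, so every row written to S3 has the same shape regardless of what extra keys the API returns.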
Ensure you have the following:
- An AWS account with access to S3, Glue, Athena, and Redshift
- Reddit API credentials
- Python 3.9+ installed
- Docker and Docker Compose installed
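Since the raw JSON lands in S3 before Glue and Athena pick it up, a common convention is to partition object keys by date so Athena can prune scans. The helper below is a sketch of that idea; the prefix and key layout are placeholders, not the repository's actual S3 structure.

```python
from datetime import date

def raw_s3_key(run_date: date, subreddit: str, prefix: str = "raw") -> str:
    """Build a Hive-style partitioned S3 key for a day's raw Reddit dump.

    Hive-style partitions (year=/month=/day=) let a Glue crawler register
    the partitions automatically and let Athena skip irrelevant data.
    """
    return (
        f"{prefix}/subreddit={subreddit}/"
        f"year={run_date.year}/month={run_date.month:02d}/day={run_date.day:02d}/"
        f"reddit_{run_date.isoformat()}.json"
    )

print(raw_s3_key(date(2024, 3, 5), "dataengineering"))
```

With this layout, an Athena query filtered on `year`, `month`, and `day` only reads the matching S3 prefixes instead of scanning the whole bucket.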
Clone the repository:
git clone https://github.com/airscholar/RedditDataEngineering.git
cd RedditDataEngineering

Create and activate a virtual environment:
python3 -m venv venv
source venv/bin/activate

Install Python dependencies:

pip install -r requirements.txt

Configure credentials:
mv config/config.conf.example config/config.conf
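For orientation, a config file for this kind of pipeline typically groups Reddit and AWS settings into INI sections. The sections and keys below are illustrative guesses, not the repository's actual schema; follow the keys present in config.conf.example.

```ini
[reddit]
client_id = YOUR_REDDIT_CLIENT_ID
client_secret = YOUR_REDDIT_CLIENT_SECRET
user_agent = RedditDataEngineering/0.1

[aws]
aws_access_key_id = YOUR_AWS_ACCESS_KEY
aws_secret_access_key = YOUR_AWS_SECRET_KEY
region = us-east-1
bucket = your-s3-bucket-name
```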
# Edit config.conf with your Reddit and AWS credentials

Start the services using Docker Compose:
docker-compose up -d

Access the Airflow UI:
http://localhost:8080

Use the Airflow UI to trigger and monitor the ETL DAG, which runs the Reddit data flow from extraction through loading into Redshift.
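For a sense of the Athena transformation stage, the SQL that reshapes the staged S3 data might resemble the query below, shown here as a Python string the pipeline could pass to Athena. The database, table, and column names are invented for illustration; the real ones live in the repository's Glue and Athena setup.

```python
# Hypothetical Athena query for the transformation step.
ATHENA_QUERY = """
SELECT
    id,
    title,
    score,
    num_comments,
    author,
    from_unixtime(created_utc) AS created_at
FROM reddit_db.raw_posts
WHERE score >= {min_score}
"""

def build_query(min_score: int = 0) -> str:
    """Fill in the score threshold; Athena receives the finished SQL text."""
    return ATHENA_QUERY.format(min_score=min_score)

print(build_query(10))
```

Because Athena queries the JSON files in place on S3, the transformed result can be unloaded back to S3 and then copied into Redshift as the final load step.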
