This project provides a comprehensive data pipeline solution to extract, transform, and load (ETL) Reddit data into a Redshift data warehouse. The pipeline leverages a combination of tools and services including Apache Airflow, Celery, PostgreSQL, Amazon S3, AWS Glue, Amazon Athena, and Amazon Redshift.
- Data Pipeline with Reddit, Airflow, Celery, Postgres, S3, AWS Glue, Athena, and Redshift
- Some Dashboards
The pipeline is designed to:
- Extract data from Reddit using its API.
- Store the raw data into an S3 bucket from Airflow.
- Transform the data using AWS Glue and Amazon Athena.
- Load the transformed data into Amazon Redshift for analytics and querying.
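As a rough illustration of the first two steps, the sketch below turns a list of extracted posts into CSV text and builds a dated object key for the raw upload. The field names, the raw/ prefix, and both helper functions are assumptions made for this example, not the repository's actual code. The upload itself is left out.

```python
import csv
import io
from datetime import date

# Hypothetical subset of post fields; the real pipeline may extract others.
FIELDS = ["id", "title", "score", "num_comments"]


def posts_to_csv(posts, fields=FIELDS):
    """Serialize post dicts to CSV text, keeping only the chosen fields."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for post in posts:
        writer.writerow(post)
    return buffer.getvalue()


def raw_object_key(subreddit, run_date):
    """Build an object key under an assumed raw/ prefix, dated by run day."""
    return f"raw/reddit_{subreddit}_{run_date:%Y%m%d}.csv"


if __name__ == "__main__":
    sample = [
        {"id": "abc", "title": "Hello, world", "score": 10,
         "num_comments": 2, "extra": "dropped"},
    ]
    print(raw_object_key("dataengineering", date(2025, 1, 11)))
    print(posts_to_csv(sample), end="")
```

Keeping the serialization separate from the upload makes the extract step easy to check locally before any AWS credentials are involved.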
- Reddit API: Source of the data.
- Apache Airflow & Celery: Orchestrates the ETL process and manages task distribution.
- PostgreSQL: Temporary storage and metadata management.
- Amazon S3: Raw data storage.
- AWS Glue: Data cataloging and ETL jobs.
- Amazon Athena: SQL-based data transformation.
- Amazon Redshift: Data warehousing and analytics.
- AWS Account with appropriate permissions for S3, Glue, Athena, and Redshift.
- Reddit API credentials.
- Docker Installation.
- Python 3.9 or higher (e.g., 3.10.16).
- Clone the repository.
git clone https://github.com/bdbao/etl-reddit.git
- Create a virtual environment.
python -m venv venv
- Activate the virtual environment.
source venv/bin/activate
- Install the dependencies.
pip install -r requirements.txt
- Rename the configuration file and add your credentials to it.
mv config/config.conf.example config/config.conf
- Start the containers.
docker-compose up -d
- Launch the Airflow web UI.
open http://localhost:8080
- With fish:
python3.10 -m venv venv
source venv/bin/activate.fish
- How to Get reddit_secret_key and reddit_client_id for config.conf:
- Log in to Reddit
- Go to https://www.reddit.com.
- Log in with your Reddit account.
- Access the Developer Portal
- Create a New Application
- Scroll down and click "Create App" or "Create Another App".
- Name: Provide a name for your app (e.g., "Reddit ETL App").
- App Type: Select script.
- Description: Provide a short description of your app.
- About URL: (Optional) Leave blank.
- Redirect URI: Set it to http://localhost:8080. (Required for OAuth but can be left as is for now.)
- Permissions: No additional permissions are needed for basic data extraction.
- Retrieve Client ID and Secret Key
- Once created, the Client ID will appear at the top of the application page (a string under your app name and "personal use script").
- The Secret Key will be listed under the "secret" section.
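Once both Reddit values are in config/config.conf, a quick check like the one below can confirm they are present before the containers are started. This is a minimal sketch: the [api_keys] section name is an assumption about the file layout (config.conf.example shows the real one), and missing_reddit_keys is a hypothetical helper. Only the key names reddit_client_id and reddit_secret_key come from the notes above.

```python
import configparser

REQUIRED_KEYS = ("reddit_client_id", "reddit_secret_key")


def missing_reddit_keys(config_text, section="api_keys"):
    """Return the required keys that are absent or empty in the section."""
    # Interpolation is disabled so a literal % in a secret is not misread.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    if not parser.has_section(section):
        return list(REQUIRED_KEYS)
    return [
        key for key in REQUIRED_KEYS
        if not parser.get(section, key, fallback="").strip()
    ]


if __name__ == "__main__":
    # Placeholder content; the section name is assumed, the values are fake.
    sample = "[api_keys]\nreddit_client_id = abc123\nreddit_secret_key =\n"
    print(missing_reddit_keys(sample))
```

To check the real file, read config/config.conf into a string and pass it to the function.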
- How to Get AWS Credentials and Bucket Details for config.conf:
- Create or Access Your AWS Account
- Go to https://aws.amazon.com and log in to your AWS account.
- If you don't have an AWS account, create a new one.
- Access AWS IAM (Identity and Access Management)
- Open the IAM Console.
- Navigate to the Users section.
- Either select an existing user or create a new one if necessary.
- Get Your AWS Access Key and Secret Key
- If creating a new user:
  - Assign the necessary permissions (e.g., AmazonS3FullAccess, AmazonS3ReadOnlyAccess).
  - After creating the user, you will be shown the Access Key ID and Secret Access Key.
  - Copy them securely, as you won't be able to view them again.
- If using an existing user:
  - Go to Users > Security Credentials tab.
  - Under the Access Keys section, you can create a new key if none exists.
- Get the AWS Session Token (Optional for Temporary Credentials)
- If you're using temporary credentials, you will need a Session Token.
- Generate temporary credentials using the AWS CLI:
aws sts get-session-token --duration-seconds 3600
- Set the AWS Region
- The AWS Region is the data center where your S3 bucket resides.
- Example: us-east-1 for US East (N. Virginia).
- Get the S3 Bucket Name
- To find your S3 bucket name:
- Go to the S3 Console.
- Locate the name of your existing bucket.
- If creating a new bucket:
- Click Create Bucket.
- Provide a unique bucket name and select a region.
- Complete the other settings and create the bucket.
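The values gathered in these steps end up as a handful of settings. As a sketch, the helper below takes the JSON printed by aws sts get-session-token and maps it, together with a region and bucket name, onto config-style keys. The key names in the returned dictionary (aws_access_key_id and so on) are assumptions about config.conf, not something this README confirms. The Credentials fields are the ones the AWS CLI returns.

```python
import json


def aws_settings_from_sts(sts_json, region, bucket_name):
    """Map `aws sts get-session-token` output onto assumed config keys."""
    creds = json.loads(sts_json)["Credentials"]
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
        "aws_region": region,
        "aws_bucket_name": bucket_name,
    }


if __name__ == "__main__":
    # Placeholder output in the CLI's shape; none of these values are real.
    sample_output = json.dumps({
        "Credentials": {
            "AccessKeyId": "EXAMPLE_ACCESS_KEY_ID",
            "SecretAccessKey": "EXAMPLE_SECRET_ACCESS_KEY",
            "SessionToken": "EXAMPLE_SESSION_TOKEN",
            "Expiration": "2025-01-11T12:00:00+00:00",
        }
    })
    settings = aws_settings_from_sts(sample_output, "us-east-1", "example-bucket")
    for key, value in settings.items():
        print(f"{key} = {value}")
```

Temporary credentials expire (after 3600 seconds with the command shown), so these values need refreshing for longer runs.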
- Open localhost:8080 (the username and password are both admin).
- Open AWS S3 to see the output of the Airflow run.
- Open AWS Glue.
- Add Role:
Steps to Ensure the IAM Role is Available:
- Check IAM Role Creation:
  - Go to the IAM Management Console (https://console.aws.amazon.com/iam/).
  - Under Roles, search for AWSGlueServiceRole.
  - If it doesn't exist, create it by following the steps below.
- Create a New IAM Role for Glue:
  - Click Create Role.
  - Choose AWS Service -> Glue.
  - Select Glue use case and click Next.
  - Attach the managed policy AWSGlueServiceRole.
  - (Optional) Add additional permissions for S3 access if needed (AmazonS3FullAccess or fine-tuned permissions).
  - Name the role AWSGlueServiceRole and create it.
- Attach Required Policies: Ensure the following policies are attached to the role:
  - AWSGlueServiceRole
  - AmazonS3FullAccess (if working with S3)
  - CloudWatchLogsFullAccess (for logging)
- Refresh the available IAM roles list -> Choose.
- Edit the script of the Glue job; see some-scripts/reddit_glue_job.py.
- Open AWS Athena.
- Open Amazon Redshift.
- Click on Query Data, which redirects to sqlworkbench.
COPY dev.public.reddit_data_eng
FROM 's3://bdbao-redditdataengineering/transformed/run-1736567065233-part-r-00000'
IAM_ROLE 'arn:aws:iam::816069126714:role/service-role/AmazonRedshift-CommandsAccessRole-20250111T112632'
FORMAT AS CSV DELIMITER ',' QUOTE '"' IGNOREHEADER 1
REGION AS 'us-east-1';

SELECT * FROM "dev"."public"."reddit_data_eng";
SELECT * FROM "awsdatacatalog"."reddit_db"."transformed";
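Because the S3 object name and the role ARN change between runs and accounts, it can help to generate the COPY statement rather than edit it by hand. The function below is a hypothetical helper that only builds the SQL text with the same options as the statement above. It does not connect to Redshift, and the values in the usage example are placeholders, not real resources.

```python
def build_copy_statement(table, s3_uri, iam_role_arn, region):
    """Return a Redshift COPY statement for a CSV file with a header row."""

    def quote(value):
        # Double any single quotes so the value stays one SQL string literal.
        return "'" + value.replace("'", "''") + "'"

    return (
        f"COPY {table}\n"
        f"FROM {quote(s3_uri)}\n"
        f"IAM_ROLE {quote(iam_role_arn)}\n"
        "FORMAT AS CSV DELIMITER ',' QUOTE '\"' IGNOREHEADER 1\n"
        f"REGION AS {quote(region)};"
    )


if __name__ == "__main__":
    print(
        build_copy_statement(
            table="dev.public.reddit_data_eng",
            s3_uri="s3://example-bucket/transformed/example-part-file",
            iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
            region="us-east-1",
        )
    )
```

The table name is inserted as-is, so it should come from trusted configuration rather than user input.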


