End-to-End Data Warehouse using AWS Redshift, S3, Glue & dbt

📌 Overview

This project demonstrates a complete end-to-end data engineering pipeline built using modern tools and best practices.

It implements a Medallion Architecture (Bronze → Silver → Gold) to transform raw data into analytics-ready datasets.

🚀 Tech Stack

Storage (Bronze): Amazon S3
Metadata & Crawling: AWS Glue Crawler & Data Catalog
Data Warehouse: Amazon Redshift (Spectrum + Internal Tables)
Transformation Tool: dbt (Data Build Tool)
Language: SQL
IDE: PyCharm
Version Control: Git & GitHub

🏗️ Architecture Overview

S3 (Raw Data)
   ↓
AWS Glue Crawler
   ↓
Glue Data Catalog
   ↓
Redshift Spectrum (External Tables)  ← Bronze Layer
   ↓
dbt (Staging Models)               ← Silver Layer
   ↓
dbt (Business Models)              ← Gold Layer

🥉 Bronze Layer (Raw Data Ingestion)

The Bronze layer stores raw, unprocessed data.

🔹 Components

Amazon S3 – stores raw data files (CSV/Parquet)
AWS Glue Crawler – scans S3 and detects schema
Glue Data Catalog – stores table metadata
Redshift Spectrum – queries S3 data using external tables

🔹 Workflow

Raw data is uploaded to S3
Glue Crawler scans and creates tables
Tables are stored in Glue Data Catalog
Redshift accesses data via external schema
dbt reads these tables as sources

🔹 Example Query

SELECT * FROM spectrum_schema.customer_raw LIMIT 10;

🔹 Characteristics

Immutable (no updates)
Source of truth
Stored in raw format

🥈 Silver Layer (Staging & Cleaning)

The Silver layer cleans and standardizes data.

🔹 Transformations

Data type casting
Null handling
Deduplication
Basic joins

🔹 Example

SELECT
    customer_id,
    TRIM(customer_name) AS customer_name,
    email
FROM {{ source('bronze', 'customer_raw') }}

🥇 Gold Layer (Business Models)

The Gold layer provides analytics-ready tables.

🔹 Dimension Tables

dim_customer
dim_account
dim_channels
dim_currency
dim_dates
dim_location
dim_loans
dim_investment_type
dim_transaction_type

🔹 Fact Tables

fact_customer_interactions
fact_daily_balances
fact_investments

🔹 Purpose

Optimized for reporting
Used by BI tools
Supports business insights

🔗 Data Flow in dbt

Bronze → accessed using source()
Silver → referenced using ref()
Gold → final business models

✅ Data Quality Testing

Custom dbt tests included:

Duplicate checks
Null validation
Invalid data detection
Referential integrity

Example Tests

Duplicate account numbers
Invalid email format
Negative transaction amounts

⚙️ Setup Instructions

1️⃣ Clone Repository

git clone https://github.com/midhun-murphy/data-warehouse.git
cd data-warehouse

2️⃣ Install dbt

pip install dbt-redshift

3️⃣ Configure dbt Profile

Update profiles.yml with:

host
database
user
password
schema

4️⃣ Run Models

dbt run

5️⃣ Run Tests

dbt test

📈 Key Features

End-to-end pipeline (S3 → Glue → Redshift → dbt)
Medallion architecture implementation
Star schema modeling
Modular SQL transformations
Automated data quality testing

📌 Use Cases

Customer analytics
Financial data analysis
Investment tracking
Business intelligence dashboards

🧠 Learnings

Building scalable data pipelines
Working with AWS data ecosystem
Data modeling (fact & dimension design)
dbt best practices
Data quality validation

🔮 Future Improvements

Add Airflow orchestration
Implement incremental models
Partition S3 data (year/month/day)
Use Parquet for performance
Integrate BI tools (Power BI / Tableau)

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
analyses		analyses
assets		assets
macros		macros
models		models
seeds		seeds
snapshots		snapshots
tests		tests
.gitignore		.gitignore
README.md		README.md
dbt_project.yml		dbt_project.yml
er_diagram.png		er_diagram.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

End-to-End Data Warehouse using AWS Redshift, S3, Glue & dbt

📌 Overview

🚀 Tech Stack

🏗️ Architecture Overview

🥉 Bronze Layer (Raw Data Ingestion)

🔹 Components

🔹 Workflow

🔹 Example Query

🔹 Characteristics

🥈 Silver Layer (Staging & Cleaning)

🔹 Transformations

🔹 Example

🥇 Gold Layer (Business Models)

🔹 Dimension Tables

🔹 Fact Tables

🔹 Purpose

🔗 Data Flow in dbt

✅ Data Quality Testing

Example Tests

⚙️ Setup Instructions

1️⃣ Clone Repository

2️⃣ Install dbt

3️⃣ Configure dbt Profile

4️⃣ Run Models

5️⃣ Run Tests

📈 Key Features

📌 Use Cases

🧠 Learnings

🔮 Future Improvements

About

Uh oh!

Contributors

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

End-to-End Data Warehouse using AWS Redshift, S3, Glue & dbt

📌 Overview

🚀 Tech Stack

🏗️ Architecture Overview

🥉 Bronze Layer (Raw Data Ingestion)

🔹 Components

🔹 Workflow

🔹 Example Query

🔹 Characteristics

🥈 Silver Layer (Staging & Cleaning)

🔹 Transformations

🔹 Example

🥇 Gold Layer (Business Models)

🔹 Dimension Tables

🔹 Fact Tables

🔹 Purpose

🔗 Data Flow in dbt

✅ Data Quality Testing

Example Tests

⚙️ Setup Instructions

1️⃣ Clone Repository

2️⃣ Install dbt

3️⃣ Configure dbt Profile

4️⃣ Run Models

5️⃣ Run Tests

📈 Key Features

📌 Use Cases

🧠 Learnings

🔮 Future Improvements

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!