Skip to content

midhun-murphy/data-warehouse

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

End-to-End Data Warehouse using AWS Redshift, S3, Glue & dbt

📌 Overview

This project demonstrates a complete end-to-end data engineering pipeline built using modern tools and best practices.

It implements a Medallion Architecture (Bronze → Silver → Gold) to transform raw data into analytics-ready datasets.


🚀 Tech Stack

  • Storage (Bronze): Amazon S3
  • Metadata & Crawling: AWS Glue Crawler & Data Catalog
  • Data Warehouse: Amazon Redshift (Spectrum + Internal Tables)
  • Transformation Tool: dbt (Data Build Tool)
  • Language: SQL
  • IDE: PyCharm
  • Version Control: Git & GitHub

🏗️ Architecture Overview

S3 (Raw Data)
   ↓
AWS Glue Crawler
   ↓
Glue Data Catalog
   ↓
Redshift Spectrum (External Tables)  ← Bronze Layer
   ↓
dbt (Staging Models)               ← Silver Layer
   ↓
dbt (Business Models)              ← Gold Layer

🥉 Bronze Layer (Raw Data Ingestion)

The Bronze layer stores raw, unprocessed data.

🔹 Components

  • Amazon S3 – stores raw data files (CSV/Parquet)
  • AWS Glue Crawler – scans S3 and detects schema
  • Glue Data Catalog – stores table metadata
  • Redshift Spectrum – queries S3 data using external tables

🔹 Workflow

  1. Raw data is uploaded to S3
  2. Glue Crawler scans and creates tables
  3. Tables are stored in Glue Data Catalog
  4. Redshift accesses data via external schema
  5. dbt reads these tables as sources

🔹 Example Query

SELECT * FROM spectrum_schema.customer_raw LIMIT 10;

🔹 Characteristics

  • Immutable (no updates)
  • Source of truth
  • Stored in raw format

🥈 Silver Layer (Staging & Cleaning)

The Silver layer cleans and standardizes data.

🔹 Transformations

  • Data type casting
  • Null handling
  • Deduplication
  • Basic joins

🔹 Example

SELECT
    customer_id,
    TRIM(customer_name) AS customer_name,
    email
FROM {{ source('bronze', 'customer_raw') }}

🥇 Gold Layer (Business Models)

The Gold layer provides analytics-ready tables.

🔹 Dimension Tables

  • dim_customer
  • dim_account
  • dim_channels
  • dim_currency
  • dim_dates
  • dim_location
  • dim_loans
  • dim_investment_type
  • dim_transaction_type

🔹 Fact Tables

  • fact_customer_interactions
  • fact_daily_balances
  • fact_investments

🔹 Purpose

  • Optimized for reporting
  • Used by BI tools
  • Supports business insights

🔗 Data Flow in dbt

  • Bronze → accessed using source()
  • Silver → referenced using ref()
  • Gold → final business models

✅ Data Quality Testing

Custom dbt tests included:

  • Duplicate checks
  • Null validation
  • Invalid data detection
  • Referential integrity

Example Tests

  • Duplicate account numbers
  • Invalid email format
  • Negative transaction amounts

⚙️ Setup Instructions

1️⃣ Clone Repository

git clone https://github.com/midhun-murphy/data-warehouse.git
cd data-warehouse

2️⃣ Install dbt

pip install dbt-redshift

3️⃣ Configure dbt Profile

Update profiles.yml with:

  • host
  • database
  • user
  • password
  • schema

4️⃣ Run Models

dbt run

5️⃣ Run Tests

dbt test

📈 Key Features

  • End-to-end pipeline (S3 → Glue → Redshift → dbt)
  • Medallion architecture implementation
  • Star schema modeling
  • Modular SQL transformations
  • Automated data quality testing

📌 Use Cases

  • Customer analytics
  • Financial data analysis
  • Investment tracking
  • Business intelligence dashboards

🧠 Learnings

  • Building scalable data pipelines
  • Working with AWS data ecosystem
  • Data modeling (fact & dimension design)
  • dbt best practices
  • Data quality validation

🔮 Future Improvements

  • Add Airflow orchestration
  • Implement incremental models
  • Partition S3 data (year/month/day)
  • Use Parquet for performance
  • Integrate BI tools (Power BI / Tableau)

About

End-to-end data warehouse using AWS S3, Glue, Redshift Spectrum, and dbt (Bronze–Silver–Gold architecture) with data quality testing.

Topics

Resources

Stars

Watchers

Forks

Contributors