This project demonstrates an end-to-end data engineering pipeline on AWS: raw files land in Amazon S3, are cataloged by AWS Glue, queried through Redshift Spectrum, and transformed with dbt.
It implements a Medallion Architecture (Bronze → Silver → Gold) to transform raw data into analytics-ready datasets.
- Storage (Bronze): Amazon S3
- Metadata & Crawling: AWS Glue Crawler & Data Catalog
- Data Warehouse: Amazon Redshift (Spectrum + Internal Tables)
- Transformation Tool: dbt (Data Build Tool)
- Language: SQL
- IDE: PyCharm
- Version Control: Git & GitHub
S3 (Raw Data)
↓
AWS Glue Crawler
↓
Glue Data Catalog
↓
Redshift Spectrum (External Tables) ← Bronze Layer
↓
dbt (Staging Models) ← Silver Layer
↓
dbt (Business Models) ← Gold Layer
The Bronze layer stores raw, unprocessed data.
- Amazon S3 – stores raw data files (CSV/Parquet)
- AWS Glue Crawler – scans S3 and detects schema
- Glue Data Catalog – stores table metadata
- Redshift Spectrum – queries S3 data using external tables
- Raw data is uploaded to S3
- Glue Crawler scans and creates tables
- Tables are stored in Glue Data Catalog
- Redshift accesses data via external schema
- dbt reads these tables as sources
SELECT * FROM spectrum_schema.customer_raw LIMIT 10;

- Immutable (no updates)
- Source of truth
- Stored in raw format
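
For reference, the external schema that lets Redshift see the Glue Data Catalog tables could be created roughly as follows. This is a minimal sketch, not the project's actual DDL: the catalog database name `bronze_db`, the region, and the IAM role ARN are placeholders, and only the schema name `spectrum_schema` comes from the query above.

```sql
-- Sketch only: 'bronze_db', the region and the role ARN are assumed placeholders.
-- The role must be attached to the Redshift cluster and allowed to read
-- the S3 bucket and the Glue Data Catalog.
CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_schema
FROM DATA CATALOG
DATABASE 'bronze_db'
REGION 'us-east-1'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole';
```

Once the schema exists, every table the crawler registers in that catalog database becomes queryable as `spectrum_schema.<table_name>` without loading data into Redshift.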
The Silver layer cleans and standardizes data.
- Data type casting
- Null handling
- Deduplication
- Basic joins
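
The casting, null handling, and deduplication steps listed above might be combined in a single staging model like the sketch below. The model name `stg_customer` and the `updated_at` column used to pick the latest record are assumptions for illustration; the source may not contain such a column, in which case another ordering column would be needed.

```sql
-- Hypothetical staging model, e.g. models/staging/stg_customer.sql
WITH typed AS (
    SELECT
        CAST(customer_id AS INTEGER)    AS customer_id,    -- data type casting
        TRIM(customer_name)             AS customer_name,
        NULLIF(LOWER(TRIM(email)), '')  AS email,          -- empty strings become NULL
        CAST(updated_at AS TIMESTAMP)   AS updated_at      -- assumed column
    FROM {{ source('bronze', 'customer_raw') }}
    WHERE customer_id IS NOT NULL                          -- drop rows without a key
),

ranked AS (
    SELECT
        customer_id,
        customer_name,
        email,
        -- number the rows per customer, newest first
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY updated_at DESC
        ) AS row_num
    FROM typed
)

-- deduplication: keep only the newest row per customer
SELECT
    customer_id,
    customer_name,
    email
FROM ranked
WHERE row_num = 1
```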
SELECT
    customer_id,
    TRIM(customer_name) AS customer_name,
    email
FROM {{ source('bronze', 'customer_raw') }}

The Gold layer provides analytics-ready tables.
- dim_customer
- dim_account
- dim_channels
- dim_currency
- dim_dates
- dim_location
- dim_loans
- dim_investment_type
- dim_transaction_type
- fact_customer_interactions
- fact_daily_balances
- fact_investments
- Optimized for reporting
- Used by BI tools
- Supports business insights
- Bronze → accessed using source()
- Silver → referenced using ref()
- Gold → final business models
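
To illustrate how the layers chain together, a Gold model selects from a Silver model through `ref()`, which lets dbt infer the build order. The sketch below assumes a Silver model named `stg_customer` exists; that name and the column list are illustrative rather than taken from the repository.

```sql
-- Hypothetical Gold model, e.g. models/marts/dim_customer.sql
-- ref() points at the Silver model, so dbt builds stg_customer first.
SELECT
    customer_id,
    customer_name,
    email
FROM {{ ref('stg_customer') }}
```

Because the dependency is declared through `ref()` rather than a hard-coded schema name, the same model runs unchanged against whichever target schema is configured in the profile.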
Custom dbt tests included:
- Duplicate checks
- Null validation
- Invalid data detection
- Referential integrity
- Duplicate account numbers
- Invalid email format
- Negative transaction amounts
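
In dbt, a custom (singular) test is a SQL file under `tests/` that returns the rows violating a rule; the test passes when the query returns no rows. The two sketches below show how the duplicate-account and negative-amount checks might be written. The column names `account_number`, `transaction_id`, and `amount`, and the model name `stg_transactions`, are assumptions; only `dim_account` appears in the model list above.

```sql
-- Hypothetical file: tests/assert_unique_account_number.sql
-- Returns one row per account number that occurs more than once.
SELECT
    account_number,
    COUNT(*) AS occurrences
FROM {{ ref('dim_account') }}
GROUP BY account_number
HAVING COUNT(*) > 1
```

```sql
-- Hypothetical file: tests/assert_no_negative_transaction_amounts.sql
-- Returns every transaction whose amount is below zero.
SELECT
    transaction_id,
    amount
FROM {{ ref('stg_transactions') }}
WHERE amount < 0
```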
git clone https://github.com/midhun-murphy/data-warehouse.git
cd data-warehouse
pip install dbt-redshift

Update profiles.yml with:
- host
- database
- user
- password
- schema
dbt run
dbt test

- End-to-end pipeline (S3 → Glue → Redshift → dbt)
- Medallion architecture implementation
- Star schema modeling
- Modular SQL transformations
- Automated data quality testing
- Customer analytics
- Financial data analysis
- Investment tracking
- Business intelligence dashboards
- Building scalable data pipelines
- Working with AWS data ecosystem
- Data modeling (fact & dimension design)
- dbt best practices
- Data quality validation
- Add Airflow orchestration
- Implement incremental models
- Partition S3 data (year/month/day)
- Use Parquet for performance
- Integrate BI tools (Power BI / Tableau)
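
As a rough idea of what the incremental-model enhancement could look like, the sketch below uses dbt's `incremental` materialization with the `is_incremental()` macro. The Silver model `stg_daily_balances` and the columns `account_id`, `balance_date`, and `closing_balance` are hypothetical; only the target name `fact_daily_balances` comes from the model list.

```sql
-- Hypothetical incremental version of models/marts/fact_daily_balances.sql
{{ config(materialized='incremental') }}

SELECT
    account_id,
    balance_date,
    closing_balance
FROM {{ ref('stg_daily_balances') }}

{% if is_incremental() %}
-- On incremental runs, load only days newer than what the table already holds.
WHERE balance_date > (SELECT MAX(balance_date) FROM {{ this }})
{% endif %}
```

On the first run, or with `dbt run --full-refresh`, the filter is skipped and the table is rebuilt from scratch; later runs add only the new rows.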