Welcome to the Medallion Architect Challenge Project! This project implements a production-grade Medallion Architecture using PySpark, Airflow, and MinIO. It transforms the Brazilian E-commerce Public Dataset by Olist into a high-performance, analytics-ready Data Platform with Hybrid Cloud capabilities.
The pipeline implements an end-to-end flow with rigorous observability and data quality checks:
graph TD
subgraph "Source Systems"
S1[(Production DB)] --> |JDBC Incremental| B
S2[Holiday API] --> |REST| G
S3[Web App] --> |JSON Events| K[Kafka Broker]
end
subgraph "Data Platform (Local/Docker)"
B(Bronze: Raw Parquet) --> C(Silver: Cleansed & Deduplicated)
K --> |Spark Streaming| C
C --> D(Gold: Star Schema & SCD Type 2)
subgraph "Observability"
Audit[(Audit DB)] <--> |Logging| B
Audit <--> |Logging| C
Audit <--> |Logging| D
D --> |Drift Check| Audit
K -.-> UI[Kafka-UI]
end
end
subgraph "Cloud Analytics"
D --> |Export| BQ[(Google BigQuery)]
end
subgraph "Orchestration & Alerts"
G[Apache Airflow] -.-> B
G -.-> C
G -.-> D
G -.-> BQ
G --> |Webhook| DS[Discord Notifications]
end
- π₯ Bronze (Raw): Captures the original data from source files "as-is" into MinIO.
- π₯ Silver (Cleansed): Deduplication, null handling, precision data typing, and Final-State Schema Enforcement.
- π₯ Gold (Curated): Business-level aggregates, Star Schema (Facts & Dimensions), and Feature Tables with Modeled Schema Validation.
The project implements a Star Schema in the Gold layer, optimized for analytical queries.
Note: The schema includes core dimensions (Customers, Products, Sellers, Location, Date) and a central Fact table (Orders), along with specialized Feature tables for behavioral analysis.
- Observability & Governance:
- Audit Logging: Every pipeline run is logged to a PostgreSQL
audit.run_logstable (timing, row counts, status). - Data Quality (DQ): Automated Multi-Layered Schema Validation (Bronze, Silver, Gold), PK uniqueness checks, null value monitoring, and Row-Count Drift Detection.
- Alerting: Real-time Discord notifications for pipeline success and failures.
- Audit Logging: Every pipeline run is logged to a PostgreSQL
- Advanced Data Modeling:
- SCD Type 2: History tracking for Customers, Products, and Sellers.
- Point-in-Time Join Logic: Fact tables use range joins to link to the corresponding version of historical dimensions.
- Star Schema: Optimization for BI tool performance.
- Feature Engineering: Pre-computed RFM (Recency, Frequency, Monetary) tables.
- Real-World Reliability:
- Incremental Loading (CDC): Watermark-based ingestion fetching only new/updated records to minimize I/O and compute overhead.
- Architectural Trade-off: This pipeline prioritizes performance and scalability by avoiding "Full Scan" hard-delete detection. In a production environment, hard-deletes are typically propagated via Log-Based CDC (e.g., Debezium) or Soft Deletes (
is_deletedflags) to ensure the Medallion layers stay synchronized without the O(N) cost of full-table comparisons.
- Architectural Trade-off: This pipeline prioritizes performance and scalability by avoiding "Full Scan" hard-delete detection. In a production environment, hard-deletes are typically propagated via Log-Based CDC (e.g., Debezium) or Soft Deletes (
- Idempotency: All jobs are designed to be safe for re-execution.
- Parallel Execution: Multi-threaded Spark jobs for maximized throughput.
- Incremental Loading (CDC): Watermark-based ingestion fetching only new/updated records to minimize I/O and compute overhead.
- Real-Time Streaming Layer (Lambda Architecture):
- Kafka-based Ingestion: Real-time events (e.g., Product Reviews) streamed via Kafka with a custom Self-Managing Lifecycle.
- Coordinated Shutdown: Uses a signal-file mechanism (
stop_simulation.signal) to synchronize the Python Producer and Spark Consumer, ensuring the consumer only stops once the simulation is complete. - Fault-Tolerant Monitoring: Implements a Patience Timeout logic that prevents the streaming consumer from hanging if records are missed (e.g., when using
latestoffsets). - Structured Streaming: Spark-native streaming job that cleanses and merges live events into the Silver tier with full audit logging support.
- Unified Analytical View: The Gold layer automatically unions historical batch data with real-time streams for up-to-the-minute analysis.
For a detailed technical walkthrough of each phase (ideal for interview preparation), explore the following documentation:
| Phase | Focus Area | Detailed Documentation |
|---|---|---|
| Phase 1 | Source & Governance | Source Simulation & Audit Logic |
| Phase 2 | Bronze Ingestion | Incremental Watermarking & Parallelism |
| Phase 3 | Silver Transformation | Precision Tuning & CDC Upserts |
| Phase 4 | Gold Modeling | Star Schema & SCD Type 2 |
| Phase 5 | Streaming Lambda | Real-Time Lifecycle & Coordination |
| Phase 6 | Cloud & Observability | BigQuery Sync & Discord Alerting |
- Processing: Apache Spark 4.0 (PySpark)
- Orchestration: Apache Airflow
- Storage: MinIO (S3-compatible) & Google BigQuery
- Audit/Metadata: PostgreSQL 17
- Infrastructure: Docker & Docker Compose
- Alerts: Discord Webhooks
.
βββ airflow/ # Airflow configuration & DAGs
β βββ dags/ # medallion_pipeline.py (The main DAG)
β βββ Dockerfile # Custom Airflow image with Spark support
βββ spark/ # Spark Jobs & Configuration
β βββ jobs/ # bronze, silver, and gold scripts
β βββ extra_jars/ # JDBC & S3 Connectors
β βββ config.py # Centralized configuration & DQ logic
βββ warehouse/ # SQL Initialization & Analytics
β βββ init_multiple_dbs.sql # Multi-DB setup (Audit)
β βββ visualization_queries.sql # Performance & RFM queries
βββ raw/ # Kaggle source datasets
βββ docker-compose.yaml # Full stack orchestration
βββ README.md # You are here!
- Docker Desktop with at least 4GB of RAM allocated.
- A GCP Service Account Key (
application_default_credentials.json) for BigQuery export.
Since the Spark JARs are excluded via .gitignore, you must download them manually and place them in the spark/extra_jars/ directory:
| JAR File | Version | Download Link |
|---|---|---|
| PostgreSQL JDBC | 42.7.4 | Download |
| Hadoop AWS | 3.4.1 | Download |
| AWS SDK Bundle | 2.23.19 | Download |
| Spark-SQL Kafka | 4.0.2 | Download |
| Kafka Clients | 3.9.0 | Download |
| Spark Token Kafka | 4.0.2 | Download |
| Commons Pool 2 | 2.12.0 | Download |
docker-compose up -d| Service | URL | User | Password |
|---|---|---|---|
| Airflow UI | http://localhost:8088 | admin |
admin |
| MinIO Console | http://localhost:9001 | admin |
password |
| Spark Master | http://localhost:8080 | - | - |
| Kafka UI | http://localhost:8090 | - | - |
The pipeline is designed to simulate a real-world enterprise environment where data originates from a production database and flows into a hybrid cloud analytical platform.
Before running the main pipeline, you must simulate the production environment:
- Locate the DAG
setup_source_dbin the Airflow UI. - Trigger it to run
seed_source.py. This script:- Creates the
olistschema in the source PostgreSQL container. - Seeds it with raw Kaggle data.
- Adds
updated_atwatermarks to enable CDC (Change Data Capture) simulation.
- Creates the
Once the source is seeded, trigger the medallion_pipeline DAG. The sequence is:
fetch_holidays: Pulls external API data fordim_dateenrichment.bronze_ingestion: Performs incremental JDBC fetching from the Source DB using watermarks.silver_processing: Executes deduplication, precision casting, and Final-State Silver Schema Enforcement.gold_modeling: Builds the Fact/Dimension/Feature tables with SCD Type 2 logic and Star-Schema Structural Validation.export_to_bigquery: Synchronizes the final Gold layer to Google BigQuery.
The platform features a real-time review stream that can run concurrently with the batch pipeline or as a standalone simulation.
- Locate the DAG
streaming_pipelinein the Airflow UI. - Trigger it. The workflow includes:
reset_streaming_data: Wipes previous simulation state in MinIO and S3 checkpoints.init_kafka_topic: Ensures theolist_reviews_streamtopic exists.review_consumer_task: Launches the Structured Streaming consumer (kafka_consumer_reviews.py) which ingests Kafka events into the Silver tier.review_producer_task: Executes the Producer Simulation (kafka_producer_reviews.py) sending a batch of Olist reviews to Kafka.
- Graceful Shutdown: The consumer automatically detects when the producer is finished via a signal file and shuts itself down once the target row count is met.
This project focuses on measurable "Senior-level" KPIs across performance, cost, and reliability:
| Metric Category | Key Result | Technical Impact |
|---|---|---|
| Pipeline Performance | 22% Runtime Reduction | End-to-end duration dropped from 04:27 to 03:28 via incremental loading and multi-threading. |
| Data Throughput | ~300% Concurrency Gain | Used ThreadPoolExecutor to process 8+ tables in parallel stages instead of sequential loops. |
| Cloud Cost Savings | ~80% Optimization | Reduced BigQuery storage fees by keeping Bronze/Silver layers in on-prem MinIO (S3) storage. |
| Storage Efficiency | 10x Compression | Achieved significant footprint reduction via Parquet format and precise schema typing (Short/Decimal). |
| Data Integrity | Zero Structural Drift | Implemented Fail-Fast Schema Validation at every layer (Bronze/Silver/Gold) to prevent corrupted data propagation. |
| Observability | <5s Detection (MTTD) | Automated industry-standard alerting via Discord webhooks and row-count drift detection. |
You may notice several warnings in the Spark logs. Most of these are expected in a containerized S3 environment:
jdk.incubator.vector: Spark 4.0 uses modern JVM optimizations. Safe to ignore.NativeCodeLoader: Spark falls back to Java-native compression if C++ libraries aren't found in the container. No impact on correctness.S3ABlockOutputStream (Syncable API): S3 is an object store. Gracefully handled viadowngrade.syncable.exceptions.
Architected for the Medallion Challenge Data Platform.
