This repo contains coursework and the final project for the Data Engineering Zoomcamp by DataTalks.Club (Cohort 2026).
The Data Engineering Zoomcamp is a free course covering the fundamentals of data engineering — from containerization and infrastructure-as-code to batch and stream processing. It is taught by Alexey Grigorev and the DataTalks.Club team.
| Area | Tools |
|---|---|
| Containerization | Docker, Docker Compose |
| Infrastructure as Code | Terraform (GCP & AWS) |
| Workflow Orchestration | Kestra |
| Data Ingestion | dlt (data load tool) |
| Data Warehouse | BigQuery, DuckDB |
| Cloud Storage | AWS S3 (Parquet, partitioned) |
| Analytics Engineering | dbt |
| Data Platforms | Bruin |
| Batch Processing | Apache Spark |
| Stream Processing | Apache Kafka, Apache Flink |
| Dashboard | Streamlit, Plotly |
| Language | Python, SQL |
All homework solutions for each module are in the home/ directory:
| Module | Topic | Homework |
|---|---|---|
| 1 | Docker & Terraform | |
| 2 | Workflow Orchestration (Kestra) | |
| 3 | Data Warehouse (BigQuery) | |
| 4 | Analytics Engineering (dbt) | |
| 5 | Data Platforms (Bruin) | |
| 6 | Batch Processing (Spark) | |
| 7 | Stream Processing (Kafka & Flink) | |
| Workshop 1 | Data Ingestion with dlt | |
An end-to-end data engineering pipeline analyzing ~168,000 global earthquake events (2020–2025) from the USGS Earthquake Hazards Program, with cloud infrastructure on AWS and an interactive Streamlit dashboard.
```
USGS API ──► CSV (Data Lake + S3) ──► DuckDB + S3 Parquet (Warehouse) ──► dbt ──► Streamlit
                                              ▲
                                    Terraform (AWS IaC)
```
- Cloud Infrastructure — Terraform provisions AWS S3 buckets (data lake + warehouse) with versioning & lifecycle policies
- Ingestion — Fetches earthquake data from the USGS REST API in quarterly chunks; uploads to S3 data lake
- Warehouse — Loads into DuckDB with sorted tables for zone-map optimization; exports partitioned Parquet to S3
- Transformations — dbt staging + fact table + 3 mart models; 15 schema tests (all passing)
- Dashboard — 4 interactive tiles: temporal trends, magnitude distribution, top regions, global earthquake map
- Orchestration — Makefile targets run the full pipeline (`make all`), provision infrastructure (`make infra-up`), and launch the dashboard (`make dashboard`)
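The quarterly-chunk ingestion above can be sketched in plain Python. The query-string shape (`format`, `starttime`, `endtime`) matches the public USGS FDSN event API, but the helper names, the `minmagnitude` filter, and the exact chunking are illustrative assumptions, not the project's actual pipeline code:

```python
from datetime import date

# Public USGS FDSN event endpoint (real URL; query parameters below are standard).
USGS_URL = "https://earthquake.usgs.gov/fdsnws/event/1/query"

def quarterly_windows(start_year: int, end_year: int):
    """Yield (start, end) date pairs, one per calendar quarter."""
    for year in range(start_year, end_year + 1):
        for q_start_month in (1, 4, 7, 10):
            start = date(year, q_start_month, 1)
            end_month = q_start_month + 3
            # Roll over into January of the next year after Q4.
            end = date(year + 1, 1, 1) if end_month > 12 else date(year, end_month, 1)
            yield start, end

def query_url(start: date, end: date) -> str:
    """Build a CSV query URL for one quarter (minmagnitude here is an assumption)."""
    return f"{USGS_URL}?format=csv&starttime={start}&endtime={end}&minmagnitude=2.5"

windows = list(quarterly_windows(2020, 2025))
print(len(windows))            # 24 quarters for 2020-2025
print(query_url(*windows[0]))
```

Fetching each URL (e.g. with `requests`) and uploading the CSV to the S3 data lake would complete the ingestion step.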
| Tile | Description |
|---|---|
| 📈 Earthquake Activity Over Time | Monthly count colored by avg magnitude |
| 📊 Magnitude Distribution | Pie + stacked bar (Minor → Great) |
| 🏔️ Top Active Regions | 20 most earthquake-prone regions |
| 🗺️ Global Earthquake Map | Interactive scatter geo with 5,000 sampled events |
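The Minor → Great buckets in the magnitude-distribution tile can be derived with a simple binning function. This is a sketch using one common USGS-style class scheme; the exact bin edges and labels in the project's dbt models may differ:

```python
# Upper bounds (exclusive) for each class; magnitudes >= 8.0 fall into "Great".
# These edges are an assumption, not taken from the project code.
MAGNITUDE_CLASSES = [
    (4.0, "Minor"),
    (5.0, "Light"),
    (6.0, "Moderate"),
    (7.0, "Strong"),
    (8.0, "Major"),
]

def magnitude_class(mag: float) -> str:
    """Map a magnitude value to its class label."""
    for upper, label in MAGNITUDE_CLASSES:
        if mag < upper:
            return label
    return "Great"

print(magnitude_class(3.2))  # Minor
print(magnitude_class(8.8))  # Great
```

Applying this per event and grouping by label yields the counts behind the pie and stacked-bar views.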
```bash
cd earthquake-analytics
pip install -r requirements.txt
make infra-up   # (optional) provision AWS S3
make all        # run full pipeline (ingest → load → transform)
make dashboard  # launch Streamlit dashboard
```
```
├── home/                        # Homework solutions for each module
│   ├── 01/
│   ├── 02/
│   ├── 03/
│   ├── 04/
│   ├── 05/
│   ├── 06/
│   ├── 07/
│   └── workshop/
├── earthquake-analytics/        # Final project
│   ├── pipeline/                # Ingestion & warehouse loading
│   ├── dbt_earthquake/          # dbt models & tests
│   ├── dashboard/               # Streamlit app
│   ├── terraform/               # AWS S3 infrastructure (IaC)
│   └── Makefile                 # Pipeline orchestration
├── 01-docker-terraform/         # Course material (modules 1-7)
├── 02-workflow-orchestration/
├── ...
└── README.md
```
Thanks to Alexey Grigorev and the entire DataTalks.Club team for offering this incredible course for free. Special thanks to all the instructors and the community on Slack for their support throughout the cohort.