Real-time data platform built around Kafka, AWS Lambda, Kubernetes consumers, Airflow, and Snowflake. The repo models an event-driven architecture where streaming ingestion, near-real-time processing, and scheduled analytics workflows all coexist in the same system.
This project works as an architecture portfolio piece because it spans several domains at once: streaming ingestion, serverless processing, Kubernetes-based consumers, batch orchestration, and analytics warehousing.
- Kafka/MSK infrastructure and supporting Terraform
- Python event producers
- Kubernetes consumer workloads
- Lambda-style event transformation layer
- Airflow DAGs for downstream batch ETL
- Snowflake-oriented analytics pipeline
- monitoring configuration for throughput and consumer lag
- Producers emit application events into Kafka.
- Streaming consumers running on Kubernetes read and process events from Kafka.
- Lambda/EventBridge components handle transformation and routing tasks.
- Airflow runs scheduled aggregation workflows.
- Curated data is loaded into Snowflake for analytics and reporting.
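The transformation and routing step in the flow above could look roughly like the sketch below. It is an illustration only: the event fields (`id`, `type`, `user_id`, `ts_ms`), the function names, and the routing targets are assumptions, not the schema or code used in this repo.

```python
import json
from datetime import datetime, timezone


def transform_event(raw: bytes) -> dict:
    """Normalize one raw JSON event into a flat record (hypothetical schema)."""
    event = json.loads(raw)
    return {
        "event_id": str(event["id"]),
        "event_type": event.get("type", "unknown").lower(),
        "user_id": event.get("user_id"),
        # Convert epoch milliseconds to an ISO-8601 UTC timestamp.
        "occurred_at": datetime.fromtimestamp(
            event["ts_ms"] / 1000, tz=timezone.utc
        ).isoformat(),
    }


def route(record: dict) -> str:
    """Pick a hypothetical downstream target based on the event type."""
    return "orders" if record["event_type"].startswith("order") else "general"
```

Keeping the transform a pure function of the message bytes means the same logic can run inside a Lambda handler or a Kubernetes consumer and be unit-tested without a broker.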
# Provision Kafka cluster
cd terraform/modules/msk && terraform init && terraform apply
# Start event producer
python producers/src/producer.py --rate 500
# Deploy consumers
kubectl apply -f k8s/consumers/
# Trigger batch ETL
airflow dags trigger daily_events_to_snowflake
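For context on the `--rate` flag, here is a minimal sketch of a rate-limited producer loop. It is not the contents of `producers/src/producer.py`: the event shape and the `produce` and `make_event` names are hypothetical, and `send` stands in for whatever Kafka client call the real producer uses.

```python
import json
import time
import uuid
from typing import Callable


def make_event(seq: int) -> bytes:
    """Build one JSON-encoded event (hypothetical shape)."""
    payload = {
        "id": str(uuid.uuid4()),
        "seq": seq,
        "ts_ms": int(time.time() * 1000),
    }
    return json.dumps(payload).encode("utf-8")


def produce(
    send: Callable[[bytes], None],
    rate: float,
    count: int,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Send `count` events at roughly `rate` events per second."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    interval = 1.0 / rate
    next_at = clock()
    for seq in range(count):
        # Sleep only for the time remaining until the next scheduled send,
        # so slow sends do not accumulate drift.
        delay = next_at - clock()
        if delay > 0:
            sleep(delay)
        send(make_event(seq))
        next_at += interval
    return count
```

Passing `send`, `sleep`, and `clock` as parameters keeps the pacing logic testable without a running broker.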
|-- airflow/ # DAG definitions
|-- consumers/ # streaming consumer code
|-- docker/ # container images
|-- k8s/ # Kubernetes deployment manifests
|-- monitoring/ # Prometheus and Grafana configuration
|-- producers/ # event producer code
|-- terraform/ # MSK and Lambda infrastructure
|-- docs/ # diagrams and supporting documentation
`-- .github/ # validation workflows
- event-driven system design with both streaming and batch layers
- Kafka-based ingestion paired with Kubernetes-scale consumers
- integration of serverless components into a broader data platform
- operational thinking around monitoring, lag, and pipeline reliability
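Consumer lag, mentioned above, is conventionally the gap between a partition's log end offset and the consumer group's last committed offset. The sketch below shows that arithmetic with plain dictionaries. The function names are hypothetical, and treating a partition with no committed offset as offset 0 is an assumption; the actual behavior depends on the consumer's offset reset policy.

```python
from typing import Dict


def consumer_lag(
    end_offsets: Dict[int, int], committed: Dict[int, int]
) -> Dict[int, int]:
    """Per-partition lag: log end offset minus last committed offset."""
    return {
        partition: max(end - committed.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    }


def total_lag(lag: Dict[int, int]) -> int:
    """Sum lag across partitions, e.g. for a single alerting threshold."""
    return sum(lag.values())
```

For example, with end offsets `{0: 120, 1: 80}` and committed offsets `{0: 100, 1: 80}`, partition 0 is 20 messages behind and partition 1 is caught up.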
