
DataForge

Docker-first data engineering stack: Spark, Hadoop (HDFS), Kafka + Zookeeper, Airflow, Postgres, MongoDB, and Jupyter, all wired together with Docker Compose.

⸻

✨ What you get

  • Spark master/worker with web UIs
  • HDFS NameNode/DataNode
  • Kafka + Zookeeper
  • Airflow (LocalExecutor) pre-wired to Postgres & Spark
  • Postgres + MongoDB with persistent named volumes
  • Jupyter (PySpark-ready)

Spark uses the official Bitnami image. Airflow and Jupyter are custom images hosted on Docker Hub under ashutro.

⸻

📦 Docker Hub Images

  • ashutro/dataforge-airflow:1.0 (custom Airflow image)
  • ashutro/dataforge-jupyter:1.0 (custom Jupyter image)
  • bitnami/spark (official Spark image)

These images extend official bases:

  • ashutro/dataforge-airflow → based on apache/airflow:2.9.2 with extra JDK and Python dependencies.
  • ashutro/dataforge-jupyter → based on jupyter/base-notebook (with PySpark support).

This ensures compatibility with official distributions while adding project-specific tooling.

⸻

🧰 Prerequisites

  • Docker (Docker Desktop on macOS/Windows; Docker Engine on Linux)
  • Docker Compose v2 (bundled with Docker Desktop; docker compose version should work)
  • Git (for cloning the repo)
  • (Optional) Docker Hub account if you plan to push your own images

⸻

🚀 Quick Start (local)

1) Clone

git clone https://github.com/ashutro/dataforge.git
cd dataforge

2) Create your env file

cp .env.example .env

Edit .env with your credentials/secrets

3) (If not pulling from Docker Hub) Build custom images locally

Only needed if you haven't pushed your images yet

docker build -t ashutro/dataforge-airflow:1.0 ./airflow
docker build -t ashutro/dataforge-jupyter:1.0 ./jupyter

4) Start the stack

docker compose up -d

5) Check containers

docker compose ps

⸻

🌐 Access URLs (localhost)

  • Airflow Web UI → http://localhost:8082 (User/Pass: ${AIRFLOW_ADMIN_USER} / ${AIRFLOW_ADMIN_PASSWORD} from .env); a quick health check is shown below
  • Spark Master UI → http://localhost:8080
  • Spark Worker UI → http://localhost:8081
  • HDFS NameNode UI → http://localhost:9870
  • Kafka Broker → localhost:9092 (for host tools on macOS; see notes for Linux/Cloud)
  • Postgres → localhost:5432 (user: ${POSTGRES_USER}, db: ${POSTGRES_DB})
  • MongoDB → localhost:27017 (user: ${MONGO_INITDB_ROOT_USERNAME})
  • Jupyter → http://localhost:8888; first login requires a token printed in the logs: docker logs jupyter | grep -m1 token=
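If you want a quick scripted check that the Airflow webserver is up before opening a browser, something like this should work (a minimal sketch, assuming the port mapping above and that the webserver's standard /health endpoint is reachable without auth):

curl -fsS http://localhost:8082/health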

⸻

⚙️ Environment Variables

Copy .env.example to .env and edit values:

Postgres

POSTGRES_USER=your_user
POSTGRES_PASSWORD=your_password
POSTGRES_DB=dataforge

Mongo

MONGO_INITDB_ROOT_USERNAME=your_mongo_root
MONGO_INITDB_ROOT_PASSWORD=your_mongo_password

Airflow admin

AIRFLOW_ADMIN_USER=admin
AIRFLOW_ADMIN_PASSWORD=admin
AIRFLOW_ADMIN_FIRSTNAME=Ashutosh
AIRFLOW_ADMIN_LASTNAME=Kumar
AIRFLOW_ADMIN_EMAIL=admin@example.com

docker-compose.yml uses these variables for container env and Airflow user creation.
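To confirm the values from .env are actually picked up before starting anything, you can render the resolved Compose configuration and grep for a variable you just set (a quick sanity check; adjust the pattern to whichever variable you care about):

docker compose config | grep -i postgres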

⸻

🏗 Building & Publishing Images (optional)

You only need to build/push Airflow and Jupyter (Spark uses the official Bitnami image):

Build

docker build -t ashutro/dataforge-airflow:1.0 ./airflow
docker build -t ashutro/dataforge-jupyter:1.0 ./jupyter

Login & push

docker login
docker push ashutro/dataforge-airflow:1.0
docker push ashutro/dataforge-jupyter:1.0

Once published, anyone can run the stack without building locally.

⸻

📦 Using the Docker Hub Images

The Airflow and Jupyter images are custom builds hosted on Docker Hub under the username ashutro, while Spark runs straight from the official Bitnami image. You can extend these official bases or use the images directly in your own projects.

🔽 Pulling the images directly:

docker pull ashutro/dataforge-airflow:1.0
docker pull ashutro/dataforge-jupyter:1.0
docker pull bitnami/spark:3.5.0

⸻

🧩 Service-to-Service Connection Details

  • Spark Master URL (inside containers): spark://spark-master:7077
  • Kafka bootstrap (inside containers): kafka:9092
  • Kafka bootstrap (host tools on macOS/Windows): host.docker.internal:9092 (as configured); for Linux or Cloud, see the Kafka notes below
  • Postgres DSN: postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@postgres:5432/${POSTGRES_DB}
  • Mongo URI: mongodb://${MONGO_INITDB_ROOT_USERNAME}:${MONGO_INITDB_ROOT_PASSWORD}@mongo:27017
  • Airflow is configured with AIRFLOW_CONN_SPARK_DEFAULT=spark://spark-master:7077

A quick way to exercise these endpoints from inside the containers is shown below.
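For example (a minimal sketch; it assumes the Compose services are named airflow and postgres as elsewhere in this README, and that the Airflow CLI resolves the connection created from AIRFLOW_CONN_SPARK_DEFAULT):

docker compose exec airflow airflow connections get spark_default
docker compose exec postgres sh -c 'psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -c "SELECT 1;"'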

⸻

📈 Scaling Spark Workers

docker compose up -d --scale spark-worker=3

View workers in the Spark Master UI (http://localhost:8080).
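You can also list the scaled worker containers from the CLI (assuming the worker service is named spark-worker, as in the scale command above):

docker compose ps spark-worker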

⸻

☁️ Run on a Cloud VM (Ubuntu example)

1. Create a VM (Ubuntu 22.04). Open the security group/firewall for the ports you need (or use the SSH tunnels below).

2. SSH to the VM and install Docker & Compose:

curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
newgrp docker

3.	Clone and configure:

git clone https://github.com/ashutro/dataforge.git
cd dataforge
cp .env.example .env

edit .env

4.	(Optional) Login & pull your published images:

docker login
docker compose pull

5.	Start:

docker compose up -d
docker compose ps

6.	Access UIs securely using SSH tunnels (recommended):

On your laptop

ssh -L 8082:localhost:8082 -L 8888:localhost:8888 -L 8080:localhost:8080 \
    -L 9870:localhost:9870 -L 9092:localhost:9092 user@your-vm-ip

On Windows PowerShell (using OpenSSH client)

ssh.exe -L 8082:localhost:8082 -L 8888:localhost:8888 -L 8080:localhost:8080 `
    -L 9870:localhost:9870 -L 9092:localhost:9092 user@your-vm-ip

On Windows with PuTTY

You can also use PuTTY if you prefer a GUI:

  • Go to Connection > SSH > Tunnels
  • Add the same local ports (8082, 8888, 8080, 9870, 9092) mapped to localhost:PORT
  • Save and open the session to establish tunnels.

If you must expose ports publicly, ensure proper firewall rules and credentials.
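For example, with ufw on Ubuntu you could allow SSH plus the Airflow UI from a single trusted address (a hedged sketch; <YOUR_IP> is a placeholder for your own IP, and SSH should stay allowed before enabling the firewall):

sudo ufw allow OpenSSH
sudo ufw allow from <YOUR_IP> to any port 8082 proto tcp
sudo ufw enable
sudo ufw status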

⸻

🧠 Kafka on Linux/Cloud: advertised listeners

The compose file sets:

KAFKA_CFG_ADVERTISED_LISTENERS=PLAINTEXT://host.docker.internal:9092

  • macOS/Windows host tools → OK.
  • Linux host tools → host.docker.internal may not resolve. Use either:
      • inter-container only: PLAINTEXT://kafka:9092
      • external clients: PLAINTEXT://<HOST_PUBLIC_IP>:9092

Update kafka service env in docker-compose.yml accordingly when running on Linux or Cloud.
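If you have kcat installed on the host, you can confirm which address the broker actually advertises after changing the listener config (a quick metadata dump; kcat is not part of the stack itself):

kcat -b localhost:9092 -L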

⸻

🧪 Verifying the stack

Check logs

docker compose logs -f airflow

Jupyter token

docker logs jupyter | grep -m1 token=

Postgres connectivity from host

psql postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@localhost:5432/${POSTGRES_DB}
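Kafka smoke test (a minimal sketch; it assumes the broker service is named kafka and that the Bitnami image's kafka-topics.sh script is on the PATH inside the container):

docker compose exec kafka kafka-topics.sh --bootstrap-server kafka:9092 --create --topic smoke-test
docker compose exec kafka kafka-topics.sh --bootstrap-server kafka:9092 --list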

⸻

🛑 Stop / Reset

Stop but keep data

docker compose down

Stop AND remove named volumes (data loss for Postgres/Mongo/HDFS!)

docker compose down -v

⸻

🧰 Troubleshooting

  • Port already in use → Change the host port in docker-compose.yml or free the port (see the example below).
  • Apple Silicon → Hadoop images run under linux/amd64 emulation. Expect overhead; allocate enough RAM/CPU in Docker Desktop.
  • Airflow user creation repeats → The command tolerates an existing user (|| true). Check the airflow logs if the webserver/scheduler don't start.
  • Kafka clients can't connect → Fix KAFKA_CFG_ADVERTISED_LISTENERS for your environment (see the Kafka section).
  • Permission on the Docker socket (Linux) → If Airflow needs Docker access via /var/run/docker.sock, ensure your user is in the docker group, or run with sudo.
  • Windows WSL2 networking → On Windows with the WSL2 backend, localhost works for exposed ports. If containers can't connect to each other via host.docker.internal, update the compose file to use service names (e.g., kafka:9092) instead.
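To see what is currently holding a port (8080 in this example) on macOS or Linux:

lsof -nP -iTCP:8080 -sTCP:LISTEN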

⸻

🔄 Upgrading & Version pinning

  • Change image tags in docker-compose.yml (e.g., :1.1 or :3.5.0).
  • Then:

docker compose pull
docker compose up -d

⸻

📚 Folder structure (suggested)

.
├── docker-compose.yml
├── README.md
├── .env.example -> copy to .env
├── airflow/
│   ├── Dockerfile
│   ├── requirements.txt
│   ├── dags/
│   ├── logs/
│   └── plugins/
└── jupyter/
    ├── Dockerfile
    └── requirements.txt

⸻

📝 Credits

  • Spark image by Bitnami
  • Hadoop images by bde2020
  • Kafka/Zookeeper images by Bitnami
  • Airflow by Apache Airflow
  • Jupyter images by Project Jupyter

⸻

📄 License

Choose a license (e.g., MIT) and add a LICENSE file if publishing publicly.
