# DataForge

Docker-first data engineering stack: Spark, Hadoop (HDFS), Kafka + Zookeeper, Airflow, Postgres, MongoDB, and Jupyter, all wired together with Docker Compose.

⸻
## ✨ What you get

- Spark master/worker with web UIs
- HDFS NameNode/DataNode
- Kafka + Zookeeper
- Airflow (LocalExecutor) pre-wired to Postgres & Spark
- Postgres + MongoDB with persistent named volumes
- Jupyter (PySpark-ready)

Spark uses the official Bitnami image. Airflow and Jupyter are custom images hosted on Docker Hub under `ashutro`.

⸻
## 📦 Docker Hub Images

- `ashutro/dataforge-airflow:1.0` (custom Airflow image)
- `ashutro/dataforge-jupyter:1.0` (custom Jupyter image)
- `bitnami/spark` (official Spark image)

These images extend official bases:

- `ashutro/dataforge-airflow` → based on `apache/airflow:2.9.2` with extra JDK and Python dependencies.
- `ashutro/dataforge-jupyter` → based on `jupyter/base-notebook` (with PySpark support).

This ensures compatibility with the official distributions while adding project-specific tooling.
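As an illustration, a custom Airflow image of this kind might be built roughly like this. This is a sketch only; the exact package list is an assumption, so see `airflow/Dockerfile` in the repo for the actual build:

```dockerfile
# Hypothetical sketch, not the repo's actual Dockerfile.
FROM apache/airflow:2.9.2

# Install a JDK so Airflow tasks can invoke spark-submit.
USER root
RUN apt-get update && apt-get install -y --no-install-recommends default-jdk \
    && rm -rf /var/lib/apt/lists/*

# Switch back to the airflow user before installing Python dependencies.
USER airflow
COPY requirements.txt /requirements.txt
RUN pip install --no-cache-dir -r /requirements.txt
```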
⸻
## 🧰 Prerequisites

- Docker (Docker Desktop on macOS/Windows; Docker Engine on Linux)
- Docker Compose v2 (bundled with Docker Desktop; `docker compose version` should work)
- Git (for cloning the repo)
- (Optional) Docker Hub account if you plan to push your own images
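A quick sanity check that the tooling is in place:

```bash
docker --version          # Docker Engine / Docker Desktop
docker compose version    # Compose v2 plugin
git --version
```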
⸻
## 🚀 Quick Start (local)

```bash
# Clone the repo
git clone https://github.com/ashutro/dataforge.git
cd dataforge

# Create your env file from the template
cp .env.example .env

# Build the custom Airflow and Jupyter images
docker build -t ashutro/dataforge-airflow:1.0 ./airflow
docker build -t ashutro/dataforge-jupyter:1.0 ./jupyter

# Start the stack and check container status
docker compose up -d
docker compose ps
```

⸻
## 🌐 Access URLs (localhost)

- Airflow Web UI → http://localhost:8082 (user/pass: `${AIRFLOW_ADMIN_USER}` / `${AIRFLOW_ADMIN_PASSWORD}`, from `.env`)
- Spark Master UI → http://localhost:8080
- Spark Worker UI → http://localhost:8081
- HDFS NameNode UI → http://localhost:9870
- Kafka Broker → localhost:9092 (for host tools on macOS; see the Kafka notes below for Linux/Cloud)
- Postgres → localhost:5432 (user: `${POSTGRES_USER}`, db: `${POSTGRES_DB}`)
- MongoDB → localhost:27017 (user: `${MONGO_INITDB_ROOT_USERNAME}`)
- Jupyter → http://localhost:8888; the first login requires the token printed in the logs: `docker logs jupyter | grep -m1 token=`

⸻
## ⚙️ Environment Variables

Copy `.env.example` to `.env` and edit the values:

```bash
POSTGRES_USER=your_user
POSTGRES_PASSWORD=your_password
POSTGRES_DB=dataforge

MONGO_INITDB_ROOT_USERNAME=your_mongo_root
MONGO_INITDB_ROOT_PASSWORD=your_mongo_password

AIRFLOW_ADMIN_USER=admin
AIRFLOW_ADMIN_PASSWORD=admin
AIRFLOW_ADMIN_FIRSTNAME=Ashutosh
AIRFLOW_ADMIN_LASTNAME=Kumar
AIRFLOW_ADMIN_EMAIL=admin@example.com
```

`docker-compose.yml` uses these variables for container env and Airflow user creation.
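For illustration, that interpolation looks roughly like this. This is a hypothetical excerpt (service names like `airflow-init` are assumptions); see `docker-compose.yml` for the real service definitions:

```yaml
# Hypothetical excerpt showing how .env values flow into services.
services:
  postgres:
    image: postgres
    environment:
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
      POSTGRES_DB: ${POSTGRES_DB}

  airflow-init:
    # "|| true" tolerates an already-existing admin user on re-runs.
    command: >
      bash -c "airflow users create
      --username ${AIRFLOW_ADMIN_USER}
      --password ${AIRFLOW_ADMIN_PASSWORD}
      --firstname ${AIRFLOW_ADMIN_FIRSTNAME}
      --lastname ${AIRFLOW_ADMIN_LASTNAME}
      --role Admin
      --email ${AIRFLOW_ADMIN_EMAIL} || true"
```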
⸻
## 🏗️ Building & Publishing Images (optional)

You only need to build/push Airflow and Jupyter (Spark uses the official Bitnami image):

```bash
docker build -t ashutro/dataforge-airflow:1.0 ./airflow
docker build -t ashutro/dataforge-jupyter:1.0 ./jupyter

docker login
docker push ashutro/dataforge-airflow:1.0
docker push ashutro/dataforge-jupyter:1.0
```

Once published, anyone can run the stack without building locally, as shown below.
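For example, a consumer of the published images can skip the build step entirely:

```bash
# No local build required once the images are on Docker Hub.
docker compose pull
docker compose up -d
```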
⸻
## 🔽 Pulling the images directly

You can use these images directly in your own projects, or extend them as bases for your own builds:

```bash
docker pull ashutro/dataforge-airflow:1.0
docker pull ashutro/dataforge-jupyter:1.0
docker pull bitnami/spark:3.5.0
```

⸻
## 🧩 Service-to-Service Connection Details

- Spark Master URL (inside containers): `spark://spark-master:7077`
- Kafka bootstrap (inside containers): `kafka:9092`
- Kafka bootstrap (host tools on macOS/Windows): `host.docker.internal:9092` (as configured); for Linux or Cloud, see the Kafka notes below
- Postgres DSN: `postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@postgres:5432/${POSTGRES_DB}`
- Mongo URI: `mongodb://${MONGO_INITDB_ROOT_USERNAME}:${MONGO_INITDB_ROOT_PASSWORD}@mongo:27017`
- Airflow is configured with `AIRFLOW_CONN_SPARK_DEFAULT=spark://spark-master:7077`

A quick way to exercise the in-network Spark URL is sketched below.
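For example (assuming the Jupyter service is named `jupyter` in the compose file), you can smoke-test the Spark master URL from inside the Docker network:

```bash
# Hypothetical check: start a tiny PySpark session against the in-network master URL.
docker compose exec jupyter python -c "
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('spark://spark-master:7077').appName('conn-check').getOrCreate()
spark.range(5).show()
spark.stop()
"
```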
⸻
## 📈 Scaling Spark Workers

```bash
docker compose up -d --scale spark-worker=3
```

View the workers in the Spark Master UI (http://localhost:8080).

⸻
## ☁️ Run on a Cloud VM (Ubuntu example)

1. Create a VM (Ubuntu 22.04). Open the security group/firewall for the ports you need (or use the SSH tunnels below).

2. SSH to the VM and install Docker & Compose:

```bash
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
newgrp docker
```

3. Clone and configure:

```bash
git clone https://github.com/ashutro/dataforge.git
cd dataforge
cp .env.example .env
```

4. (Optional) Log in and pull your published images:

```bash
docker login
docker compose pull
```

5. Start:

```bash
docker compose up -d
docker compose ps
```

6. Access the UIs securely using SSH tunnels (recommended). On macOS/Linux:

```bash
ssh -L 8082:localhost:8082 -L 8888:localhost:8888 -L 8080:localhost:8080 \
    -L 9870:localhost:9870 -L 9092:localhost:9092 user@your-vm-ip
```

On Windows (PowerShell):

```powershell
ssh.exe -L 8082:localhost:8082 -L 8888:localhost:8888 -L 8080:localhost:8080 `
    -L 9870:localhost:9870 -L 9092:localhost:9092 user@your-vm-ip
```

Then open http://localhost:8082, http://localhost:8888, etc.

You can also use PuTTY if you prefer a GUI:

- Go to Connection > SSH > Tunnels
- Add the same local ports (8082, 8888, 8080, 9870, 9092), each mapped to `localhost:PORT`
- Save and open the session to establish the tunnels

If you must expose ports publicly, ensure proper firewall rules and credentials.

⸻
## 🔧 Kafka on Linux/Cloud: advertised listeners

The compose file sets:

```
KAFKA_CFG_ADVERTISED_LISTENERS=PLAINTEXT://host.docker.internal:9092
```

- macOS/Windows host tools → OK.
- Linux host tools → `host.docker.internal` may not resolve. Use either:
  - inter-container only: `PLAINTEXT://kafka:9092`
  - external clients: `PLAINTEXT://<HOST_PUBLIC_IP>:9092`

Update the `kafka` service env in `docker-compose.yml` accordingly when running on Linux or Cloud, for example as sketched below.
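One common shape for Linux/Cloud deployments is to split internal and external listeners. The sketch below is an assumption (the listener names and port 9094 are illustrative, not the repo's exact config), using Bitnami's `KAFKA_CFG_*` env convention:

```yaml
# Hypothetical kafka service excerpt: INTERNAL for containers, EXTERNAL for host/remote clients.
kafka:
  environment:
    - KAFKA_CFG_LISTENERS=INTERNAL://:9092,EXTERNAL://:9094
    - KAFKA_CFG_ADVERTISED_LISTENERS=INTERNAL://kafka:9092,EXTERNAL://<HOST_PUBLIC_IP>:9094
    - KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP=INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT
    - KAFKA_CFG_INTER_BROKER_LISTENER_NAME=INTERNAL
  ports:
    - "9094:9094"   # expose only the external listener to the host
```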
⸻
## 🧪 Verifying the stack

```bash
# Tail the Airflow logs
docker compose logs -f airflow

# Grab the Jupyter login token
docker logs jupyter | grep -m1 token=

# Connect to Postgres from the host (requires a local psql client)
psql postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@localhost:5432/${POSTGRES_DB}
```
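Two further smoke tests you could run, assuming the service names `kafka` and `spark-master` and the Bitnami images' default script/jar locations:

```bash
# Round-trip one message through Kafka with the console tools bundled in the Bitnami image.
echo "hello" | docker compose exec -T kafka kafka-console-producer.sh \
  --bootstrap-server kafka:9092 --topic smoke-test
docker compose exec kafka kafka-console-consumer.sh \
  --bootstrap-server kafka:9092 --topic smoke-test --from-beginning --max-messages 1

# Run the bundled SparkPi example against the master.
docker compose exec spark-master bash -c \
  'spark-submit --master spark://spark-master:7077 \
    --class org.apache.spark.examples.SparkPi \
    /opt/bitnami/spark/examples/jars/spark-examples_*.jar 10'
```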
⸻
## 🛑 Stop / Reset

```bash
# Stop and remove containers (named volumes are kept)
docker compose down

# Also remove named volumes (wipes Postgres/Mongo/HDFS data)
docker compose down -v
```

⸻
## 🧰 Troubleshooting

- Port already in use → change the host port in `docker-compose.yml` or free the port.
- Apple Silicon → Hadoop images run under linux/amd64 emulation. Expect overhead; allocate enough RAM/CPU in Docker Desktop.
- Airflow user creation repeats → the command tolerates an existing user (`|| true`). Check the airflow logs if the webserver/scheduler don't start.
- Kafka clients can't connect → fix `KAFKA_CFG_ADVERTISED_LISTENERS` for your environment (see the Kafka section).
- Permission on the Docker socket (Linux) → if Airflow needs Docker access via `/var/run/docker.sock`, ensure your user is in the `docker` group, or run with sudo.
- Windows WSL2 networking → on Windows with the WSL2 backend, localhost works for exposed ports. If containers can't connect to each other via `host.docker.internal`, update the compose file to use service names (e.g., `kafka:9092`) instead.

⸻
## 🔄 Upgrading & Version pinning

- Change the image tags in `docker-compose.yml` (e.g., `:1.1` or `:3.5.0`).
- Then:

```bash
docker compose pull
docker compose up -d
```

⸻
## 📁 Folder structure (suggested)

```
.
├── docker-compose.yml
├── README.md
├── .env.example        # copy to .env
├── airflow/
│   ├── Dockerfile
│   ├── requirements.txt
│   ├── dags/
│   ├── logs/
│   └── plugins/
└── jupyter/
    ├── Dockerfile
    └── requirements.txt
```

⸻
## 🙌 Credits

- Spark image by Bitnami
- Hadoop images by bde2020
- Kafka/Zookeeper images by Bitnami
- Airflow by Apache Airflow
- Jupyter images by Project Jupyter

⸻
## 📄 License

Choose a license (e.g., MIT) and add a LICENSE file if publishing publicly.