This project provides a containerized environment for Apache Spark, Apache Hive, and Apache Hadoop YARN using Docker.
- Apache Spark 3.4.1
- Apache Hive 3.1.3
- Apache Hadoop 3.3.5
- Apache YARN (Hadoop 3.3.5)
DockerSpark/
├── docker-compose.yml
├── Dockerfile
├── config/
│   ├── spark/
│   ├── hive/
│   └── hadoop/
├── scripts/
│   ├── entrypoint.sh
│   ├── init-hdfs.sh
│   └── init-yarn.sh
├── logs/
│   ├── spark/
│   ├── hive/
│   └── yarn/
└── data/
- Docker Engine 20.10.0 or later
- Docker Compose v2.0.0 or later
- At least 8GB of RAM allocated to Docker
- At least 20GB of free disk space
- Clone the repository:
  git clone <repository-url>
  cd DockerSpark
- Build and start the containers:
  docker-compose up -d
- Wait for all services to initialize (about 1-2 minutes)
- Access the service UIs and endpoints:
  - Spark Master UI: http://localhost:8080
  - Spark Worker UI: http://localhost:8081
  - YARN ResourceManager UI: http://localhost:8088
  - YARN NodeManager UI: http://localhost:8042
  - Hive Server (JDBC): localhost:10000
  - Hive Metastore (Thrift): localhost:9083
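
Instead of waiting a fixed one to two minutes, a script can poll the web UIs until they respond. The following is a minimal sketch, not part of the project: the function name `wait_for_url` and its default retry values are hypothetical, and it assumes `curl` is installed on the host.

```shell
# Poll a URL until it answers or the attempts run out.
# Usage: wait_for_url URL [MAX_ATTEMPTS] [DELAY_SECONDS]
wait_for_url() {
    url=$1
    max_attempts=${2:-30}   # assumed default: 30 attempts
    delay=${3:-5}           # assumed default: 5 seconds between attempts
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # -s: silent, -f: treat HTTP errors (>= 400) as failure,
        # -o /dev/null: discard the response body
        if curl -sf -o /dev/null "$url"; then
            echo "ready: $url"
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    echo "timed out: $url" >&2
    return 1
}
```

For example, `wait_for_url http://localhost:8080 && wait_for_url http://localhost:8088` blocks until the Spark Master and YARN ResourceManager UIs answer. The Hive ports typically speak Thrift rather than HTTP, so this check is not suited to them.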
# Submit a Python application
docker exec spark-master /opt/spark/bin/spark-submit \
--master spark://spark-master:7077 \
/path/to/your/app.py
# Submit a Spark SQL application
docker exec spark-master /opt/spark/bin/spark-sql \
--master spark://spark-master:7077 \
-f /path/to/your/query.sql

# Submit in client mode
docker exec spark-master /opt/spark/bin/spark-submit \
--master yarn \
--deploy-mode client \
/path/to/your/app.py
# Submit in cluster mode
docker exec spark-master /opt/spark/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
/path/to/your/app.py
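
Because `docker exec` runs inside the container, the script path passed to `spark-submit` must exist in the container's filesystem. The sketch below copies a host file in first and then submits it. The helper name `submit_app` is hypothetical; the `/opt/spark/work-dir` target is the same directory the test commands in this README use.

```shell
# Copy a local PySpark script into the spark-master container and submit it.
# Usage: submit_app LOCAL_SCRIPT [MASTER_URL]
submit_app() {
    script=$1
    # Default to the standalone master; pass "yarn" to run on YARN instead.
    master=${2:-spark://spark-master:7077}
    target=/opt/spark/work-dir/$(basename "$script")
    # Only submit if the copy succeeded.
    docker cp "$script" "spark-master:$target" &&
    docker exec spark-master /opt/spark/bin/spark-submit \
        --master "$master" \
        "$target"
}
```

`submit_app ./app.py` targets the standalone master, and `submit_app ./app.py yarn` submits to YARN in client mode, which is what `spark-submit` uses when no `--deploy-mode` is given.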
- Connect to Hive Server:
  docker exec -it hive-server beeline -u jdbc:hive2://localhost:10000
- Run Hive queries:
  -- Create a database
  CREATE DATABASE example;
  USE example;
  -- Create a table
  CREATE TABLE test (id INT, name STRING);
  -- Insert data
  INSERT INTO test VALUES (1, 'test1'), (2, 'test2');
  -- Query data
  SELECT * FROM test;
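
For scripted checks, a statement can also be sent without opening an interactive session. This sketch relies on beeline's `-e` option, which executes the query given on the command line; the wrapper name `hive_query` is hypothetical.

```shell
# Run one HiveQL statement through beeline non-interactively.
# Usage: hive_query 'STATEMENT;'
hive_query() {
    # No -it here: nothing is read from a terminal.
    docker exec hive-server beeline \
        -u jdbc:hive2://localhost:10000 \
        -e "$1"
}
```

For example, `hive_query 'SELECT * FROM example.test;'` would print the rows inserted by the statements above, assuming they were run first.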
# List HDFS contents
docker exec yarn-resourcemanager hdfs dfs -ls /
# Create a directory
docker exec yarn-resourcemanager hdfs dfs -mkdir -p /user/data
# Upload a file (the source path is resolved inside the container)
docker exec yarn-resourcemanager hdfs dfs -put /path/to/local/file /user/data/
# View a file
docker exec yarn-resourcemanager hdfs dfs -cat /user/data/file
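
The `-put` command reads its source from the container's filesystem, so a file that lives on the host has to be copied into the container first. A minimal sketch of that two-step upload follows; the helper name `hdfs_put` is hypothetical, and staging the file under `/tmp` assumes that directory is writable in the container.

```shell
# Stage a host file in the container, then upload it to HDFS.
# Usage: hdfs_put HOST_FILE HDFS_DIR
hdfs_put() {
    src=$1
    dest=$2
    staged=/tmp/$(basename "$src")   # assumed staging location
    docker cp "$src" "yarn-resourcemanager:$staged" &&
    docker exec yarn-resourcemanager hdfs dfs -mkdir -p "$dest" &&
    # -f overwrites an existing file of the same name in HDFS
    docker exec yarn-resourcemanager hdfs dfs -put -f "$staged" "$dest/"
}
```

For example, `hdfs_put ./data.csv /user/data` leaves the file at `/user/data/data.csv` in HDFS.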
- Run the included Spark-Hive test:
  docker cp test_spark_hive.py spark-master:/opt/spark/work-dir/
  docker exec spark-master /opt/spark/bin/spark-submit \
    --master spark://spark-master:7077 \
    /opt/spark/work-dir/test_spark_hive.py
- Run the YARN test application:
  docker cp yarn_test.py spark-master:/opt/spark/work-dir/
  docker exec spark-master /opt/spark/bin/spark-submit \
    --master yarn \
    --deploy-mode client \
    /opt/spark/work-dir/yarn_test.py
- View container logs:
  # View Spark Master logs
  docker logs spark-master
  # View YARN ResourceManager logs
  docker logs yarn-resourcemanager
- Access container shell:
  # Access Spark Master
  docker exec -it spark-master bash
  # Access YARN ResourceManager
  docker exec -it yarn-resourcemanager bash
- Stop all services:
  docker-compose down
- Reset everything (including volumes):
  docker-compose down -v
- If YARN applications fail to start:
  - Check YARN ResourceManager UI (http://localhost:8088)
  - Verify HDFS is running:
    docker exec yarn-resourcemanager hdfs dfsadmin -report
  - Check container logs for errors
- If Spark applications fail:
  - Check Spark Master UI (http://localhost:8080)
  - Verify Spark Worker is connected
  - Check application logs in the respective UI
- If Hive queries fail:
  - Verify Hive Metastore is running
  - Check Hive Server logs
  - Ensure the database exists and permissions are correct
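
A quick first step for any of the cases above is to confirm that the containers themselves are still running. The sketch below uses `docker inspect` with a Go template to read each container's running state. The function name `check_containers` is hypothetical, and the container names in the usage example are the ones that appear in this README's commands.

```shell
# Report whether each named container is running.
# Usage: check_containers NAME...
check_containers() {
    status=0
    for name in "$@"; do
        # Prints "true" for a running container; errors (for example, an
        # unknown name) are silenced and treated as "down".
        running=$(docker inspect -f '{{.State.Running}}' "$name" 2>/dev/null)
        if [ "$running" = "true" ]; then
            echo "up: $name"
        else
            echo "down: $name"
            status=1
        fi
    done
    # Non-zero exit status if any container was down.
    return "$status"
}
```

For example, `check_containers spark-master yarn-resourcemanager hive-server` prints one line per container and returns a non-zero status if any of them is not running.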
Configuration files are located in the config directory:
- config/spark/: Spark configuration files
- config/hive/: Hive configuration files
- config/hadoop/: Hadoop/YARN configuration files
You can submit Spark applications to YARN in either client or cluster mode:
# Client mode
spark-submit --master yarn --deploy-mode client your_app.py
# Cluster mode
spark-submit --master yarn --deploy-mode cluster your_app.py

The YARN ResourceManager is configured with a default queue that has:
- 100% capacity allocation
- Minimum user limit of 100%
- Fair scheduler for resource allocation
HDFS is configured with:
- NameNode running on the YARN ResourceManager container
- DataNode running on the YARN NodeManager container
- Default replication factor of 1 for development purposes
- Root directories for Spark and YARN applications
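
To confirm which HDFS settings are actually in effect, the `hdfs getconf` command can print the value of a single configuration key. This is a minimal sketch with a hypothetical wrapper name, `hdfs_conf`; it assumes the `hdfs` client in the ResourceManager container reads the same configuration as the NameNode.

```shell
# Print the effective value of an HDFS configuration key.
# Usage: hdfs_conf KEY
hdfs_conf() {
    docker exec yarn-resourcemanager hdfs getconf -confKey "$1"
}
```

For example, `hdfs_conf dfs.replication` shows the replication factor, which should match the value listed above if the configuration was applied as described.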
You can monitor your applications through:
- YARN ResourceManager UI (http://localhost:8088) for:
  - Application status
  - Resource usage
  - Container allocation
- YARN NodeManager UI (http://localhost:8042) for:
  - Container logs
  - Node health
- Spark Master UI (http://localhost:8080) for:
  - Spark application details
  - Job progress
  - Stage information
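
The same application status is available from the command line through the ResourceManager's REST API, which serves the application list under `/ws/v1/cluster/apps` and accepts a `states` filter. The sketch below is not part of the project: the function name `yarn_apps` is hypothetical, and it assumes `curl` is installed on the host and that port 8088 is published as listed above.

```shell
# List YARN applications in a given state as JSON.
# Usage: yarn_apps [STATE]
yarn_apps() {
    state=${1:-RUNNING}   # assumed default: only running applications
    curl -s -H 'Accept: application/json' \
        "http://localhost:8088/ws/v1/cluster/apps?states=$state"
}
```

For example, `yarn_apps FINISHED` returns the completed applications, and `yarn_apps` with no argument returns the ones currently running.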