🏢 Spark on YARN Architecture (Client Mode)


🚀 Installation Guide

Step 1: Clone the Repository

  git clone https://github.com/huy-dataguy/Spark-on-YARN.git
  cd Spark-on-YARN

Step 2: Build the Base Image

⏳ Note: The first build may take a few minutes as no cached layers exist.

  docker build -t base -f docker/base.dockerfile .
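The actual docker/base.dockerfile lives in the repository; as a rough, purely hypothetical sketch of what such a Hadoop/Spark base image typically contains (the versions, mirrors, and paths below are assumptions, not taken from this repo):

```dockerfile
# Hypothetical sketch only -- check docker/base.dockerfile for the real contents.
FROM openjdk:8-jdk-slim

# Versions are assumptions for illustration.
ENV HADOOP_VERSION=3.3.6 \
    SPARK_VERSION=3.5.1

RUN apt-get update && apt-get install -y curl ssh && \
    curl -fsSL https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz | tar -xz -C /opt && \
    curl -fsSL https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop3.tgz | tar -xz -C /opt

ENV HADOOP_HOME=/opt/hadoop-${HADOOP_VERSION} \
    SPARK_HOME=/opt/spark-${SPARK_VERSION}-bin-hadoop3
ENV PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin
```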

Step 3: Build and Start Cluster

  • Build the images (required on the first run, or after changing a Dockerfile):
  docker compose -f docker/compose.yaml build
  • Start the containers:
  docker compose -f docker/compose.yaml up -d
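As a rough sketch of what docker/compose.yaml might define (service names, dockerfile names, and the build context are assumptions; only the host ports come from the Web UI section of this README, and 8088/9870/4040 are the stock YARN, NameNode, and Spark UI ports):

```yaml
# Hypothetical sketch only -- the real file is docker/compose.yaml in the repo.
services:
  master:
    build:
      context: ..
      dockerfile: docker/master.dockerfile   # assumed name
    hostname: master
    ports:
      - "9870:9870"   # NameNode UI
      - "9004:8088"   # YARN ResourceManager UI, exposed on host port 9004
      - "4040:4040"   # Spark application UI
  worker1:
    build:
      context: ..
      dockerfile: docker/worker.dockerfile   # assumed name
    hostname: worker1
```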

Step 4: Verify the Installation

  • Open a shell inside the master container (e.g. with docker exec)

💡 Start the HDFS and YARN services:

  start-dfs.sh
  start-yarn.sh

Step 5: Run spark-submit in YARN Client Mode

Create a folder to store the Spark logs:

  hdfs dfs -mkdir /spark-logs
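A directory like /spark-logs is typically used as the Spark event-log location. Assuming event logging is enabled in this setup (the repo's actual configuration may differ), the corresponding spark-defaults.conf entries would look like:

```
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///spark-logs
spark.history.fs.logDirectory    hdfs:///spark-logs
```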

Run Spark on YARN:

  spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode client \
  $SPARK_HOME/examples/jars/spark-examples_*.jar 10
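SparkPi estimates π with a Monte Carlo method: it scatters random points over the unit square and counts how many land inside the quarter circle. A minimal plain-Python sketch of the same idea (no Spark required; the function name is mine, not from the Spark example):

```python
import random

def estimate_pi(num_samples: int, seed: int = 42) -> float:
    """Monte Carlo estimate of pi: the fraction of random points in the
    unit square that fall inside the quarter circle, multiplied by 4."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples

print(estimate_pi(100_000))  # roughly 3.14
```

The Spark version distributes exactly this sampling loop across YARN containers, which is why the argument 10 (the number of partitions) controls how many tasks the job spawns.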

If it succeeds, the output will include an estimate of Pi close to 3.14159.


🌐 Interact with the Web UI

You can access the following web interfaces to monitor and manage your Hadoop cluster:

  • YARN ResourceManager UI: http://localhost:9004
    Provides an overview of cluster resource usage, running applications, and job details.

  • NameNode UI: http://localhost:9870
    Displays HDFS file system details, block distribution, and overall health status.

  • Spark Web UI: http://localhost:4040
    Provides an interface to monitor running Spark jobs, stages, and tasks. Note: because the application runs in YARN client mode, the Spark UI will automatically redirect to the master node's web UI.
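Alongside the web UI, the YARN ResourceManager also exposes a REST API (e.g. /ws/v1/cluster/apps on the same port). A small Python sketch that extracts application states from a response of that shape; the JSON below is a made-up example, not real cluster output:

```python
import json

# Example payload shaped like YARN's /ws/v1/cluster/apps response.
# All values here are invented for illustration.
sample = json.loads("""
{"apps": {"app": [
  {"id": "application_1700000000000_0001",
   "name": "Spark Pi", "state": "FINISHED", "finalStatus": "SUCCEEDED"}
]}}
""")

def app_states(payload: dict) -> dict:
    """Map application id -> (state, finalStatus) from a cluster/apps payload."""
    apps = (payload.get("apps") or {}).get("app") or []
    return {a["id"]: (a["state"], a["finalStatus"]) for a in apps}

print(app_states(sample))
```

Pointing this at http://localhost:9004/ws/v1/cluster/apps (the port mapped above) is a convenient way to check job status from scripts instead of the browser.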


📞 Contact

📧 Email: quochuy.working@gmail.com

💬 Feel free to contribute and improve this project! 🚀

About

This repository contains the configuration and scripts necessary to run Apache Spark on a Hadoop YARN cluster in client mode. The setup allows you to leverage the scalability of YARN for distributed data processing with Spark.
