Analyze job market trends using LinkedIn data on a Big Data stack (Docker + Hadoop + Spark) for the CSE587 Data Intensive Computing course.
- Ingest, process, and analyze large-scale job postings and skills data.
- Orchestrated via Docker Compose with a Hadoop (HDFS/YARN) + Spark cluster.
- Includes notebooks for EDA and a sample Spark job (word count) running on HDFS.
Tech stack:
- Docker & Docker Compose
- Hadoop (HDFS, YARN)
- Apache Spark (PySpark)
- Jupyter Notebooks (EDA)
Prerequisites:
- Docker Desktop installed and running.
- Adequate resources allocated to Docker (CPU/RAM) for Hadoop/Spark.
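Besides raising the global Docker Desktop allocation, limits can also be pinned per service in the compose file. A hypothetical fragment (the service name `namenode` and the exact limits are assumptions, not taken from the project's compose file — tune to your machine):

```yaml
services:
  namenode:
    # Cap memory so the NameNode JVM heap fits alongside the other containers.
    mem_limit: 2g
    cpus: 1.5
```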
Start the Hadoop + Spark cluster using the provided compose file.
```bash
docker compose -f CSE587Project/docker-compose.yaml up -d
```

Expected output (abbreviated):

```text
[+] Running 5/5
 ✔ Network project_default               Created
 ✔ Container project-resourcemanager-1   Started
 ✔ Container project-namenode-1          Started
 ✔ Container project-datanode1-1         Started
 ✔ Container project-nodemanager1-1      Started
```
To stop and remove the cluster:

```bash
docker compose -f CSE587Project/docker-compose.yaml down
```

Upload input data to HDFS after the cluster is up.
- Open a shell in the NameNode container:

```bash
docker exec -it project-namenode-1 bash
```

- Create the input directory and upload a file:

```bash
hdfs dfs -mkdir /input
hdfs dfs -put README.txt /input/wc.txt
```

- Verify the upload:

```bash
hdfs dfs -cat /input/wc.txt
```

Web UI for the HDFS file explorer: http://localhost:9870/explorer.html#/
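The same checks can be scripted against the NameNode's WebHDFS REST API (`LISTSTATUS` is a standard WebHDFS operation; the host and port below assume the UI mapping shown above). A minimal sketch:

```python
import json
import urllib.request

def webhdfs_url(path: str, op: str, host: str = "localhost", port: int = 9870) -> str:
    """Build a WebHDFS v1 URL for the given HDFS path and operation."""
    return f"http://{host}:{port}/webhdfs/v1{path}?op={op}"

def list_dir(path: str) -> list:
    """Return the entry names under an HDFS directory via LISTSTATUS."""
    with urllib.request.urlopen(webhdfs_url(path, "LISTSTATUS")) as resp:
        statuses = json.load(resp)["FileStatuses"]["FileStatus"]
    return [s["pathSuffix"] for s in statuses]

# Usage (with the cluster up):
#   print(list_dir("/input"))
```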
Submit a simple PySpark job against data stored in HDFS.
- Create `spark.py` (inside the NameNode shell or your workspace):
```python
from pyspark import SparkConf, SparkContext

def main():
    conf = SparkConf().setAppName("WordCountDemo")
    sc = SparkContext(conf=conf)

    input_path = "hdfs://namenode/input/wc.txt"
    output_path = "hdfs://namenode/output/wordcount_result"

    text_file = sc.textFile(input_path)
    counts = (
        text_file
        .flatMap(lambda line: line.split())
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)
    )
    counts.saveAsTextFile(output_path)
    sc.stop()

if __name__ == "__main__":
    main()
```

- Submit the job (inside the NameNode shell):
```bash
./spark/bin/spark-submit --master yarn --deploy-mode cluster spark.py
```

- Check results on HDFS:
```bash
hdfs dfs -ls /output/wordcount_result
hdfs dfs -cat /output/wordcount_result/part-*
```

Project structure:
- CSE587Project/: Core project files and Docker configuration
- phase1/: Phase 1 report and ingestion scripts
- EDA.ipynb: Exploratory Data Analysis notebook
- Local EDA.ipynb: Local environment analysis
- linkedin-jobs-and-skills-eda-project-6.ipynb: Additional EDA notebook
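The flatMap → map → reduceByKey pipeline in `spark.py` is an ordinary word count; a plain-Python equivalent (no cluster needed) is handy for sanity-checking expected results on a small sample:

```python
from collections import Counter

def word_count(lines):
    """Mirror the Spark pipeline: split each line on whitespace, count words."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())  # flatMap + map(word -> (word, 1))
    return dict(counts)              # reduceByKey(lambda a, b: a + b)

sample = ["spark on yarn", "spark on hdfs"]
print(word_count(sample))  # {'spark': 2, 'on': 2, 'yarn': 1, 'hdfs': 1}
```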
- If containers don’t start, ensure Docker resources are sufficient.
- Use the HDFS Web UI to quickly validate file locations.
- Adapt `spark.py` to your dataset paths and analysis tasks.
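One way to adapt the script without editing hard-coded constants is to take the HDFS paths from the command line (the `--input`/`--output` argument names below are illustrative, not part of the project):

```python
import argparse

def parse_paths(argv=None):
    """Parse input/output HDFS paths so spark-submit can target any dataset."""
    parser = argparse.ArgumentParser(description="Word count over HDFS data")
    parser.add_argument("--input", default="hdfs://namenode/input/wc.txt")
    parser.add_argument("--output", default="hdfs://namenode/output/wordcount_result")
    return parser.parse_args(argv)

# Usage: spark-submit ... spark.py --input hdfs://namenode/input/jobs.csv
args = parse_paths([])  # empty list -> defaults, for a quick local check
print(args.input, args.output)
```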