Movie Recommendation System using PySpark

Project Overview

This project is focused on building a movie recommendation system using PySpark to process the movies_metadata.csv dataset in a distributed environment. The environment consists of 10 virtual machines (VMs), where one VM serves as the master node and the other 9 as slave nodes. Spark and Hadoop have been configured across these VMs to enable distributed data processing.

Project Setup

Software and Environment

Operating System: Fedora 37
Hadoop Version: 3.3.6
Spark Version: 3.5.0
Java Version: 21.0.1
Python Version: 3.11.0
Cluster Configuration:
- Master Node: hadoop1 (192.168.13.143)
- Slave Nodes: hadoop2 to hadoop10 (192.168.13.144 - 192.168.13.171)

Directory Structure

The following directory structure is used for the setup:

/opt
  ├── hadoop-3.3.6/
  ├── spark-3.5.0/
  ├── java-21.0.1/
  └── python-3.11.0/

Key Configuration

Passwordless SSH is set up between the master node and all slave nodes to enable communication.
Hadoop and Spark configuration files are updated to run Spark in cluster mode across the nodes.

Dataset

The dataset used is movies_metadata.csv, which includes various metadata related to movies such as:

Revenue
Runtime
Vote count
Vote average
Other relevant attributes

Python Script

The movie_recommend_pyspark.py script implements the logic for processing the dataset and generating movie recommendations. This script is distributed across the Hadoop-Spark cluster, leveraging the power of PySpark for large-scale data processing.

How to Run

Copy the Python Script: Copy the movie_recommend_pyspark.py file from the local machine to the master node (hadoop1).
Run the PySpark Job:
- SSH into the master node:
```
ssh sat3812@192.168.13.143
```
- Submit the PySpark job:
```
spark-submit --master yarn --deploy-mode cluster /opt/movie_recommend_pyspark.py
```
This command will distribute the job across the cluster.
View Results: Once the job completes, the output will be available in the logs of the master node, and the recommendations will be printed in the terminal.

Goals and Deliverables

Setup a distributed environment for Hadoop and Spark.
Run distributed data processing jobs using PySpark.
Generate movie recommendations from the movies_metadata.csv dataset.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.gitignore		.gitignore
README.md		README.md
movie_recommend_pyspark.py		movie_recommend_pyspark.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Movie Recommendation System using PySpark

Project Overview

Project Setup

Software and Environment

Directory Structure

Key Configuration

Dataset

Python Script

How to Run

Goals and Deliverables

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Movie Recommendation System using PySpark

Project Overview

Project Setup

Software and Environment

Directory Structure

Key Configuration

Dataset

Python Script

How to Run

Goals and Deliverables

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages