A scalable movie recommendation system using Apache Spark and ALS.
# Run with 5% sample size
python run_local_test.py --sample_size small
# Run with 10% sample size
python run_local_test.py --sample_size medium
# Run with 25% sample size
python run_local_test.py --sample_size large

# Run with 25% sample size
python run_on_gcp.py --sample_size small
# Run with 50% sample size
python run_on_gcp.py --sample_size medium
# Run with 100% sample size
python run_on_gcp.py --sample_size large

- Python 3.8+
- Apache Spark 3.2+
- PySpark
- NumPy
- Pandas
- Matplotlib
- Seaborn
Place MovieLens 25M dataset in data/ml-25m/:
- ratings.csv
- movies.csv
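
As a minimal sketch of the dataset layout and sample sizes described above, the snippet below checks that both MovieLens files are present and looks up the sampling fraction for a given preset. The fractions come from the run commands. The names `SAMPLE_FRACTIONS`, `missing_dataset_files`, and `sample_fraction` are hypothetical and are not taken from `run_local_test.py` or `run_on_gcp.py`, which may organize this differently.

```python
from pathlib import Path

# Fractions as stated in the run commands; the dictionary layout is an assumption.
SAMPLE_FRACTIONS = {
    "local": {"small": 0.05, "medium": 0.10, "large": 0.25},
    "gcp": {"small": 0.25, "medium": 0.50, "large": 1.00},
}

# Files the README asks for under data/ml-25m/.
REQUIRED_FILES = ("ratings.csv", "movies.csv")


def missing_dataset_files(data_dir="data/ml-25m"):
    """Return the required MovieLens files that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


def sample_fraction(environment, sample_size):
    """Look up the sampling fraction for an environment and size preset."""
    try:
        return SAMPLE_FRACTIONS[environment][sample_size]
    except KeyError:
        raise ValueError(
            f"unknown combination: {environment!r}, {sample_size!r}"
        ) from None
```

Calling `missing_dataset_files()` before a run gives an early, readable error instead of a Spark read failure. With a Spark DataFrame `ratings`, the fraction could then be applied with `ratings.sample(fraction=sample_fraction("local", "small"), seed=42)`.
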
- Local mode: 4GB driver, 6GB executor memory
- Cluster mode: 4GB driver, 8GB executor memory
- ALS parameters in config.py
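
The configuration above could be expressed in PySpark roughly as follows. This is a sketch under stated assumptions, not the project's actual `config.py`. The memory values come from the list above, while the ALS values (`rank`, `maxIter`, `regParam`), the application name, and the helper names `build_session` and `build_als` are hypothetical placeholders. The column names `userId`, `movieId`, and `rating` match the header of MovieLens `ratings.csv`.

```python
# Memory settings from the README, expressed as Spark configuration keys.
MEMORY_PROFILES = {
    "local": {"spark.driver.memory": "4g", "spark.executor.memory": "6g"},
    "cluster": {"spark.driver.memory": "4g", "spark.executor.memory": "8g"},
}

# Hypothetical ALS values for illustration only; the real ones live in config.py.
ALS_PARAMS = {"rank": 10, "maxIter": 10, "regParam": 0.1}


def build_session(mode, app_name="movie-recommender"):
    """Create a SparkSession using the memory profile for 'local' or 'cluster'."""
    # Imported lazily so this module can be inspected without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in MEMORY_PROFILES[mode].items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def build_als(params=None):
    """Create an ALS estimator; requires an active SparkSession."""
    from pyspark.ml.recommendation import ALS

    return ALS(
        userCol="userId",
        itemCol="movieId",
        ratingCol="rating",
        coldStartStrategy="drop",  # skip users/items unseen during training
        **(params or ALS_PARAMS),
    )
```

Driver memory only takes effect if it is set before the driver JVM starts, so when jobs are launched through `spark-submit` it is usually safer to pass `--driver-memory` on the command line than to rely on the builder.
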
- Recommendations saved to output/recommendations/
- Performance metrics in output/metrics/
- Visualizations in output/plots/
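
For scripts that write into this layout, a small helper like the one below can create the three directories up front. It is an illustrative sketch. `ensure_output_dirs` is a hypothetical name, and the project's run scripts may already create these paths themselves.

```python
from pathlib import Path

# Subdirectories named in the README's output list.
OUTPUT_SUBDIRS = ("recommendations", "metrics", "plots")


def ensure_output_dirs(base="output"):
    """Create the output subdirectories if needed and return their paths by name."""
    base_path = Path(base)
    paths = {}
    for name in OUTPUT_SUBDIRS:
        path = base_path / name
        path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        paths[name] = path
    return paths
```

A caller could then write a plot with `plt.savefig(ensure_output_dirs()["plots"] / "rmse.png")`, where the file name is again only an example.
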