karinaantoniu/Recommender-System

Movie Recommender System: Overview

This project implements a Movie Recommender System using both Content-Based Filtering and Collaborative Filtering techniques. The system uses multiple datasets from TMDB and MovieLens to provide personalized movie recommendations and to analyze genre-based profitability. The project is built with Python and the Pandas, NumPy, Matplotlib and Scikit-Learn libraries.

Primary goals:

Implement Content-Based Filtering to recommend movies similar to a given one, and Collaborative Filtering, which is based on user preferences. In addition, I preprocessed the data and analyzed movie profitability and the return on investment (ROI) per genre. Historical trends in movie profits are shown with animated bar charts.

Recommendation Systems

Content-Based Filtering
Content-Based Filtering is an approach in which recommendations are generated by analyzing the items and suggesting those that are similar to items the user has previously interacted with. In movie recommender systems, this involves metadata such as genres, cast, crew, keywords or plot descriptions. These features are represented as high-dimensional vectors, constructed using NLP techniques such as TF-IDF (Term Frequency - Inverse Document Frequency). The degree of similarity between two movies is measured with the cosine similarity (or with the Euclidean distance), and the system ranks movies by their proximity in the feature space.
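The TF-IDF and cosine similarity steps described above can be sketched as follows. This is a minimal illustration, not the repository's code: the movie titles, the tag strings and the recommend helper are made up for the example, and only Scikit-Learn calls that exist (TfidfVectorizer, cosine_similarity) are used.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy metadata per movie (hypothetical titles and tags, not from the dataset).
movies = {
    "Space Quest": "science fiction space adventure astronaut",
    "Star Voyage": "science fiction space alien astronaut",
    "Love in Paris": "romance comedy paris love",
}
titles = list(movies)

# Each movie becomes a TF-IDF vector over the vocabulary of tags.
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(list(movies.values()))

# Pairwise cosine similarity between all movie vectors.
similarity = cosine_similarity(matrix)


def recommend(title, top_n=2):
    """Return the top_n titles most similar to the given title."""
    idx = titles.index(title)
    order = np.argsort(-similarity[idx])  # most similar first
    return [titles[i] for i in order if i != idx][:top_n]


print(recommend("Space Quest", top_n=1))
```

In a real run the tag string would presumably be assembled from the genres, cast, crew and keywords columns of the datasets.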

Collaborative Filtering
Collaborative Filtering rests on the idea that users with similar past preferences will have similar future preferences, so the recommendations come from the interactions of the community. A user-item rating matrix is constructed, which is typically sparse because each user rates only a small fraction of the available items. Similarity can be computed either between users or between items using statistical techniques such as cosine similarity. More advanced implementations involve dimensionality reduction, which can be achieved with the Singular Value Decomposition (SVD). The main disadvantage is the cold-start problem: new users or items have no interaction history to draw on.
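A small sketch of the SVD idea, using plain NumPy on a toy rating matrix. The ratings, the choice of k = 2 latent factors and the recommend helper are assumptions made for illustration; the project itself may rely on a different library or formulation.

```python
import numpy as np

# Rows are users, columns are movies; 0 marks a missing rating (toy data).
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

# Center each user's observed ratings on that user's mean;
# missing entries stay at 0 after centering.
mask = ratings > 0
user_means = ratings.sum(axis=1) / mask.sum(axis=1)
centered = np.where(mask, ratings - user_means[:, None], 0.0)

# Truncated SVD: keep only the k strongest latent factors.
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
k = 2
predicted = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :] + user_means[:, None]


def recommend(user, top_n=1):
    """Return indices of unrated movies, ordered by predicted rating."""
    unseen = np.where(~mask[user])[0]
    ranked = unseen[np.argsort(-predicted[user, unseen])]
    return ranked[:top_n].tolist()


print(recommend(0))
```

A brand-new user would have an all-zero row here, which is where the cold-start limitation shows up: there is nothing to center or factorize.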

Dataset Description

For this project I used four main datasets: Movies Metadata (movies_metadata.csv), Credits (credits.csv), Keywords (keywords.csv) and Ratings (ratings_small.csv).

Data Preprocessing

Parsing Columns
JSON-like columns such as genres, production_companies, and spoken_languages are converted to Python lists using the parseColumn function. Missing values and any invalid entries are replaced with empty lists.
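The parsing step might look roughly like the sketch below. The real parseColumn function is not shown in this README, so parse_column here is a hypothetical stand-in; it assumes the cells hold stringified lists of dictionaries with a name key and keeps only the names.

```python
import ast

import pandas as pd


def parse_column(value):
    """Turn a stringified list of dicts into a list of names, or [] if invalid."""
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError, TypeError):
        # Missing values (None/NaN) and malformed strings end up here.
        return []
    if not isinstance(parsed, list):
        return []
    return [item["name"] for item in parsed
            if isinstance(item, dict) and "name" in item]


# Toy frame with one valid, one missing and one invalid entry.
df = pd.DataFrame({
    "genres": ["[{'id': 18, 'name': 'Drama'}]", None, "not a list"],
})
df["genres"] = df["genres"].apply(parse_column)
print(df["genres"].tolist())
```

ast.literal_eval is used instead of json.loads because the example cells use single quotes, which is not valid JSON.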

Poster URLs
All poster paths are converted to full URLs. If an image URL is invalid or missing, it is replaced with a standard image (question-mark.jpg).
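As a rough sketch of this step, assuming the TMDB image base URL below and treating a path as usable only if it is a string that starts with a slash and ends in an image extension. The poster_url helper is hypothetical, and it does not check whether the remote image actually exists, which would need a network request.

```python
BASE_URL = "https://image.tmdb.org/t/p/w500"  # assumed TMDB image base and size
FALLBACK = "question-mark.jpg"                # placeholder named in this README


def poster_url(path):
    """Build a full poster URL, or return the placeholder for bad input."""
    looks_valid = (
        isinstance(path, str)
        and path.startswith("/")
        and path.lower().endswith((".jpg", ".png"))
    )
    return BASE_URL + path if looks_valid else FALLBACK


print(poster_url("/abc123.jpg"))
print(poster_url(None))
```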

Profit and ROI
Profit is calculated as revenue - budget.
ROI is calculated as (revenue - budget) / budget * 100.
Only movies with valid revenue and budget are included in further analysis.
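The two formulas can be applied with Pandas as in the sketch below. The toy rows are made up, and "valid" is assumed here to mean a numeric budget and revenue that are both greater than zero; the project's exact validity rule may differ.

```python
import pandas as pd

# Toy rows; the budget is given as strings to show the numeric conversion.
movies = pd.DataFrame({
    "title": ["A", "B", "C"],
    "budget": ["100", "0", "50"],
    "revenue": [250.0, 80.0, None],
})

# Coerce to numbers; anything unparsable becomes NaN.
for col in ("budget", "revenue"):
    movies[col] = pd.to_numeric(movies[col], errors="coerce")

# Assumed validity rule: both values present and strictly positive.
# A zero budget would also make the ROI division meaningless.
valid = movies[(movies["budget"] > 0) & (movies["revenue"] > 0)].copy()

valid["profit"] = valid["revenue"] - valid["budget"]
valid["roi"] = valid["profit"] / valid["budget"] * 100

print(valid[["title", "profit", "roi"]])
```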

Genre Filtering
Movies are grouped by genre for profit and ROI analysis. Each genre is separately sorted by year for the animated visualization.
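One way to do this grouping, sketched on toy data: explode the genre lists so that each movie appears once per genre, then keep one year-sorted frame per genre. The column names (year, genres, profit) are assumptions for the example.

```python
import pandas as pd

# Toy data: genres are already parsed into Python lists.
movies = pd.DataFrame({
    "title": ["A", "B", "C"],
    "year": [2001, 1999, 2003],
    "genres": [["Action", "Comedy"], ["Action"], ["Comedy"]],
    "profit": [150.0, 40.0, -10.0],
})

# One row per (movie, genre) pair.
exploded = movies.explode("genres").rename(columns={"genres": "genre"})

# One frame per genre, each sorted by year for the animation.
by_genre = {
    genre: group.sort_values("year").reset_index(drop=True)
    for genre, group in exploded.groupby("genre")
}

print(by_genre["Action"][["title", "year", "profit"]])
```

A movie with several genres contributes its full profit to each of them, so per-genre totals should not be summed across genres.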

Data Analysis

Top Movies by Profit
The system identifies movies with the highest profit and displays their title, budget, profit, and ROI.

Genre-Based Profit Analysis
A bar plot visualizes total profit per genre. ROI per genre is also visualized to understand efficiency in terms of investment return.
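A minimal sketch of the two bar plots, on made-up numbers. ROI per genre is computed here as total profit divided by total budget, which is one possible definition and an assumption of this example; averaging the per-movie ROI would give different values.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Toy rows, one per (movie, genre) pair.
rows = pd.DataFrame({
    "genre": ["Action", "Action", "Comedy", "Drama"],
    "budget": [100.0, 50.0, 20.0, 40.0],
    "revenue": [250.0, 90.0, 80.0, 30.0],
})

# Aggregate per genre, then derive profit and ROI from the totals.
totals = rows.groupby("genre")[["budget", "revenue"]].sum()
totals["profit"] = totals["revenue"] - totals["budget"]
totals["roi"] = totals["profit"] / totals["budget"] * 100

fig, (ax_profit, ax_roi) = plt.subplots(1, 2, figsize=(10, 4))
totals["profit"].sort_values(ascending=False).plot.bar(
    ax=ax_profit, title="Total profit per genre")
totals["roi"].sort_values(ascending=False).plot.bar(
    ax=ax_roi, title="ROI per genre (%)")
fig.tight_layout()
```

In this toy data Action has the largest total profit while Comedy has the highest ROI, which is the kind of contrast the two plots are meant to expose.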

Animated Profit Trends
Animated bar charts show profit trends per genre over the years. The animation uses matplotlib.animation.FuncAnimation.
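The general shape of such an animation is sketched below with FuncAnimation. The yearly profit figures, genre names and axis limit are invented for the example, and the project's own charts may be structured differently.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

# Toy profit per genre and year (hypothetical numbers).
years = [2000, 2001, 2002]
genres = ["Action", "Comedy"]
profit_by_year = {
    2000: {"Action": 100.0, "Comedy": 60.0},
    2001: {"Action": 120.0, "Comedy": 90.0},
    2002: {"Action": 80.0, "Comedy": 150.0},
}

fig, ax = plt.subplots()
bars = ax.bar(genres, [0.0] * len(genres))
ax.set_ylim(0, 200)
ax.set_ylabel("Profit")


def update(frame):
    """Redraw the bars for the year at position `frame`."""
    year = years[frame]
    for bar, genre in zip(bars, genres):
        bar.set_height(profit_by_year[year][genre])
    ax.set_title(f"Profit per genre, {year}")
    return list(bars)


# One frame per year, shown 800 ms apart.
anim = FuncAnimation(fig, update, frames=len(years), interval=800, repeat=False)
```

With an interactive backend, plt.show() would play the animation; anim.save can write it to a file if a suitable writer (for example Pillow for GIFs) is installed.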

Author
Karina Antoniu – UNSTPB, Facultatea de Automatica si Calculatoare Bucuresti
26.07.2024

Datasets were downloaded from Kaggle: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset/data
