In this project, we put ourselves in the shoes of a data scientist who needs to take charge of a movie recommendation system for streaming platforms like Netflix, Amazon Prime, Disney Plus, and Hulu. The problem is that the data is unprocessed, there are no automated processes, and so on. Our task is to quickly perform data engineering work and have an MVP (minimum viable product) ready in about a week.
All data analysis was performed in Python, using the following libraries:

- Pandas
- Numpy
- Scikit-Learn
- Matplotlib
- Seaborn
There are 4 datasets of movies and series from different streaming platforms:

- Amazon: ./data/amazon_prime_titles.csv
- Disney Plus: ./data/disney_plus_titles.csv
- Hulu: ./data/hulu_titles.csv
- Netflix: ./data/netflix_title.csv
Each dataset has the following columns:

- `show_id`: string, "s" + number, e.g. s125
- `type`: Movie / TV Show
- `title`: title of the movie or show
- `director`: director(s)
- `cast`: cast members
- `country`: country of production
- `date_added`: date the title was added to the platform
- `release_year`: release year
- `rating`: audience rating (+13, +18, etc.)
- `duration`: duration, e.g. "180 min" or "3 seasons"
- `listed_in`: category (romance, drama, horror, ...)
- `description`: description of the title
The following transformations were applied:

- A new `id` column was added, formed by the first letter of the platform followed by the `show_id`
- Null values in the `rating` column were replaced with "G" (general audience)
- Dates were parsed from "Month_name day, year" to yyyy/mm/dd
- All text fields were transformed to lowercase
- In the Hulu dataset, values from the `duration` column had been mistakenly included in the `rating` column; this problem was fixed
- NaNs in the `duration` column were replaced with "0 min"
- The `duration` column was split into two columns: `duration_int` and `duration_type`
- Even after all these transformations, several columns still contain missing data. Much of the missing data in the `director`, `cast`, and `country` columns of one dataset (e.g. Netflix) could probably be found in another dataset (e.g. Amazon); I leave this as a future task
- For now, all missing values are replaced with "unknown " + cast/country/...
- A large dataframe was created, and the `date_added` and `duration` columns were dropped
- The large dataframe was saved to /data/clean/all_together_clean.csv
Note: when working locally, I saved each individual cleaned dataframe to a separate CSV file, but those files were included in the .gitignore file.
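The cleaning steps above can be sketched in pandas. This is a minimal illustration on two made-up Netflix-style rows, not the project's actual code; note that because the lowercase pass runs after the null replacement, the filled "G" ends up as "g":

```python
import pandas as pd

# Two made-up rows with the columns from the data dictionary above
df = pd.DataFrame({
    "show_id": ["s1", "s2"],
    "rating": ["TV-MA", None],
    "date_added": ["September 25, 2021", None],
    "title": ["Dick Johnson Is Dead", "Blood & Water"],
    "duration": ["90 min", None],
})

# New `id` column: first letter of the platform ("n" for Netflix) + show_id
df["id"] = "n" + df["show_id"]

# Nulls in `rating` -> "G" (lowercased to "g" by the later step)
df["rating"] = df["rating"].fillna("G")

# Parse "Month_name day, year" into yyyy/mm/dd
df["date_added"] = pd.to_datetime(df["date_added"], format="%B %d, %Y").dt.strftime("%Y/%m/%d")

# Lowercase every text field
for col in df.select_dtypes("object"):
    df[col] = df[col].str.lower()

# NaNs in `duration` -> "0 min", then split into duration_int / duration_type
df["duration"] = df["duration"].fillna("0 min")
df[["duration_int", "duration_type"]] = df["duration"].str.split(" ", n=1, expand=True)
df["duration_int"] = df["duration_int"].astype(int)
```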
We need to concatenate the user review information, which is available in 8 .csv files:

- No missing data
- The 8 files were loaded and their data saved in one big dataframe
- The `timestamp` column was dropped
- The `rating` column was renamed to `score` to match the nomenclature used in the platform datasets
- The result was saved to /data/clean/all_ratings.csv (added to the .gitignore file because it is too large)

Finally, the user reviews were merged with the movie information and saved to /data/clean/all_together_with_score
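The concatenate-and-merge step might look like the following sketch, using two tiny stand-in frames in place of the real 8 review files:

```python
import pandas as pd

# Two tiny stand-ins for the 8 review CSVs (real file names not shown here)
part1 = pd.DataFrame({"userId": [1, 2], "rating": [4.5, 3.0],
                      "timestamp": [111, 222], "movieId": ["ns1", "as2"]})
part2 = pd.DataFrame({"userId": [3], "rating": [5.0],
                      "timestamp": [333], "movieId": ["ns1"]})

# Concatenate, drop `timestamp`, rename `rating` -> `score`
ratings = (
    pd.concat([part1, part2], ignore_index=True)
    .drop(columns=["timestamp"])
    .rename(columns={"rating": "score"})
)

# Merge reviews with the movie table via the composite `id` built earlier
movies = pd.DataFrame({"id": ["ns1", "as2"], "title": ["movie a", "movie b"]})
all_together = ratings.merge(movies, left_on="movieId", right_on="id")
```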
➡️ Documentation: API documentation

➡️ Public repository used to deploy the API on Render: API GitHub repo
🟢 GET /api
The API will return some info about other routes
Full path: https://noeliamovieapideploy.onrender.com/api
{
"routes": {
"/api/max_duration": "Movie with longer duration with optional filters of year, platform and duration_type",
"/api/score_count": "Number of films by platform with a score greater than XX in a given year",
"/api/count_platform": "Number of movies per platform. The platform must be specified.",
"/api/actor": "Actor who appears the most times according to platform and year. "
}
}
🟢 GET /api/max_duration
If called without query parameters, the API will return the movie with the longest duration across all platforms, years, and duration types (seasons or minutes).
Full path: https://noeliamovieapideploy.onrender.com/api/max_duration
{
"movie": "soothing surf at del norte for sleep black screen"
}
🟢 GET /api/max_duration?year=2019
The API will return the movie with the longest duration across all platforms and duration types (seasons or minutes) for the year 2019.
Full path: https://noeliamovieapideploy.onrender.com/api/max_duration?year=2019
{
"movie": "box fan medium 8 hours for sleep"
}
🟢 GET /api/max_duration?platform=netflix
The API will return the movie with the longest duration on Netflix across all years and duration types.
Full path: https://noeliamovieapideploy.onrender.com/api/max_duration?platform=netflix
{
"movie": "box fan medium 8 hours for sleep"
}
🟢 GET /api/max_duration?year=2020&platform=Netflix&duration_type=seasons
The API will return the Netflix movie with the longest duration measured in seasons for the year 2020.
Full path: https://noeliamovieapideploy.onrender.com/api/max_duration?year=2020&platform=Netflix&duration_type=seasons
{
"movie": "grey's anatomy"
}

🟡 If any of the above queries does not match any results, the API will return:
{
"message": "No results"
}

🟢 GET /api/score_count?platform=Netflix&scored=3.6&year=2020
The API will return the number of films on the given platform (Netflix) with a score greater than the given threshold (3.6) in the given year (2020).
Full path: https://noeliamovieapideploy.onrender.com/api/score_count?platform=Netflix&scored=3.6&year=2020
{
"number_of_films": 71
}

🟢 GET /api/count_platform?platform=amazon
The API will return the number of movies on the given platform (Amazon).
Full path: https://noeliamovieapideploy.onrender.com/api/count_platform?platform=amazon
{
"number_of_films": 9668
}

🟡 If the platform is not Netflix, Amazon, Hulu, or Disney (or the lowercase variations of their names), the API will return:
{
"detail": [
{
"loc": [
"query",
"platform"
],
"msg": "value is not a valid enumeration member; permitted: 'disney', 'netflix', 'amazon', 'hulu', 'Disney', 'Netflix', 'Amazon', 'Hulu'",
"type": "type_error.enum",
"ctx": {
"enum_values": [
"disney",
"netflix",
"amazon",
"hulu",
"Disney",
"Netflix",
"Amazon",
"Hulu"
]
}
}
]
}

🟢 GET /api/actor?platform=Netflix&year=2019
The API will return the actor who appears the most times on the specified platform and year.
Full path: https://noeliamovieapideploy.onrender.com/api/actor?platform=Netflix&year=2019
{
"actor": "vincent tong"
}

🟡 If the query does not match any results, the API will return:
{
"message": "No results"
}

In this project, a movie recommendation system was developed using the collaborative filtering technique through Singular Value Decomposition (SVD).
A recommendation system based on collaborative filtering uses data from multiple users to recommend items to a particular user. It works by analyzing user behavior and preferences and identifying patterns in the behavior of similar users. Based on these patterns, the system can predict which items a user might be interested in and recommend them. Collaborative filtering can be implemented with different techniques, such as user-based or item-based filtering, and is used in applications such as e-commerce, social media, and entertainment platforms.
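The core idea can be illustrated with plain NumPy (this is a toy illustration, not the project's model; treating zeros as "unrated" is a simplification): factor a small user-item matrix with SVD and use a truncated, low-rank reconstruction to estimate scores for unrated items.

```python
import numpy as np

# Rows = users, columns = movies; 0 marks an unrated movie.
# Users 0-1 like movies 0-1; users 2-3 like movies 2-3.
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2  # keep the two strongest latent factors
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# R_hat[0, 2] is the estimated score of user 0 for the movie at column 2,
# which user 0 has not rated
print(R_hat[0, 2])
```

In the low-rank reconstruction, user 0's estimated score for movie 0 remains well above the estimate for movie 2, which is the pattern a recommender exploits.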
- The dataset is severely imbalanced, with many more reviews with 3, 4, and 5 ratings than reviews with 1 and 2 ratings. This will cause the recommendation system (if no balancing technique is used) to have a bias towards higher ratings and tend to recommend all movies.
- Another issue is that there are many users who have made very few reviews. The type of recommendation system we are implementing requires users to express their preferences to recommend future items based on their similarity in tastes with other users.
- There are several outliers among users, with one user having more than 18,000 recommendations.
We tried different strategies to achieve a robust recommendation system that can address the issues mentioned above.
- Train a model only with users who have made at least Q1=15 recommendations (where Q1 is the first quartile of the data). This way, we removed users with very few reviews.
- We also removed the upper outliers, i.e., those users with a number of reviews greater than (Q3 + 1.5 IQR) = 207 reviews, where Q3 is the third quartile and IQR is the interquartile range.
- We created 10 artificial datasets using oversampling for the underrepresented classes and subsampling for the majority classes. This way, we ensured balanced classes.
- Each dataset was split into a training set and a test set (15% for the test set).
- We trained an ensemble of 10 SVD models, each using a different one of these datasets. The final recommendation decision was based on a majority vote (more than half of the models).
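The balancing step described above might be sketched like this, using toy review data (the column names follow the cleaned ratings file; the target class size `n` is arbitrary here):

```python
import pandas as pd

# Toy review data: many high scores, few low ones
reviews = pd.DataFrame({
    "userId": range(14),
    "movieId": ["m1"] * 14,
    "score": [5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 2, 1, 1],
})

n = 4  # target number of rows per rating class
balanced = pd.concat(
    [
        # oversample (with replacement) small classes, subsample large ones
        grp.sample(n=n, replace=len(grp) < n, random_state=0)
        for _, grp in reviews.groupby("score")
    ],
    ignore_index=True,
)
```

Repeating this with different random seeds would yield the 10 artificial balanced datasets used to train the ensemble.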
```python
import numpy as np


def make_ensemble_predictions(userId, movieId):
    """
    Ensemble of 10 SVD models
    """
    preds = []
    for i in range(10):
        p1 = models[i].predict(userId, movieId).est
        preds.append(p1)
    preds = np.array(preds)
    votes = preds > 2.5  # a model votes "yes" if its estimate exceeds 2.5
    nvotes = np.count_nonzero(votes)
    if nvotes > 4:
        return "Recommended Movie"
    else:
        return "Not recommended Movie"
```

```python
from surprise import accuracy

# Predictions across all models, each evaluated on its own test set
all_preds = []
for idx, model in enumerate(models):
    all_preds.append(model.test(train_test_sets[idx][1]))

all_rmse = []
for p in all_preds:
    all_rmse.append(accuracy.rmse(p))
```
RMSE: 1.2439
RMSE: 1.2438
RMSE: 1.2439
RMSE: 1.2439
RMSE: 1.2438
RMSE: 1.2439
RMSE: 1.2438
RMSE: 1.2440
RMSE: 1.2438
RMSE: 1.2439

These results mean that the error in recommending a movie that a user would rate as 3.5 (recommended movie) could score it on the lower end as 2.2 (not recommended movie).
Similarly, a movie rated 2.3 could be scored as 3.5 and recommended to a user who clearly has no preference for it.
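The arithmetic behind this interpretation, using the reported RMSE and the 2.5 voting threshold from the ensemble function above:

```python
# With RMSE ~ 1.24 and a recommendation threshold of 2.5, a movie a user
# would truly rate 3.5 can plausibly be estimated below the threshold
rmse = 1.24
true_rating = 3.5
lower_estimate = true_rating - rmse
print(lower_estimate)  # about 2.26, below the 2.5 "recommended" cutoff
```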


