This is a beginner-friendly Python project where I explore and analyze movie data from the IMDb dataset. The goal was to practice loading, cleaning, analyzing, and visualizing real-world data, all using Python!
- Downloads IMDb datasets (automatically!)
- Loads and merges data from multiple .tsv files
- Analyzes:
  - Top 10 highest-rated movies
  - Average ratings by decade
  - Top directors (with at least 3 movies)
- Generates bar charts and line plots for each of the above
- Saves results as CSV and PNG files in the output folder
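The three analyses listed above can be sketched with pandas roughly like this. It is a minimal illustration, not the actual code in analysis.py: the function names, the column names (`title`, `year`, `rating`, `director`) and the tiny sample table are all made up for the example.

```python
import pandas as pd

# Tiny made-up sample standing in for the merged IMDb data (hypothetical columns).
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F"],
    "year": [1994, 1999, 2001, 2008, 2010, 2014],
    "rating": [9.3, 8.8, 8.1, 9.0, 8.8, 8.6],
    "director": ["X", "Y", "Y", "Z", "Z", "Z"],
})

def top_rated(df, n=10):
    """Return the n highest-rated rows, best first."""
    return df.nlargest(n, "rating")

def ratings_by_decade(df):
    """Average rating per decade (a 1994 movie falls in the 1990 bucket)."""
    decade = (df["year"] // 10) * 10
    return df.groupby(decade)["rating"].mean().rename_axis("decade")

def top_directors(df, min_movies=3):
    """Average rating per director, keeping only directors with at least min_movies titles."""
    stats = df.groupby("director")["rating"].agg(movies="count", avg_rating="mean")
    qualified = stats[stats["movies"] >= min_movies]
    return qualified.sort_values("avg_rating", ascending=False)

print(top_rated(movies, 3))
print(ratings_by_decade(movies))
print(top_directors(movies))
```

On real IMDb data it probably also makes sense to require a minimum number of votes before ranking, since titles with only a handful of votes can have extreme averages.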
imdb-movie-analysis/
├── data/ # IMDb .tsv files go here (downloaded automatically)
├── output/ # Results (CSV + charts) saved here
├── downloader.py # Downloads and extracts the IMDb data
├── main.py # Main script that runs the full pipeline
├── data_loader.py # Loads and merges datasets
├── analysis.py # Functions for calculating top movies, decades, etc.
├── visualize.py # Generates plots from the analysis
├── requirements.txt # Required packages (pandas, matplotlib)
├── .gitignore # Ignores data and output in Git
└── README.md # This file
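For a rough idea of what the download step in downloader.py involves, here is a minimal sketch using only the standard library. The base URL is IMDb's public dataset host, but the list of files and the function names (`download_file`, `extract_gzip`, `fetch_all`) are assumptions for illustration and may not match the real script.

```python
import gzip
import os
import shutil
import urllib.request

# IMDb's public dataset host. Which files the project needs is an assumption.
BASE_URL = "https://datasets.imdbws.com/"
FILES = ["title.basics.tsv.gz", "title.ratings.tsv.gz"]

def download_file(url, dest_path):
    """Download url to dest_path, skipping the download if the file already exists."""
    if not os.path.exists(dest_path):
        urllib.request.urlretrieve(url, dest_path)
    return dest_path

def extract_gzip(gz_path, out_path):
    """Decompress a .gz file to out_path, streaming so large files are not held in memory."""
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path

def fetch_all(data_dir="data"):
    """Download and extract every file in FILES into data_dir."""
    os.makedirs(data_dir, exist_ok=True)
    for name in FILES:
        gz_path = download_file(BASE_URL + name, os.path.join(data_dir, name))
        extract_gzip(gz_path, gz_path[:-3])  # drop the ".gz" suffix

# fetch_all()  # uncomment to actually download (the files are large)
```

Skipping files that already exist keeps repeated runs fast, at the cost of never refreshing stale data unless the data/ folder is cleared.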
- Make sure you have Python 3 installed
- Install required packages (you can use a virtual environment if you like):
  pip install -r requirements.txt
- Download the data and run the analysis:
  python downloader.py
  python main.py
- Check the output/ folder for results!
- output/top_10_movies.csv and top10_movies.png
- output/ratings_by_decade.csv and decade_ratings.png
- output/top_directors.csv and top_directors.png
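Each result above is a CSV plus a chart. One way to write such a pair is sketched below. The helper `save_results` is hypothetical, and it uses the same base name for both files, which is simpler than the naming the project actually uses.

```python
import os

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so saving works without a display
import matplotlib.pyplot as plt
import pandas as pd

def save_results(series, name, output_dir="output", kind="bar"):
    """Write a pandas Series to <name>.csv and a chart of it to <name>.png."""
    os.makedirs(output_dir, exist_ok=True)
    csv_path = os.path.join(output_dir, f"{name}.csv")
    png_path = os.path.join(output_dir, f"{name}.png")

    series.to_csv(csv_path)

    fig, ax = plt.subplots(figsize=(8, 4))
    series.plot(kind=kind, ax=ax)  # kind="bar" or kind="line"
    ax.set_ylabel(series.name or "value")
    fig.tight_layout()
    fig.savefig(png_path)
    plt.close(fig)  # free the figure so repeated calls do not pile up
    return csv_path, png_path

# Example with made-up numbers:
# decade_avg = pd.Series([6.1, 6.3], index=[1990, 2000], name="average rating")
# save_results(decade_avg, "ratings_by_decade", kind="line")
```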
- How to use pandas for real dataset merging and cleaning
- Simple data analysis and grouping with groupby
- Creating plots with matplotlib
- Organizing code into multiple Python files (modularity)
- Using os, urllib, gzip, and file paths
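To make the merging and cleaning point concrete, here is a small sketch of loading two IMDb-style TSV files and joining them. The column names (`tconst`, `titleType`, `primaryTitle`, `startYear`, `averageRating`, `numVotes`) and the `\N` missing-value marker follow the public IMDb files. The function names and the exact cleaning steps are assumptions, not necessarily what data_loader.py does.

```python
import io

import pandas as pd

def load_tsv(source, columns):
    # IMDb marks missing values with a literal backslash-N, so map it to NaN.
    return pd.read_csv(source, sep="\t", usecols=columns,
                       na_values=r"\N", low_memory=False)

def load_movies(basics_source, ratings_source):
    """Return one row per rated movie that has a known release year."""
    basics = load_tsv(basics_source, ["tconst", "titleType", "primaryTitle", "startYear"])
    ratings = load_tsv(ratings_source, ["tconst", "averageRating", "numVotes"])

    movies = basics[basics["titleType"] == "movie"]           # drop shorts, series, etc.
    merged = movies.merge(ratings, on="tconst", how="inner")  # keep rated titles only
    return (
        merged.dropna(subset=["startYear"])
        .assign(startYear=lambda d: d["startYear"].astype(int))
    )

# Usage with in-memory sample data instead of real files:
basics_text = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\n"
    "tt1\tmovie\tAlpha\t1999\n"
    "tt2\tshort\tBeta\t2001\n"
    "tt3\tmovie\tGamma\t\\N\n"
)
ratings_text = (
    "tconst\taverageRating\tnumVotes\n"
    "tt1\t8.5\t1000\n"
    "tt2\t7.0\t50\n"
    "tt3\t6.0\t10\n"
)
print(load_movies(io.StringIO(basics_text), io.StringIO(ratings_text)))
```

An inner join drops titles without ratings, which suits a ratings analysis. A left join would keep them with NaN ratings instead.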
Made by a junior Python developer learning how to analyze data with code. I built this to learn more about working with real datasets and writing modular Python scripts.
Feel free to fork or improve it!