This repository contains clean, publishable material for multiple ML notebooks with their required datasets.
- Graded project notebook: predict `genre` from movie/series metadata and text features.
- TD Project 01 notebook: introductory data manipulation and analysis tasks.
- TD Project 02 notebook: classification workflow on breast cancer data.
- TD Project 03 notebook: predictive maintenance classification workflow.
- TD Project 04 notebook: classification experiments on heart disease, stars, and glass datasets.
- `TD_2026_S2_M1_ML_Project_Graded.ipynb`: complete notebook (problem framing, preprocessing, modeling, evaluation)
- `imdb_descr_titles.csv`: input dataset
- `TD_2026_S2_M1_ML_Project_04.ipynb`: TD notebook
- `heart_disease_classification.csv`: dataset used in TD Project 04
- `stars_nasa_classification.csv`: dataset used in TD Project 04
- `glass_classification.csv`: dataset used in TD Project 04
- `TD_2026_S2_M1_ML_Project_01.ipynb`: TD notebook
- `EU_countries.csv`: dataset used in TD Project 01
- `TD_2026_S2_M1_ML_Project_02.ipynb`: TD notebook
- `breast_cancer.csv`: dataset used in TD Project 02
- `TD_2026_S2_M1_ML_Project_03.ipynb`: TD notebook
- `predictive_maintenance.csv`: dataset used in TD Project 03
The graded notebook follows a full ML workflow:
- Problem framing (multiclass supervised classification)
- Data preprocessing
  - target construction from `genres`
  - text feature engineering (`title` + `description`)
  - categorical preprocessing (One-Hot Encoding)
  - numeric imputation
- Modeling
  - `DummyClassifier` (baseline)
  - `LogisticRegression`
  - `LinearSVC`
  - `MultinomialNB` (text baseline)
- Evaluation
  - cross-validation metrics
  - test metrics
  - comparison table and short commentary
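The workflow above could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code. It assumes that `genres` is a comma-separated string whose first entry serves as the class label, that the text features are TF-IDF vectors, and that accuracy is the metric. The `type` and `rating` columns and all data values are hypothetical stand-ins for the real columns of `imdb_descr_titles.csv`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import LinearSVC

# Toy stand-in for imdb_descr_titles.csv. The `type` and `rating` columns
# and every value below are hypothetical.
rows = [
    ("Quiet Harbor", "A widow rebuilds her life by the sea.", "Drama, Romance", "movie", 7.1),
    ("Last Letter", "A family secret tears two brothers apart.", "Drama", "movie", 6.8),
    ("Winter Ward", "Nurses face loss in a rural hospital.", "Drama, History", "series", float("nan")),
    ("Paper Walls", "A teacher fights for her failing school.", "Drama", "series", 7.4),
    ("Office Chaos", "Coworkers prank each other all week.", "Comedy", "series", 6.2),
    ("Wedding Mess", "A best man loses the rings and his mind.", "Comedy, Romance", "movie", 5.9),
    ("Flat Share", "Four roommates and one tiny bathroom.", "Comedy", "series", 6.6),
    ("Dad Jokes", "A comedian moves back in with his father.", "Comedy, Drama", "movie", 6.0),
    ("Steel Rain", "A soldier races to stop a missile launch.", "Action, Thriller", "movie", 6.5),
    ("Night Convoy", "Truckers fight hijackers on a desert road.", "Action", "movie", 5.8),
    ("Zero Hour", "An agent hunts a bomber across the city.", "Action, Crime", "series", 7.0),
    ("Iron Harbor", "Dockworkers battle a smuggling cartel.", "Action", "series", 6.3),
]
df = pd.DataFrame(rows, columns=["title", "description", "genres", "type", "rating"])

# Target construction (assumption): use the first listed genre as the class.
y = df["genres"].str.split(",").str[0].str.strip()

# Text feature engineering: concatenate title and description.
X = df.assign(text=df["title"].fillna("") + " " + df["description"].fillna(""))
X = X[["text", "type", "rating"]]

# Text vectorization, one-hot encoding, and numeric imputation in one transformer.
preprocess = ColumnTransformer([
    ("text", TfidfVectorizer(), "text"),  # scalar name: vectorizer gets a 1D column
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["type"]),
    ("num", SimpleImputer(strategy="median"), ["rating"]),
])
# Text-only variant for the MultinomialNB text baseline.
text_only = ColumnTransformer([("text", TfidfVectorizer(), "text")])

pipelines = {
    "dummy": Pipeline([("prep", preprocess), ("clf", DummyClassifier(strategy="most_frequent"))]),
    "logreg": Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))]),
    "linear_svc": Pipeline([("prep", preprocess), ("clf", LinearSVC())]),
    "multinomial_nb": Pipeline([("prep", text_only), ("clf", MultinomialNB())]),
}

# Cross-validated accuracy per model, collected into a small comparison table.
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
scores = {
    name: cross_val_score(pipe, X, y, cv=cv, scoring="accuracy").mean()
    for name, pipe in pipelines.items()
}
print(pd.Series(scores, name="cv_accuracy").sort_values(ascending=False))
```

Keeping the preprocessing inside each `Pipeline` means it is refit on every training fold, so no information from the validation fold leaks into the vectorizer, encoder, or imputer.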
- Python 3.10+
- `numpy`
- `pandas`
- `scikit-learn`
- `jupyter`
Install dependencies (if needed):

`pip install numpy pandas scikit-learn jupyter`

To run the graded project notebook:

- Keep `imdb_descr_titles.csv` in the same directory as the notebook.
- Open `TD_2026_S2_M1_ML_Project_Graded.ipynb`.
- Run all cells sequentially from top to bottom.

To run the TD Project 04 notebook:

- Keep these files in the same directory as the notebook:
  - `heart_disease_classification.csv`
  - `stars_nasa_classification.csv`
  - `glass_classification.csv`
- Open `TD_2026_S2_M1_ML_Project_04.ipynb`.
- Run all cells sequentially from top to bottom.

To run the TD Project 01 notebook:

- Keep `EU_countries.csv` in the same directory as the notebook.
- Open `TD_2026_S2_M1_ML_Project_01.ipynb`.
- Run all cells sequentially from top to bottom.

To run the TD Project 02 notebook:

- Keep `breast_cancer.csv` in the same directory as the notebook.
- Open `TD_2026_S2_M1_ML_Project_02.ipynb`.
- Run all cells sequentially from top to bottom.

To run the TD Project 03 notebook:

- Keep `predictive_maintenance.csv` in the same directory as the notebook.
- Open `TD_2026_S2_M1_ML_Project_03.ipynb`.
- Run all cells sequentially from top to bottom.
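
Since every notebook expects its datasets next to it, a quick pre-flight check can catch a missing file before any cell runs. The sketch below is not part of the repository. The helper name `missing_datasets` is hypothetical, and the notebook-to-dataset mapping is taken from the file list above.

```python
from pathlib import Path

# Datasets each notebook expects to find in its own directory.
REQUIRED_FILES = {
    "TD_2026_S2_M1_ML_Project_Graded.ipynb": ["imdb_descr_titles.csv"],
    "TD_2026_S2_M1_ML_Project_01.ipynb": ["EU_countries.csv"],
    "TD_2026_S2_M1_ML_Project_02.ipynb": ["breast_cancer.csv"],
    "TD_2026_S2_M1_ML_Project_03.ipynb": ["predictive_maintenance.csv"],
    "TD_2026_S2_M1_ML_Project_04.ipynb": [
        "heart_disease_classification.csv",
        "stars_nasa_classification.csv",
        "glass_classification.csv",
    ],
}


def missing_datasets(notebook_dir="."):
    """Return {notebook: [absent dataset files]} for the given directory."""
    base = Path(notebook_dir)
    missing = {}
    for notebook, datasets in REQUIRED_FILES.items():
        absent = [name for name in datasets if not (base / name).is_file()]
        if absent:
            missing[notebook] = absent
    return missing


# Report anything missing from the current working directory.
for notebook, absent in missing_datasets().items():
    print(f"{notebook}: missing {', '.join(absent)}")
```

An empty result means all datasets are in place for every notebook.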
The notebook uses fixed random seeds in key steps (e.g., stratified split / CV settings where applicable) to keep results stable across runs.
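
As a rough illustration of what fixing seeds buys, the sketch below repeats a stratified split and a shuffled stratified CV with the same `random_state` and checks that the partitions are identical. The seed value `42` and the synthetic arrays are assumptions for the example; the seed actually used in the notebook is not stated here.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

SEED = 42  # hypothetical value, chosen only for this example

# Synthetic data: 20 samples, 2 balanced classes.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)


def split_once(seed):
    """Stratified hold-out split controlled entirely by `seed`."""
    return train_test_split(X, y, test_size=0.25, stratify=y, random_state=seed)


# Two calls with the same seed yield the same test set.
_, X_test_a, _, y_test_a = split_once(SEED)
_, X_test_b, _, y_test_b = split_once(SEED)
same_split = np.array_equal(X_test_a, X_test_b) and np.array_equal(y_test_a, y_test_b)

# A shuffled StratifiedKFold with an integer seed yields the same folds on every call.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
folds_a = [test_idx.tolist() for _, test_idx in cv.split(X, y)]
folds_b = [test_idx.tolist() for _, test_idx in cv.split(X, y)]
same_folds = folds_a == folds_b

print("identical hold-out split:", same_split)
print("identical CV folds:", same_folds)
```

Estimators with internal randomness would need their own `random_state` as well for results to match across runs.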
- The dataset size is compatible with standard GitHub limits (well below 100 MB).
- The repository is intentionally minimal and focused on the published notebooks and their datasets.