Mushroom Classification & Unsupervised Learning

1. OBJECTIVE

The goal of this project is to apply unsupervised techniques (PCA and clustering) to a mushroom dataset and to compare how dimensionality reduction affects a supervised model such as Random Forest.

2. TECHNOLOGIES USED

- Data Analysis: Pandas, NumPy, Missingno.
- Visualization: Matplotlib, Seaborn, Plotly.
- Machine Learning: Scikit-learn (PCA, RandomForestClassifier, KMeans).
- Jupyter Notebooks.
- VSCode.
- Git.

3. REPOSITORY STRUCTURE

├── notebooks/
│   └── p9_unsupervised_rocio.ipynb
├── data/
│   ├── data_processed/
│   │   ├── mushrooms_clean.csv
│   │   └── mushrooms_clean.parquet
│   └── data_raw/
│       └── mushrooms.csv
├── README.md
└── requirements.txt

4. HOW TO RUN

  1. Clone the repository and enter it
git clone https://github.com/rociolozanocaro/Unsupervised_Modeling
cd Unsupervised_Modeling
  2. Create and activate a virtual environment
python -m venv .venv

# Windows
.venv\Scripts\activate

# Mac/Linux
source .venv/bin/activate
  3. Install dependencies
pip install -r requirements.txt

5. DATASET

The dataset is the Mushroom Dataset from Kaggle.

It contains only categorical variables: a set of features describing the physical characteristics of mushrooms, plus a target variable.

The target variable, 'class', indicates whether a mushroom is 'edible' or 'poisonous'.

6. DATA CLEANING

The preprocessing steps included:

  • Converting all columns to categorical type.
  • Checking for null values.
  • Handling the '?' category in 'stalk-root' using KNN Imputation.
  • Removing the 'veil-type' column (only one category).
  • Checking for duplicated rows. There were none.

7. EXPLORATORY DATA ANALYSIS

7.1. Univariate Analysis

Bar plots were used to analyze the distribution of each categorical variable.

Barplot image

The target variable is roughly balanced between the two classes.

7.2. Bivariate Analysis

We analyzed the relationship between the target variable (class) and each feature.

Stackedbar image

Some categories perfectly separate edible and poisonous mushrooms.

Crosstab

Here we can see that 'odor' is a strong predictor of whether a mushroom is 'edible' or 'poisonous': most of its categories belong almost entirely to one class. Only one category, 'n' (none, pink in the figure above), is somewhat mixed between both classes.
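A crosstab like the one described can be produced with `pandas.crosstab`. The sketch below uses a small made-up frame (the values are not taken from the dataset) and shows both raw counts and per-category class shares.

```python
import pandas as pd

# Made-up sample; the real data has many more rows and categories.
df = pd.DataFrame({
    "odor":  ["n", "n", "n", "f", "f", "a", "a", "n"],
    "class": ["e", "e", "p", "p", "p", "e", "e", "e"],
})

# Raw counts of each class per odor category.
counts = pd.crosstab(df["odor"], df["class"])

# Row-normalised shares: each odor category sums to 1 across classes.
shares = pd.crosstab(df["odor"], df["class"], normalize="index")

print(counts)
print(shares)
```

A category whose row share is 1.0 for one class separates that class perfectly, while a row with mixed shares corresponds to a category such as 'n' above.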

7.3. Multivariate Analysis

Cramér’s V was used to measure associations between categorical variables.

Cramer Matrix

This confirms what we saw earlier: 'odor' and 'class' are strongly associated.
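For reference, Cramér's V can be computed from a contingency table as the square root of chi-squared divided by n times (min(rows, columns) - 1). The function below is a minimal sketch without bias correction; the name `cramers_v` and the toy inputs are illustrative, and the notebook may use a different implementation.

```python
import numpy as np
import pandas as pd


def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V (no bias correction) between two categorical series."""
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    min_dim = min(observed.shape) - 1
    if min_dim == 0:
        # A variable with a single category carries no association information.
        return 0.0
    # Expected counts under independence: row totals * column totals / n.
    expected = observed.sum(axis=1, keepdims=True) * observed.sum(axis=0, keepdims=True) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    return float(np.sqrt(chi2 / (n * min_dim)))


# Toy inputs: 'odor' fully determines 'class', 'shape' is independent of it.
odor = pd.Series(["n", "n", "f", "f"])
shape = pd.Series(["x", "b", "x", "b"])
klass = pd.Series(["e", "e", "p", "p"])

print(cramers_v(odor, klass))
print(cramers_v(shape, klass))
```

Values close to 1 indicate a strong association and values close to 0 indicate near independence. Applying the function to every pair of columns yields the matrix shown in the figure.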

8. MODELS

8.1. PCA (Dimensionality Reduction)

Principal Component Analysis was applied to reduce dimensionality.

  • Two and three components were used for 2D and 3D visualizations, colored by class and by cluster.

PCA 2D class

PCA 2D clusters

PCA 3D clusters

  • Around 8–12 components preserved most of the predictive power. For reference, 10 components explain about 64% of the cumulative variance (see 8.2).

Accuracy vs number of PCA
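The following sketch shows how PCA projections and cumulative explained variance can be obtained with scikit-learn. It assumes the categorical features are one-hot encoded before PCA (the README only says "encoded"), and it uses a small random frame rather than the mushroom data, so the numbers it prints are not the project's results.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Random categorical frame standing in for the cleaned dataset.
df = pd.DataFrame({
    "odor": rng.choice(list("nfa"), size=200),
    "cap-color": rng.choice(list("nyw"), size=200),
    "gill-size": rng.choice(list("bn"), size=200),
})

# Assumption: one-hot encoding turns each category into a 0/1 column.
X = pd.get_dummies(df).to_numpy(dtype=float)

# Two components for a 2D scatter plot (use n_components=3 for 3D).
X_2d = PCA(n_components=2, random_state=0).fit_transform(X)

# Full PCA to inspect how variance accumulates with the number of components.
full_pca = PCA().fit(X)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

for k, share in enumerate(cumulative, start=1):
    print(f"{k} components: {share:.2%} of variance")
```

Plotting `cumulative` against the number of components gives the usual curve for choosing how many components to keep.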

8.2. Random Forest Classifier

A Random Forest classifier was trained to predict mushroom class with and without reducing dimensionality.

| Model | Features/Components | Variance explained | Accuracy |
|---|---|---|---|
| Encoded | 115 features | 100% | 1.0 |
| PCA | 10 components | 64% | 0.9993 |

Reducing 115 encoded features to 10 principal components lowered accuracy by only 0.0007, so dimensionality reduction did not significantly reduce performance.
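A comparison of this kind can be set up as sketched below. The data, the labelling rule, the train/test split and the hyperparameters are all assumptions made for illustration; the README does not describe the evaluation protocol, so this will not reproduce the figures in the table.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(42)
n = 400

# Random categorical features standing in for the cleaned dataset.
features = pd.DataFrame({
    "odor": rng.choice(list("nfa"), size=n),
    "cap-color": rng.choice(list("nyw"), size=n),
    "gill-size": rng.choice(list("bn"), size=n),
})
# Toy rule standing in for the real labels: foul odor means poisonous.
y = np.where(features["odor"] == "f", "p", "e")

# Assumption: one-hot encoding of the categorical features.
X = pd.get_dummies(features).to_numpy(dtype=float)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

models = {
    # Random Forest on all encoded features.
    "encoded": RandomForestClassifier(random_state=42),
    # PCA fitted on the training split only, followed by Random Forest.
    "pca": make_pipeline(
        PCA(n_components=4, random_state=42),
        RandomForestClassifier(random_state=42),
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

print(scores)
```

Wrapping PCA and the classifier in one pipeline ensures the projection is learned from the training split only, which avoids leaking test data into the comparison.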

8.3. K-Means Clustering

The elbow method was used to choose the number of clusters for K-Means on the PCA-transformed data (see 8.1).

Elbow Method

The curve suggests that 4 to 6 clusters is a reasonable choice.
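The elbow method fits K-Means for a range of cluster counts and records the inertia (within-cluster sum of squared distances) for each. A minimal sketch follows, assuming 2D points that stand in for PCA scores; the synthetic blobs and the range of k are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic 2D points standing in for PCA scores: four noisy blobs.
centers = np.array([[0, 0], [5, 5], [0, 5], [5, 0]])
scores_2d = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit K-Means for each candidate k and store the inertia.
ks = range(1, 9)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(scores_2d)
    inertias.append(km.inertia_)

for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

Plotting `inertias` against `ks` gives the elbow curve; the point where the decrease flattens is taken as a reasonable number of clusters.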

Plotting those clusters showed partial separation between edible and poisonous mushrooms: only one cluster contains a mix of both classes.

clusters

9. CONCLUSIONS

  • The dataset is highly separable.
  • Random Forest achieved perfect accuracy (1.0) on the 115 encoded features.
  • PCA reduced the data to 10 components while keeping accuracy at 0.9993.
  • Some categorical features strongly determine mushroom toxicity.
  • Unsupervised clustering partially reflects class separation.

10. AUTHOR

| Name | Contact |
|---|---|
| Rocio Lozano Caro | LinkedIn, GitHub |
