MDVR9980/Golestan-Clustering

🌹 Golestan Clustering Analysis


📖 Overview

This project applies Unsupervised Machine Learning techniques to analyze and cluster the text of Saadi's Golestan, one of the most significant literary works in the Persian language.

The goal is to group anecdotes (Hikayats) by semantic similarity and compare the machine-generated clusters with the original 8 chapters (Babs) defined by Saadi. We use TF-IDF for vectorization and LSA for dimensionality reduction, then compare the K-Means and DBSCAN algorithms.

🚀 Key Features

  • Advanced Persian Preprocessing:
    • Normalization and Tokenization using Hazm.
    • Removal of general stopwords and a curated list of archaic/literary stopwords specific to classical Persian texts (e.g., "گفت" ("said"), "ملک" ("king"), "شنیدم" ("I heard")).
    • Lemmatization to reduce word variations.
  • Feature Extraction: TF-IDF Vectorization with L2 Normalization.
  • Dimensionality Reduction: Latent Semantic Analysis (LSA/TruncatedSVD) to handle sparsity and improve clustering performance.
  • Clustering Algorithms:
    • K-Means: With Elbow Method and Silhouette Analysis to find optimal $k$.
    • DBSCAN: Density-based clustering with automatic Epsilon determination using K-distance graphs.
  • Visualization: Confusion Matrices (Heatmaps) to evaluate cluster alignment with real chapters.
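
As a rough illustration of the feature extraction and dimensionality reduction steps listed above, the sketch below chains scikit-learn's `TfidfVectorizer` and `TruncatedSVD`. The toy documents, `n_components=3`, and the final re-normalization step are assumptions made for illustration; they are not values taken from the notebook.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Toy stand-ins for preprocessed anecdotes (hypothetical; the real input is data/golestan.csv).
docs = [
    "king justice vizier court",
    "king army vizier treasure",
    "dervish poverty contentment bread",
    "dervish prayer contentment silence",
    "teacher pupil education discipline",
    "teacher pupil education wisdom",
]

lsa = make_pipeline(
    # norm="l2" is the TfidfVectorizer default; it is spelled out to match the feature list.
    TfidfVectorizer(norm="l2"),
    # n_components=3 is an illustrative guess; it must stay below the vocabulary size.
    TruncatedSVD(n_components=3, random_state=42),
    # Re-normalize rows so Euclidean distance between vectors tracks cosine similarity.
    Normalizer(copy=False),
)

X_lsa = lsa.fit_transform(docs)
print(X_lsa.shape)  # one dense row per document, one column per LSA component
```

The dense `X_lsa` matrix is the kind of input both K-Means and DBSCAN would receive in place of the sparse TF-IDF matrix.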

📂 Project Structure

Golestan_Clustering/
│
├── data/
│   └── golestan.csv       # The dataset containing anecdotes and labels
│
├── outputs/               # Saved visualizations (optional)
│
├── main.ipynb             # The main Jupyter Notebook with all logic
│
├── README.md              # Project documentation
│
└── requirements.txt       # Python dependencies

🛠️ Installation

  1. Clone the repository:

    git clone https://github.com/mdvr0480/Golestan-Clustering.git
    cd Golestan-Clustering
  2. Create a Virtual Environment (Optional but recommended):

    python -m venv venv
    # Linux/Mac
    source venv/bin/activate
    # Windows
    .\venv\Scripts\activate
  3. Install Dependencies:

    pip install -r requirements.txt

📊 Methodology & Results

1. Preprocessing

We cleaned the text by removing punctuation and high-frequency verbs that carry little topical meaning. This keeps the model focused on themes (e.g., "Justice", "Love", "Education") rather than grammar.
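
The cleaning step can be sketched as follows. This is a simplified stand-in, not the notebook's code: plain string operations replace Hazm's normalizer, tokenizer, and lemmatizer so the example runs without extra dependencies. The function name `preprocess` and the sample sentence are hypothetical, and the stopword set contains only the three words quoted in the feature list.

```python
import re
import string

# Archaic/literary stopwords quoted in this README; the project's full list is longer.
LITERARY_STOPWORDS = frozenset({"گفت", "ملک", "شنیدم"})

# ASCII punctuation plus the Persian comma, semicolon, question mark, and guillemets.
PUNCTUATION = re.compile("[" + re.escape(string.punctuation + "،؛؟«»") + "]")


def preprocess(text, stopwords=LITERARY_STOPWORDS):
    """Replace punctuation with spaces, split on whitespace, and drop stopwords."""
    cleaned = PUNCTUATION.sub(" ", text)
    # Without Hazm's normalization, variant spellings (e.g. Arabic vs. Persian kaf)
    # would not match the stopword set; this sketch assumes already-normalized text.
    return [token for token in cleaned.split() if token not in stopwords]


tokens = preprocess("ملک گفت: عدل، بهتر از ظلم")
print(tokens)
```

Here the two literary stopwords and the punctuation are dropped, while general stopwords such as "از" survive because this sketch does not load a general stopword list.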

2. K-Means Analysis

  • Optimal K: Using the Elbow Method and Silhouette Score, we identified the optimal number of clusters (around $k=8$).
  • Interpretation: The resulting clusters showed overlap with the original chapters, successfully grouping stories about "Kings and Rulers" separately from stories about "Dervishes" or "Education".
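
A minimal sketch of how the Elbow Method and Silhouette Score can be computed side by side is shown below. It runs on synthetic blobs rather than the Golestan vectors, and the range of `k` values, `n_init=10`, and the random seeds are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the LSA-reduced document vectors (hypothetical data).
X, _ = make_blobs(
    n_samples=200,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)

inertias, silhouettes = {}, {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = model.inertia_  # within-cluster sum of squares, plotted for the elbow
    silhouettes[k] = silhouette_score(X, model.labels_)

# Pick the k with the highest mean silhouette; the elbow plot serves as a visual cross-check.
best_k = max(silhouettes, key=silhouettes.get)
print(best_k)
```

Inertia always shrinks as `k` grows, so it only suggests a range; the silhouette score gives a single number per `k` that can be compared directly.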

3. DBSCAN Analysis

  • Parameter Tuning: We used a K-distance graph to find the optimal eps value.
  • Noise Handling: DBSCAN identified outliers (Label -1), which is expected in literary texts where some anecdotes are unique and do not fit strictly into dense semantic clusters.
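
The K-distance idea can be sketched as below. The README does not say how the knee of the curve is located, so the chord-based heuristic here is an assumption, as are the synthetic data and `min_samples=5` (scikit-learn's default for `DBSCAN`).

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for the LSA vectors: three dense groups plus three isolated points.
blobs, _ = make_blobs(
    n_samples=150, centers=[[0, 0], [8, 8], [0, 8]], cluster_std=0.5, random_state=0
)
outliers = np.array([[4.0, 4.0], [12.0, 0.0], [-5.0, 4.0]])
X = np.vstack([blobs, outliers])

min_samples = 5  # illustrative choice, not a value taken from the notebook

# Distance from each point to its min_samples-th nearest neighbor, counting the point itself.
distances, _ = NearestNeighbors(n_neighbors=min_samples).fit(X).kneighbors(X)
k_dist = np.sort(distances[:, -1])  # plotting this sorted curve gives the K-distance graph

# Knee heuristic (an assumption): the point farthest below the chord joining the curve's ends.
chord = np.linspace(k_dist[0], k_dist[-1], num=len(k_dist))
eps = float(k_dist[np.argmax(chord - k_dist)])

labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
n_clusters = len(set(labels) - {-1})
n_noise = int((labels == -1).sum())
print(f"eps={eps:.3f}, clusters={n_clusters}, noise points={n_noise}")
```

Points labelled `-1` are the ones DBSCAN treats as noise; in this toy setup the three isolated points appended at the end fall into that group.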

📈 Visualizations

The project generates several key plots:

  • Elbow & Silhouette Graph: To decide the number of clusters.
  • K-distance Graph: To tune DBSCAN.
  • Confusion Matrix Heatmap: To visualize the correspondence between predicted clusters and actual chapters.
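
The table behind such a heatmap can be built with `pandas.crosstab`, as in the sketch below. The chapter and cluster labels are made up for illustration, and the purity score is an extra summary added here, not a metric reported by the project.

```python
import pandas as pd

# Hypothetical labels: true chapter (Bab) numbers and cluster ids for eight anecdotes.
chapters = [1, 1, 1, 2, 2, 7, 7, 7]
clusters = [0, 0, 3, 3, 3, 5, 5, 5]

# Rows are chapters, columns are clusters, cells count the anecdotes in each pairing.
table = pd.crosstab(
    pd.Series(chapters, name="chapter"),
    pd.Series(clusters, name="cluster"),
)
print(table)

# Purity: share of anecdotes that belong to the majority chapter of their cluster.
purity = table.max(axis=0).sum() / table.to_numpy().sum()
print(f"purity={purity:.3f}")
```

Because cluster ids are arbitrary, the table is read column by column: a cluster aligns well with a chapter when most of its count sits in a single row. Passing `table` to `seaborn.heatmap(table, annot=True, fmt="d")` renders it as a heatmap.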

📦 Libraries Used

  • Pandas & NumPy: Data manipulation.
  • Scikit-Learn: Machine Learning algorithms (KMeans, DBSCAN, TF-IDF, SVD).
  • Hazm: Persian NLP toolkit.
  • Matplotlib & Seaborn: Data visualization.

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

About

🌹 Unsupervised Semantic Clustering of Saadi's Golestan. A Persian NLP project comparing K-Means & DBSCAN algorithms to regroup anecdotes based on semantic similarity using TF-IDF & LSA.
