This project applies Unsupervised Machine Learning techniques to analyze and cluster the text of Saadi's Golestan, one of the most significant literary works in the Persian language.
The goal is to group anecdotes (Hikayats) based on semantic similarity and compare the machine-generated clusters with the original 8 chapters (Babs) defined by Saadi. We utilize TF-IDF for vectorization, LSA for dimensionality reduction, and compare K-Means vs. DBSCAN algorithms.
## Features

- **Advanced Persian Preprocessing:**
  - Normalization and tokenization using Hazm.
  - Removal of general stopwords and a curated list of archaic/literary stopwords specific to classical Persian texts (e.g., "گفت" *goft* "said", "ملک" *malek* "king", "شنیدم" *shenidam* "I heard").
  - Lemmatization to reduce word variations.
- **Feature Extraction:** TF-IDF vectorization with L2 normalization.
- **Dimensionality Reduction:** Latent Semantic Analysis (LSA/TruncatedSVD) to handle sparsity and improve clustering performance.
- **Clustering Algorithms:**
  - K-Means: with the Elbow Method and Silhouette Analysis to find the optimal $k$.
  - DBSCAN: density-based clustering with automatic epsilon determination using K-distance graphs.
- **Visualization:** Confusion matrix heatmaps to evaluate cluster alignment with the real chapters.
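The feature-extraction steps above can be sketched end to end. This is a minimal illustration assuming scikit-learn is installed; the English toy strings are placeholders for the preprocessed Persian anecdotes, and the component count is arbitrary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Toy corpus standing in for preprocessed anecdotes (real texts come out of the
# Hazm normalization/tokenization/lemmatization pipeline).
docs = [
    "king justice ruler court",
    "dervish poverty contentment",
    "love youth beauty",
    "king ruler army war",
]

# TF-IDF with L2 normalization (norm="l2" is also scikit-learn's default).
vectorizer = TfidfVectorizer(norm="l2")
X = vectorizer.fit_transform(docs)

# LSA: truncated SVD applied directly to the sparse TF-IDF matrix.
svd = TruncatedSVD(n_components=2, random_state=42)
X_lsa = svd.fit_transform(X)
print(X_lsa.shape)  # (4, 2): one dense low-dimensional vector per document
```

The dense `X_lsa` matrix is what the clustering algorithms below would consume.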
## Project Structure

```
Golestan_Clustering/
│
├── data/
│   └── golestan.csv        # The dataset containing anecdotes and labels
│
├── outputs/                # Saved visualizations (optional)
│
├── main.ipynb              # The main Jupyter Notebook with all logic
│
├── README.md               # Project documentation
│
└── requirements.txt        # Python dependencies
```
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/mdvr0480/Golestan-Clustering.git
   cd Golestan-Clustering
   ```

2. Create a virtual environment (optional but recommended):

   ```bash
   python -m venv venv

   # Linux/Mac
   source venv/bin/activate

   # Windows
   .\venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
## Methodology

We cleaned the text by removing punctuation and high-frequency verbs that carry no semantic weight for topic modeling. This keeps the model focused on themes (e.g., "Justice", "Love", "Education") rather than grammar.
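The stopword-removal step can be illustrated with a small stdlib-only filter. The real pipeline tokenizes with Hazm first, and the archaic words below are examples, not the project's full curated list:

```python
# Illustrative stopword filtering; the real pipeline tokenizes with Hazm first.
# These archaic stopwords are examples only, not the project's curated list.
ARCHAIC_STOPWORDS = {"گفت", "ملک", "شنیدم"}   # "said", "king", "I heard"
GENERAL_STOPWORDS = {"و", "که", "در", "از", "به"}
STOPWORDS = ARCHAIC_STOPWORDS | GENERAL_STOPWORDS

def remove_stopwords(tokens):
    """Drop both general and archaic/literary stopwords from a token list."""
    return [t for t in tokens if t not in STOPWORDS]

tokens = ["شنیدم", "پادشاهی", "به", "کشتن", "اسیری", "اشارت", "کرد"]
print(remove_stopwords(tokens))  # narrative verbs and particles are gone
```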
## Results

### K-Means
- **Optimal K:** Using the Elbow Method and Silhouette Score, we identified the optimal number of clusters (around $k=8$).
- **Interpretation:** The resulting clusters showed substantial overlap with the original chapters, successfully grouping stories about "Kings and Rulers" separately from stories about "Dervishes" or "Education".
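The Elbow/Silhouette search can be sketched as follows, assuming scikit-learn; the synthetic blobs stand in for the LSA-reduced anecdote vectors:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the LSA-reduced anecdote vectors: three separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 5)) for c in (0.0, 2.0, 4.0)])

inertias, silhouettes = {}, {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_                      # Elbow Method: look for the bend
    silhouettes[k] = silhouette_score(X, km.labels_)

best_k = max(silhouettes, key=silhouettes.get)     # highest mean silhouette wins
print(best_k)  # 3 for this synthetic data; the Golestan run lands near k=8
```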
### DBSCAN
- **Parameter Tuning:** We used a K-distance graph to find the optimal `eps` value.
- **Noise Handling:** DBSCAN identified outliers (label -1), which is expected in literary texts where some anecdotes are unique and do not fit strictly into dense semantic clusters.
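The K-distance heuristic for `eps` can be sketched like this, assuming scikit-learn; the synthetic data and the quantile used in place of reading the knee by eye are both illustrative choices:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
# Two dense groups plus a few scattered points, standing in for document vectors.
X = np.vstack([
    rng.normal(0.0, 0.2, size=(40, 3)),
    rng.normal(3.0, 0.2, size=(40, 3)),
    rng.uniform(-5, 8, size=(5, 3)),
])

# K-distance graph: the sorted distance of each point to its k-th nearest
# neighbor; eps is normally read off at the "knee" of this curve.
k = 4
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: each point is its own neighbor
dist, _ = nn.kneighbors(X)
k_distances = np.sort(dist[:, -1])

eps = float(np.quantile(k_distances, 0.9))        # crude stand-in for eyeballing the knee
labels = DBSCAN(eps=eps, min_samples=k).fit_predict(X)
n_noise = int(np.sum(labels == -1))               # points labeled -1 are outliers
```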
## Visualizations

The project generates several key plots:
- Elbow & Silhouette Graph: To decide the number of clusters.
- K-distance Graph: To tune DBSCAN.
- Confusion Matrix Heatmap: To visualize the correlation between predicted clusters and actual chapters.
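The alignment matrix behind the heatmap can be sketched with scikit-learn; the toy labels are illustrative, and the seaborn call is left as a comment since rendering is environment-dependent:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels: chapter IDs vs. predicted cluster IDs. Cluster numbering is
# arbitrary, so strong alignment appears as one dominant cell per row,
# not necessarily along the diagonal.
true_chapters = np.array([0, 0, 0, 1, 1, 2, 2, 2])
pred_clusters = np.array([1, 1, 1, 0, 0, 2, 2, 2])

cm = confusion_matrix(true_chapters, pred_clusters)  # rows: chapters, cols: clusters
# import seaborn as sns; sns.heatmap(cm, annot=True, fmt="d")  # renders the heatmap
```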
## Tech Stack

- Pandas & NumPy: Data manipulation.
- Scikit-Learn: Machine Learning algorithms (KMeans, DBSCAN, TF-IDF, SVD).
- Hazm: Persian NLP toolkit.
- Matplotlib & Seaborn: Data visualization.
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.