Analogy Explorer is a Natural Language Processing (NLP) engine that learns semantic relationships purely from context. Trained on The Adventures of Sherlock Holmes, this model can solve vector analogies (e.g., Holmes : Detective :: Watson : ?) and visualize word relationships in a 2D space.
Unlike standard models trained on large corpora such as Wikipedia, this model learns exclusively from a single book, which makes it a testbed for exploring Data Sparsity and Narrative Bias.
- Vector Arithmetic: Solves analogies of the form A : B :: C : ? using the formula B - A + C, then returns the word whose vector is closest to the result.
- Interactive CLI: A robust command-line interface with color-coded output and error handling.
- Small-Data Tuning: Optimized hyperparameters (epochs=30, vector_size=100) to extract signal from a limited corpus (~100k words).
- Bias Exploration: Demonstrates how AI reflects its training data (e.g., correlating "King" with "Bohemia" rather than generic royalty).
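The vector-arithmetic idea can be sketched without a trained model. The snippet below is a minimal illustration, not the project's implementation: the 2D vectors are hand-made toy values and `solve_analogy` is a hypothetical helper. It computes b - a + c and returns the remaining word with the highest cosine similarity to that point.

```python
import math

# Hand-made toy vectors for illustration only (assumption);
# real embeddings come from the trained Word2Vec model.
TOY_VECTORS = {
    "holmes": (1.0, 0.0),
    "detective": (1.0, 1.0),
    "watson": (-1.0, 0.0),
    "doctor": (-1.0, 1.0),
    "bohemia": (0.0, -1.0),
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def solve_analogy(a, b, c, vectors=TOY_VECTORS):
    """Answer a : b :: c : ? by finding the word nearest to b - a + c."""
    target = tuple(
        vb - va + vc
        for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])
    )
    # The three query words are excluded so the answer is a new word.
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(solve_analogy("holmes", "detective", "watson"))  # prints "doctor"
```

With real embeddings the nearest neighbor is rarely an exact match, which is why a similarity score accompanies each answer.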
- Language: Python 3.x
- Core Logic: Gensim (Word2Vec)
- Preprocessing: NLTK (Tokenization, Stopword removal)
- Visualization: Matplotlib & Scikit-Learn (PCA for dimensionality reduction)
- Frontend: Streamlit (Optional Web Dashboard)
git clone https://github.com/Adesh2204/Analogy-Explorer.git
cd Analogy-Explorer
To avoid "Dependency Hell" (specifically with scipy versions), install the exact dependencies:
pip install -r requirements.txt
If you don't have a requirements file yet, use this command:
pip install "scipy<1.13" gensim nltk scikit-learn matplotlib streamlit
Run the script to interact with the model directly in your terminal.
python demo.py
Sample Input:
holmes detective watson
Sample Output:
Analogy: holmes is to detective as watson is to... DOCTOR (Confidence: 0.65)
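For readers curious how an input line like the one above could turn into that output, here is a rough sketch. The function names `parse_query`, `solve`, and `format_answer` are hypothetical and the real demo.py may be organized differently. The only library call assumed is Gensim's `most_similar(positive=..., negative=..., topn=...)` on a `KeyedVectors` object.

```python
def parse_query(line):
    """Split a line such as 'holmes detective watson' into three lowercase words."""
    words = line.lower().split()
    if len(words) != 3:
        raise ValueError("expected exactly three words, e.g. 'holmes detective watson'")
    return tuple(words)


def solve(wv, a, b, c):
    """Answer a : b :: c : ? with any object exposing Gensim's most_similar API."""
    # positive=[b, c] and negative=[a] correspond to the vector b - a + c.
    word, score = wv.most_similar(positive=[b, c], negative=[a], topn=1)[0]
    return word, score


def format_answer(a, b, c, word, score):
    """Render the result in the style of the sample output."""
    return f"Analogy: {a} is to {b} as {c} is to... {word.upper()} (Confidence: {score:.2f})"
```

To try it against the shipped model, something like `wv = Word2Vec.load("sherlock_analogy.model").wv` (with `from gensim.models import Word2Vec`) should work, assuming the file was written with Gensim's `model.save()`. Words outside the model's vocabulary raise a `KeyError`, so a CLI would want to catch that.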
Launch the modern dashboard for a visual experience.
streamlit run app.py
Standard NLP models are trained on billions of words. This model was trained on one book. To prevent overfitting and "noise," several engineering decisions were made:
- High Epochs (30): The model was forced to "re-read" the book 30 times to converge on stable vector representations.
- Reduced Dimensions (100d): A standard 300d vector space would be too sparse for a single novel. 100 dimensions provided the right balance of complexity and density.
- Strict Filtering: Words appearing fewer than 5 times were discarded to prevent the model from learning "garbage" correlations.
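The three decisions above map onto Gensim's `Word2Vec` parameters `epochs`, `vector_size`, and `min_count`. The sketch below is an assumption about how training could be set up, not a copy of the training notebook: every other parameter is left at Gensim's default, and `surviving_vocab` is a hypothetical helper that previews which words pass the frequency filter.

```python
from collections import Counter

# Values stated in this README; everything else stays at Gensim's defaults (assumption).
HYPERPARAMS = {"vector_size": 100, "epochs": 30, "min_count": 5}


def surviving_vocab(sentences, min_count=HYPERPARAMS["min_count"]):
    """Return the words that occur at least min_count times in the corpus."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}


def train(sentences):
    """Train Word2Vec on a list of tokenized sentences (requires gensim 4.x)."""
    # Imported here so the helper above can be used without gensim installed.
    from gensim.models import Word2Vec

    return Word2Vec(sentences=sentences, **HYPERPARAMS)
```

Running `surviving_vocab` on the tokenized book before training is a cheap way to see how much of the vocabulary a given `min_count` discards.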
The model reflects the world of Sherlock Holmes, not the real world:
- ✅ Grammar: see -> saw :: go -> went (Learned verb tenses perfectly).
- ✅ Context: holmes -> detective :: watson -> doctor (Learned professional roles).
- ⚠️ Bias: man -> king :: woman -> ? results in Bohemia (referring to the King of Bohemia), not Queen. This highlights how dataset bias shapes AI behavior.
Analogy-Explorer/
├── demo.py # Main CLI script for testing analogies
├── app.py # (Optional) Streamlit Dashboard
├── sherlock_analogy.model # The trained binary Word2Vec model
├── training_script.ipynb # (Optional) The Colab notebook used for training
├── requirements.txt # Dependencies
└── README.md # Project documentation
Contributions are welcome! If you want to train this on a different corpus (e.g., Harry Potter or Pride and Prejudice), feel free to fork the repo and submit a Pull Request.
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Distributed under the MIT License. See LICENSE for more information.
Adesh Kumar
- GitHub: @Adesh2204