Adesh2204/Analogy-Explorer

🕵️‍♂️ Analogy Explorer: Sherlock Edition

Analogy Explorer is a Natural Language Processing (NLP) engine that learns semantic relationships purely from context. Trained on The Adventures of Sherlock Holmes, this model can solve vector analogies (e.g., Holmes : Detective :: Watson : ?) and visualize word relationships in a 2D space.

Unlike standard models trained on Wikipedia, this project explores Data Sparsity and Narrative Bias by learning exclusively from a single novel.


🚀 Features

  • Vector Arithmetic: Solves analogies of the form a : b :: c : ? by finding the word whose vector is closest to vec(b) - vec(a) + vec(c).
  • Interactive CLI: A robust command-line interface with color-coded output and error handling.
  • Small-Data Tuning: Hyperparameters tuned for a small corpus (epochs=30, vector_size=100) to extract signal from roughly 100k words.
  • Bias Exploration: Demonstrates how AI reflects its training data (e.g., correlating "King" with "Bohemia" rather than generic royalty).
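
To make the vector arithmetic concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are hand-made for illustration and are not taken from the trained model; `solve_analogy` and `TOY_VECTORS` are hypothetical names, and Gensim's own similarity query differs in details such as vector normalization.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def solve_analogy(vectors, a, b, c):
    """Return (word, score) for a : b :: c : ? using vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded so the answer is always a new word.
    candidates = [w for w in vectors if w not in (a, b, c)]
    best = max(candidates, key=lambda w: cosine(vectors[w], target))
    return best, cosine(vectors[best], target)

# Hypothetical toy vectors (assumption: NOT values from the real model).
TOY_VECTORS = {
    "holmes":    [1.0, 0.0, 1.0],
    "detective": [1.0, 1.0, 1.0],
    "watson":    [0.0, 0.0, 1.0],
    "doctor":    [0.0, 1.0, 1.0],
    "street":    [1.0, 0.0, 0.0],
}

word, score = solve_analogy(TOY_VECTORS, "holmes", "detective", "watson")
print(word)  # doctor
```

With these toy values the target vector is [0, 1, 1], which coincides with the vector chosen for "doctor". Real embeddings are never this clean, which is why the CLI reports a confidence score alongside the answer.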

🛠️ Tech Stack

  • Language: Python 3.x
  • Core Logic: Gensim (Word2Vec)
  • Preprocessing: NLTK (Tokenization, Stopword removal)
  • Visualization: Matplotlib & Scikit-Learn (PCA for dimensionality reduction)
  • Frontend: Streamlit (Optional Web Dashboard)
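
As a rough illustration of the preprocessing step, the sketch below tokenizes text and removes stopwords using only the standard library. It is a stand-in, not the project's code: the project uses NLTK, and the `preprocess` function, the regex tokenizer, and the tiny `STOPWORDS` set here are all assumptions made for the example.

```python
import re

# Hypothetical mini stopword list; NLTK's English list is much longer.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "was", "in", "he"}

def preprocess(text):
    """Split text into sentences, lowercase, tokenize, and drop stopwords.

    Returns a list of token lists, the shape Word2Vec trainers expect.
    """
    cleaned = []
    for sentence in re.split(r"[.!?]+", text):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        tokens = [t for t in tokens if t not in STOPWORDS]
        if tokens:  # skip sentences that are empty after filtering
            cleaned.append(tokens)
    return cleaned

token_lists = preprocess("Holmes was a detective. Watson is a doctor!")
print(token_lists)  # [['holmes', 'detective'], ['watson', 'doctor']]
```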

⚙️ Installation

1. Clone the Repository

git clone https://github.com/Adesh2204/Analogy-Explorer.git
cd Analogy-Explorer

2. Set Up Environment

To avoid "Dependency Hell" (specifically with scipy versions), install the pinned dependencies:

pip install -r requirements.txt

If you don't have a requirements file yet, use this command:

pip install "scipy<1.13" gensim nltk scikit-learn matplotlib streamlit

🖥️ Usage

Option A: The CLI (Terminal)

Run the script to interact with the model directly in your terminal.

python demo.py

Sample Input:

holmes detective watson

Sample Output:

Analogy: holmes is to detective as watson is to... DOCTOR (Confidence: 0.65)
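
The exchange above could be produced by something like the following sketch. It is not the contents of demo.py: `parse_query` and `answer` are hypothetical helpers, and the code assumes it is handed a Gensim KeyedVectors-style object (for example the `wv` attribute of a loaded Word2Vec model) whose `most_similar` method returns (word, similarity) pairs.

```python
def parse_query(line):
    """Turn 'holmes detective watson' into the tuple (a, b, c)."""
    words = line.lower().split()
    if len(words) != 3:
        raise ValueError("expected exactly three words: a b c")
    return tuple(words)

def answer(keyed_vectors, line):
    """Format the top analogy result for a : b :: c : ?."""
    a, b, c = parse_query(line)
    # positive=[b, c], negative=[a] corresponds to vec(b) - vec(a) + vec(c).
    word, score = keyed_vectors.most_similar(
        positive=[b, c], negative=[a], topn=1
    )[0]
    return (f"Analogy: {a} is to {b} as {c} is to... "
            f"{word.upper()} (Confidence: {score:.2f})")
```

A caller would typically wrap `answer` in a try/except block, since malformed input raises ValueError here and a word outside the model's vocabulary is expected to raise KeyError in Gensim.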

Option B: The Web App (Streamlit)

Launch the Streamlit dashboard to explore the model in a browser.

streamlit run app.py

🧠 Methodology & Engineering Decisions

The Challenge: Data Sparsity

Standard NLP models are trained on billions of words. This model was trained on one book. To limit overfitting and noise, the training setup makes three choices:

  1. High Epochs (30): The model was forced to "re-read" the book 30 times to converge on stable vector representations.
  2. Reduced Dimensions (100d): A standard 300d vector space would be too sparse for a single novel. 100 dimensions provided the right balance of complexity and density.
  3. Strict Filtering: Words appearing fewer than 5 times were discarded to prevent the model from learning "garbage" correlations.
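
The three decisions above map onto Gensim's Word2Vec parameters roughly as follows. This is a sketch under assumptions, not the training notebook itself: `TRAINING_PARAMS` and `train` are hypothetical names, and any setting not listed in this README (window size, skip-gram vs. CBOW, and so on) is left at Gensim's defaults because the README does not state it.

```python
# Settings stated in this README, expressed as Gensim 4.x keyword arguments.
TRAINING_PARAMS = {
    "vector_size": 100,  # reduced dimensions (100d instead of 300d)
    "epochs": 30,        # "re-read" the book 30 times
    "min_count": 5,      # discard words seen fewer than 5 times
}

def train(sentences, **overrides):
    """Train Word2Vec on a list of token lists with the settings above."""
    # Imported lazily so the settings can be inspected without Gensim installed.
    from gensim.models import Word2Vec

    params = {**TRAINING_PARAMS, **overrides}
    return Word2Vec(sentences=sentences, **params)
```

Calling `train(token_lists)` on the preprocessed novel would return a model whose `wv` attribute holds the word vectors; passing `epochs=10`, for instance, overrides a single setting for experimentation.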

Interesting Results

The model reflects the world of Sherlock Holmes, not the real world:

  • Grammar: see -> saw :: go -> went (recovered this irregular past-tense pair).
  • Context: holmes -> detective :: watson -> doctor (Learned professional roles).
  • ⚠️ Bias: man -> king :: woman -> ? results in Bohemia (referring to the King of Bohemia), not Queen. This highlights how dataset bias shapes AI behavior.

📂 Project Structure

Analogy-Explorer/
├── demo.py                  # Main CLI script for testing analogies
├── app.py                   # (Optional) Streamlit Dashboard
├── sherlock_analogy.model   # The trained binary Word2Vec model
├── training_script.ipynb    # (Optional) The Colab notebook used for training
├── requirements.txt         # Dependencies
└── README.md                # Project documentation

🤝 Contributing

Contributions are welcome! If you want to train this on a different corpus (e.g., Harry Potter or Pride and Prejudice), feel free to fork the repo and submit a Pull Request.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📜 License

Distributed under the MIT License. See LICENSE for more information.


Author

Adesh Kumar

About

Analogy Explorer is an NLP engine trained exclusively on The Adventures of Sherlock Holmes to solve vector analogies (e.g., Holmes : Detective :: Watson : ?) using Word2Vec. It features a custom CLI and Streamlit dashboard, demonstrating how to extract deep semantic signals from small datasets by overcoming data sparsity.
