A comprehensive implementation of semantic vector search for e-commerce faceted navigation, demonstrating how to replace traditional keyword-based filtering with AI-powered semantic clustering and search.
Author: Tejaswi Suresh https://www.linkedin.com/in/pseo/
This project showcases a modern approach to e-commerce search that goes beyond keyword matching. Instead of rigid faceted navigation, it uses:
- Semantic embeddings to understand product meaning
- Vector similarity search for intelligent product discovery
- Clustering analysis to automatically group similar products
- Performance monitoring to track search quality
- Embedding Generation: Convert product pages to semantic vectors using OpenAI's text-embedding models
- Vector Search Engine: Fast similarity search using FAISS (Facebook AI Similarity Search)
- Semantic Clustering: Automatically group products by semantic similarity using HDBSCAN
- Performance Analytics: Monitor and analyze search performance with detailed metrics
- Comparison Tools: Side-by-side comparison of semantic vs. keyword search
- Interactive Visualizations: Beautiful plots showing cluster relationships and search results
- Python 3.8+
- OpenAI API key (optional - demo mode available without it)
# Clone the repository
git clone https://github.com/frostyhand/vector-search-seo
cd vector-search-seo
# Run the automated setup script
./setup.sh
The setup script will:
- ✅ Check Python installation
- 📦 Create virtual environment
- 📚 Install all dependencies
- 📝 Create .env file from template
- 🎯 Provide next steps
- Clone the repository
  git clone https://github.com/frostyhand/vector-search-seo
  cd vector-search-seo
- Create a virtual environment
  python -m venv .venv
  source .venv/bin/activate # On Windows: .venv\Scripts\activate
- Install dependencies
  pip install -r requirements.txt
- Set up environment variables (Optional)
  Copy the example environment file:
  cp env.example .env
  Edit .env and add your OpenAI API key:
  OPENAI_API_KEY=your_openai_api_key_here
Get your API key: Visit OpenAI Platform to create an API key.
Note: The system works without an API key using mock embeddings for demonstration purposes.
# Make sure your virtual environment is activated
source .venv/bin/activate # Skip if you used setup.sh
# Launch the interactive demo
python demo.py
This will launch an interactive menu where you can:
- Generate embeddings for sample product data
- Run semantic searches
- Compare semantic vs keyword search
- Perform clustering analysis
- View performance metrics
Generate embeddings:
python embedding_generator.py
Run semantic clustering:
python semantic_clustering.py
Test the search engine:
python vector_search_engine.py
Monitor performance:
python performance_monitor.py
- Visit OpenAI Platform
- Sign in or create an account
- Click "Create new secret key"
- Copy the key and add it to your .env file
- text-embedding-3-small: ~$0.02 per 1M tokens
- Sample dataset: ~$0.01 to process
- Production usage: Monitor usage on OpenAI dashboard
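The pricing above turns into a quick back-of-the-envelope estimate. The sketch below is not part of the repository: `estimate_embedding_cost` and `rough_token_count` are hypothetical helper names, the default price is the ~$0.02 per 1M tokens quoted above, and the 4-characters-per-token figure is only a rough heuristic for English text, not an exact tokenizer count.

```python
def estimate_embedding_cost(num_tokens: int, usd_per_million_tokens: float = 0.02) -> float:
    """Return the approximate USD cost of embedding `num_tokens` tokens."""
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    return num_tokens / 1_000_000 * usd_per_million_tokens


def rough_token_count(text: str) -> int:
    """Rough estimate only: assumes about 4 characters per token."""
    return max(1, len(text) // 4)


if __name__ == "__main__":
    # Hypothetical catalog: 10,000 product pages of about 2,000 characters each.
    tokens = 10_000 * rough_token_count("x" * 2_000)
    print(f"~{tokens:,} tokens, ~${estimate_embedding_cost(tokens):.2f}")
```

For real budgeting, the usage figures on the OpenAI dashboard remain the authoritative source.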
The system includes a comprehensive demo mode that works without an API key:
- Uses semantically meaningful mock embeddings
- Demonstrates all features and concepts
- Perfect for learning and experimentation
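To give a feel for how mock embeddings can work without an API call, here is a minimal sketch based on feature hashing. It is an assumption for illustration, not the repository's actual demo-mode code: `mock_embedding` and `cosine` are hypothetical names, and the 64-dimension default is arbitrary.

```python
import hashlib
import math
import re
from typing import List


def mock_embedding(text: str, dim: int = 64) -> List[float]:
    """Deterministic bag-of-words vector built with feature hashing (illustrative only)."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim  # which dimension the token hits
        sign = 1.0 if digest[4] % 2 == 0 else -1.0       # signed hashing limits collision bias
        vec[index] += sign
    norm = math.sqrt(sum(x * x for x in vec))
    # L2-normalize so that a dot product equals cosine similarity.
    return [x / norm for x in vec] if norm > 0 else vec


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity for vectors that are already L2-normalized."""
    return sum(x * y for x, y in zip(a, b))
```

Because the same token always maps to the same dimension, texts that share words tend to score higher than unrelated texts. This only captures word overlap, not true meaning, which is why real embeddings are still needed for production-quality search.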
{
  "rank": 1,
  "similarity_score": 0.89,
  "title": "Men's Nike Running Shoes Under $100",
  "url": "/shoes/running/men/nike/under-100",
  "facets": {
    "category": "shoes",
    "subcategory": "running",
    "gender": "men",
    "brand": "nike",
    "price_range": "under-100"
  }
}
- Cluster 0: Electronics (laptops, headphones)
- Cluster 1: Athletic footwear and apparel
- Cluster 2: Outdoor clothing and gear
- Noise: Miscellaneous items that don't fit clear patterns
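A cluster listing like the one above can be produced from the label array that HDBSCAN returns, where noise points carry the label -1. The sketch below assumes the labels have already been computed; `group_by_cluster` is a hypothetical helper, not a function from this repository, and the titles in the usage example are made up.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def group_by_cluster(labels: Sequence[int], titles: Sequence[str]) -> Dict[str, List[str]]:
    """Group product titles by cluster label, reporting label -1 as noise."""
    if len(labels) != len(titles):
        raise ValueError("labels and titles must have the same length")
    groups: Dict[str, List[str]] = defaultdict(list)
    for label, title in zip(labels, titles):
        key = "Noise" if label == -1 else f"Cluster {label}"
        groups[key].append(title)
    return dict(groups)


if __name__ == "__main__":
    labels = [0, 1, 0, -1]
    titles = ["Gaming Laptop", "Trail Running Shoes", "Wireless Headphones", "Gift Card"]
    for name, members in group_by_cluster(labels, titles).items():
        print(f"{name}: {', '.join(members)}")
```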
- Search Accuracy: How well results match user intent
- Facet Coverage: Percentage of product facets found through search
- Response Time: Search latency and performance stats
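As one possible reading of the Facet Coverage metric, the sketch below counts the distinct (facet, value) pairs in the catalog and reports the percentage that appear in at least one retrieved result. This definition is an assumption; performance_monitor.py may compute it differently, and `facet_pairs` and `facet_coverage` are hypothetical names.

```python
from typing import Dict, Iterable, List, Set, Tuple

Product = Dict[str, object]


def facet_pairs(products: Iterable[Product]) -> Set[Tuple[str, str]]:
    """Collect every distinct (facet name, facet value) pair from the products."""
    pairs: Set[Tuple[str, str]] = set()
    for product in products:
        for name, value in dict(product.get("facets", {})).items():
            pairs.add((str(name), str(value)))
    return pairs


def facet_coverage(catalog: List[Product], retrieved: List[Product]) -> float:
    """Percentage of catalog facet pairs that show up in the retrieved results."""
    all_pairs = facet_pairs(catalog)
    if not all_pairs:
        return 0.0
    found = facet_pairs(retrieved) & all_pairs
    return 100.0 * len(found) / len(all_pairs)
```

Here `retrieved` would typically be the union of results across a set of test queries, so a low score points to facets that no query surfaces.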
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ Product Pages │───▶│ Embedding │───▶│ Vector Store │
│ (text + facets)│ │ Generation │ │ (FAISS index) │
└─────────────────┘ └──────────────────┘ └─────────────────┘
│
┌─────────────────┐ ┌──────────────────┐ │
│ Search Results │◀───│ Similarity │◀────────────┘
│ (ranked) │ │ Search │
└─────────────────┘ └──────────────────┘
▲
┌──────────────────┐
│ Query Embedding │
│ (real-time) │
└──────────────────┘
├── README.md # This file
├── LICENSE # MIT License
├── requirements.txt # Python dependencies
├── setup.sh # Automated setup script
├── env.example # Environment variables template
├── .gitignore # Git ignore rules
├── demo.py # Interactive demonstration
├── embedding_generator.py # Generate product embeddings
├── vector_search_engine.py # Semantic search implementation
├── semantic_clustering.py # Product clustering analysis
└── performance_monitor.py # Search performance metrics
- Product Discovery: Find similar products across categories
- Recommendation Systems: "Customers who viewed this also liked..."
- Search Autocomplete: Intelligent query suggestions
- Inventory Analysis: Identify product gaps and opportunities
- Content Management: Organize articles, documents, and media
- Customer Support: Intelligent help article suggestions
- Knowledge Bases: Semantic documentation search
- Data Analysis: Cluster and analyze large text datasets
- Model: text-embedding-3-small (OpenAI)
- Dimensions: 1536
- Context Length: 8191 tokens
- Performance: Fast, cost-effective, high-quality
- Index: FAISS IndexFlatIP (Inner Product)
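- Similarity: Cosine similarity, computed as the inner product of L2-normalized vectors
- Speed: Sub-millisecond search on 10K+ products
- Scalability: IndexFlatIP performs exact, exhaustive search, so query time grows linearly with catalog size; for catalogs in the millions, consider an approximate FAISS index such as IndexIVFFlat or IndexHNSWFlat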
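The link between inner product and cosine similarity can be shown in a few lines of plain Python. This is a sketch of the idea only, not the FAISS-backed code in vector_search_engine.py: `l2_normalize` and `top_k_inner_product` are hypothetical names, and the exhaustive loop stands in for what a flat inner-product index does in optimized form.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k_inner_product(
    query: Sequence[float], index: Sequence[Sequence[float]], k: int = 3
) -> List[Tuple[int, float]]:
    """Exhaustively score every stored vector and return the k best (position, score) pairs.

    Assumes every row of `index` is already L2-normalized.
    """
    q = l2_normalize(query)
    scored = [(sum(a * b for a, b in zip(q, row)), i) for i, row in enumerate(index)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:k]]


if __name__ == "__main__":
    index = [l2_normalize(v) for v in ([1.0, 0.0], [1.0, 1.0], [0.0, 1.0])]
    print(top_k_inner_product([2.0, 0.0], index, k=2))
```

Because both sides are normalized, the length of the query vector does not affect the ranking; only its direction does.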
- Algorithm: HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise)
- Advantages: Finds natural clusters, handles noise, and does not require the number of clusters to be specified in advance
- Visualization: UMAP dimensionality reduction for 2D plotting
Replace the sample data in embedding_generator.py with your own product catalog:
def create_custom_product_data(self):
    return [
        {
            'url': '/your/product/url',
            'title': 'Your Product Title',
            'content': 'Your product description',
            'facets': {
                'category': 'your_category',
                'price': 'your_price_range',
                # ... other facets
            }
        },
        # ... more products
    ]
For production use, consider:
- Batch Processing: Process embeddings offline for large catalogs
- Caching: Cache embeddings and search results
- Database Integration: Store embeddings in vector databases (Pinecone, Weaviate, Chroma)
- API Wrapper: Create REST API endpoints for search functionality
- Monitoring: Implement logging and analytics for search behavior
- Batch Size: Optimize embedding generation batch sizes
- Index Type: Consider different FAISS index types for your use case
- Dimensionality: Experiment with different embedding dimensions
- Filtering: Implement pre-filtering for large catalogs
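Two of the suggestions above, batching and caching, combine naturally. The sketch below is one way to do it under stated assumptions: `embed_with_cache` is a hypothetical helper, `embed_batch` stands for whatever function calls your embedding provider with a list of texts, and the cache is a plain dict that a real deployment would replace with a persistent store.

```python
import hashlib
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def embed_with_cache(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[Vector]],
    cache: Dict[str, Vector],
    batch_size: int = 100,
) -> List[Vector]:
    """Embed texts in batches, skipping any text whose content hash is already cached."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    keys = [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts]

    # Collect each uncached text once, preserving first-seen order.
    pending: Dict[str, str] = {}
    for key, text in zip(keys, texts):
        if key not in cache and key not in pending:
            pending[key] = text

    items = list(pending.items())
    for start in range(0, len(items), batch_size):
        chunk = items[start:start + batch_size]
        vectors = embed_batch([text for _, text in chunk])
        if len(vectors) != len(chunk):
            raise ValueError("embed_batch returned an unexpected number of vectors")
        for (key, _), vector in zip(chunk, vectors):
            cache[key] = vector

    return [cache[key] for key in keys]
```

Keying the cache on a hash of the text means an unchanged product description is never re-embedded, which keeps both latency and API cost down when a catalog is reprocessed.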
- Fork the repository
- Create a feature branch: git checkout -b feature-name
- Make your changes and add tests
- Commit your changes: git commit -am 'Add some feature'
- Push to the branch: git push origin feature-name
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
This is a demonstration project showcasing semantic search concepts. For production use:
- Implement proper error handling and logging
- Add comprehensive testing
- Consider scalability and performance requirements
- Review security implications
- Monitor API usage and costs
- Issues: Report bugs and request features via GitHub Issues
- Documentation: Check the inline code documentation
- Questions: Feel free to open a discussion or issue
Made with ❤️ by wolfnuker