Semantic Vector Search for E-commerce

A comprehensive implementation of semantic vector search for e-commerce faceted navigation, demonstrating how to replace traditional keyword-based filtering with AI-powered semantic clustering and search.

Author: Tejaswi Suresh https://www.linkedin.com/in/pseo/

🎯 Overview

This project showcases a modern approach to e-commerce search that goes beyond keyword matching. Instead of rigid faceted navigation, it uses:

Semantic embeddings to understand product meaning
Vector similarity search for intelligent product discovery
Clustering analysis to automatically group similar products
Performance monitoring to track search quality

🚀 Features

Embedding Generation: Convert product pages to semantic vectors using OpenAI's text-embedding models
Vector Search Engine: Fast similarity search using FAISS (Facebook AI Similarity Search)
Semantic Clustering: Automatically group products by semantic similarity using HDBSCAN
Performance Analytics: Monitor and analyze search performance with detailed metrics
Comparison Tools: Side-by-side comparison of semantic vs. keyword search
Interactive Visualizations: Beautiful plots showing cluster relationships and search results

📋 Prerequisites

Python 3.8+
OpenAI API key (optional - demo mode available without it)

🛠️ Installation

Option 1: Automated Setup (Recommended)

```
# Clone the repository
git clone https://github.com/frostyhand/vector-search-seo
cd vector-search-seo

# Run the automated setup script
./setup.sh
```

The setup script will:

✅ Check Python installation
📦 Create virtual environment
📚 Install all dependencies
📝 Create .env file from template
🎯 Provide next steps

Option 2: Manual Setup

Clone the repository

```
git clone https://github.com/frostyhand/vector-search-seo
cd vector-search-seo
```

Create a virtual environment

```
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```

Install dependencies
```
pip install -r requirements.txt
```
Set up environment variables (Optional)

Copy the example environment file:
```
cp env.example .env
```
Edit .env and add your OpenAI API key:
```
OPENAI_API_KEY=your_openai_api_key_here
```
Get your API key: Visit OpenAI Platform to create an API key.

Note: The system works without an API key using mock embeddings for demonstration purposes.
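
The fallback described in the note above can be pictured with a small sketch. It assumes the values from .env have already been loaded into the process environment (how the project loads them is not shown here), and the function name `resolve_embedding_mode` is hypothetical, not taken from the repository.

```python
import os


def resolve_embedding_mode(env=None):
    """Return ("openai", key) when a usable key is configured, else ("mock", None).

    Hypothetical helper for illustration; the project's own logic may differ.
    """
    env = os.environ if env is None else env
    key = (env.get("OPENAI_API_KEY") or "").strip()
    # Treat the placeholder from the example file as "not set".
    if not key or key == "your_openai_api_key_here":
        return ("mock", None)
    return ("openai", key)


if __name__ == "__main__":
    mode, _ = resolve_embedding_mode()
    print(f"Embedding mode: {mode}")
```

Passing a plain dict as `env` makes the check easy to exercise without touching real environment variables.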

🎮 Quick Start

Interactive Demo (Recommended)

```
# Make sure your virtual environment is activated
source .venv/bin/activate  # Skip if you used setup.sh

# Launch the interactive demo
python demo.py
```

This will launch an interactive menu where you can:

Generate embeddings for sample product data
Run semantic searches
Compare semantic vs keyword search
Perform clustering analysis
View performance metrics

Individual Components

Generate embeddings:
```
python embedding_generator.py
```
Run semantic clustering:
```
python semantic_clustering.py
```
Test the search engine:
```
python vector_search_engine.py
```
Monitor performance:
```
python performance_monitor.py
```

🔧 API Key Setup

Getting Your OpenAI API Key

Visit OpenAI Platform
Sign in or create an account
Click "Create new secret key"
Copy the key and add it to your .env file

Cost Considerations

text-embedding-3-small: ~$0.02 per 1M tokens
Sample dataset: ~$0.01 to process
Production usage: Monitor usage on OpenAI dashboard
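
For a rough budget before embedding a catalog, the arithmetic is tokens divided by one million, times the per-million price. The sketch below uses the approximate rate quoted above; the four-characters-per-token heuristic is an assumption for quick estimates, not an exact tokenizer count, and both function names are illustrative.

```python
PRICE_PER_MILLION_TOKENS = 0.02  # USD, the approximate rate quoted above


def estimate_tokens(texts, chars_per_token=4):
    """Very rough token estimate; chars_per_token=4 is a heuristic assumption."""
    # Ceiling division so that short, non-empty texts count as at least one token.
    return sum(-(-len(text) // chars_per_token) for text in texts)


def estimate_embedding_cost(total_tokens, price_per_million=PRICE_PER_MILLION_TOKENS):
    """Return the estimated cost in USD for embedding total_tokens tokens."""
    if total_tokens < 0:
        raise ValueError("total_tokens must be non-negative")
    return total_tokens / 1_000_000 * price_per_million


if __name__ == "__main__":
    descriptions = ["Lightweight running shoe with breathable mesh upper."] * 1000
    tokens = estimate_tokens(descriptions)
    print(f"~{tokens} tokens, ~${estimate_embedding_cost(tokens):.4f}")
```

Actual billing depends on the real tokenizer and current pricing, so treat the result as an order-of-magnitude check.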

Demo Mode (No API Key Required)

The system includes a comprehensive demo mode that works without an API key:

Uses semantically meaningful mock embeddings
Demonstrates all features and concepts
Perfect for learning and experimentation
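
How the project builds its mock embeddings is not shown in this README. One possible approach, sketched here under that caveat, is feature hashing: each word is hashed into a bucket of a fixed-size vector, so texts that share words end up with a positive cosine similarity. The names `mock_embedding` and `cosine` and the dimension of 64 are assumptions for illustration.

```python
import hashlib
import math
import re


def mock_embedding(text, dim=64):
    """Deterministic bag-of-words hashing vector, L2-normalized.

    Illustrative stand-in only; not the project's actual mock embedding.
    """
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    # An empty text has no tokens, so it stays a zero vector.
    return [v / norm for v in vec] if norm else vec


def cosine(a, b):
    """Cosine similarity of two already-normalized vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    shoes = mock_embedding("Nike running shoes")
    similar = mock_embedding("running shoes for men")
    print(f"similarity: {cosine(shoes, similar):.3f}")
```

Unlike a real embedding model, this only captures word overlap, so synonyms such as "sneakers" and "shoes" would not be related.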

📊 Understanding the Output

Semantic Search Results

```
{
  "rank": 1,
  "similarity_score": 0.89,
  "title": "Men's Nike Running Shoes Under $100",
  "url": "/shoes/running/men/nike/under-100",
  "facets": {
    "category": "shoes",
    "subcategory": "running",
    "gender": "men",
    "brand": "nike",
    "price_range": "under-100"
  }
}
```

Clustering Analysis

Cluster 0: Electronics (laptops, headphones)
Cluster 1: Athletic footwear and apparel
Cluster 2: Outdoor clothing and gear
Noise: Miscellaneous items that don't fit clear patterns
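
HDBSCAN reports one integer label per item and uses -1 for items it treats as noise. A minimal sketch of turning such labels into the grouped view above follows; the function name `group_by_cluster` and the sample titles are made up for illustration.

```python
from collections import defaultdict

NOISE_LABEL = -1  # HDBSCAN assigns -1 to points that belong to no cluster


def group_by_cluster(titles, labels):
    """Group item titles by cluster label and return (clusters, noise).

    clusters is a dict {label: [titles]} sorted by label; noise is a list.
    """
    if len(titles) != len(labels):
        raise ValueError("titles and labels must have the same length")
    clusters = defaultdict(list)
    for title, label in zip(titles, labels):
        clusters[int(label)].append(title)
    noise = clusters.pop(NOISE_LABEL, [])
    return dict(sorted(clusters.items())), noise


if __name__ == "__main__":
    titles = ["Gaming laptop", "Trail running shoes", "Wireless headphones", "Gift card"]
    labels = [0, 1, 0, -1]
    clusters, noise = group_by_cluster(titles, labels)
    for label, members in clusters.items():
        print(f"Cluster {label}: {', '.join(members)}")
    print(f"Noise: {', '.join(noise)}")
```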

Performance Metrics

Search Accuracy: How well results match user intent
Facet Coverage: Percentage of product facets found through search
Response Time: Search latency and performance stats
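
The exact formulas used by performance_monitor.py are not given here. As a hedged sketch, facet coverage can be read as the share of distinct (facet, value) pairs in the catalog that appear in at least one search result, and response time as wall-clock latency per query. Both function names and that definition of coverage are assumptions.

```python
import statistics
import time


def facet_coverage(catalog, results):
    """Percentage of distinct (facet, value) pairs in the catalog seen in results."""
    def facet_pairs(products):
        return {(k, v) for p in products for k, v in p.get("facets", {}).items()}

    all_pairs = facet_pairs(catalog)
    if not all_pairs:
        return 0.0
    return 100.0 * len(facet_pairs(results) & all_pairs) / len(all_pairs)


def time_searches(search_fn, queries):
    """Run search_fn on each query and summarize latency in milliseconds."""
    if not queries:
        raise ValueError("queries must not be empty")
    latencies_ms = []
    for query in queries:
        start = time.perf_counter()
        search_fn(query)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "count": len(latencies_ms),
        "mean_ms": statistics.fmean(latencies_ms),
        "max_ms": max(latencies_ms),
    }


if __name__ == "__main__":
    catalog = [
        {"title": "A", "facets": {"category": "shoes", "brand": "nike"}},
        {"title": "B", "facets": {"category": "shoes", "brand": "adidas"}},
    ]
    print(f"coverage: {facet_coverage(catalog, catalog[:1]):.1f}%")
    print(time_searches(lambda q: sorted(q), ["running shoes", "laptop"]))
```

Search accuracy is harder to automate because it needs labeled judgments of which results match user intent, so it is left out of this sketch.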

🏗️ Architecture

```
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│  Product Pages  │───▶│  Embedding       │───▶│  Vector Store   │
│  (text + facets)│    │  Generation      │    │  (FAISS index)  │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                                         │
┌─────────────────┐    ┌──────────────────┐             │
│  Search Results │◀───│  Similarity      │◀────────────┘
│  (ranked)       │    │  Search          │
└─────────────────┘    └──────────────────┘
                                ▲
                       ┌──────────────────┐
                       │  Query Embedding │
                       │  (real-time)     │
                       └──────────────────┘
```

📁 Project Structure

```
├── README.md                          # This file
├── LICENSE                           # MIT License
├── requirements.txt                   # Python dependencies
├── setup.sh                          # Automated setup script
├── env.example                       # Environment variables template
├── .gitignore                        # Git ignore rules
├── demo.py                          # Interactive demonstration
├── embedding_generator.py           # Generate product embeddings
├── vector_search_engine.py          # Semantic search implementation
├── semantic_clustering.py           # Product clustering analysis
└── performance_monitor.py           # Search performance metrics
```

🎯 Use Cases

E-commerce Applications

Product Discovery: Find similar products across categories
Recommendation Systems: "Customers who viewed this also liked..."
Search Autocomplete: Intelligent query suggestions
Inventory Analysis: Identify product gaps and opportunities

Beyond E-commerce

Content Management: Organize articles, documents, and media
Customer Support: Intelligent help article suggestions
Knowledge Bases: Semantic documentation search
Data Analysis: Cluster and analyze large text datasets

🔬 Technical Details

Embedding Model

Model: text-embedding-3-small (OpenAI)
Dimensions: 1536
Context Length: 8191 tokens
Performance: Fast, cost-effective, high-quality

Search Algorithm

Index: FAISS IndexFlatIP (exact inner-product search)
Similarity: Cosine similarity, computed as the inner product of L2-normalized vectors
Speed: Sub-millisecond search on 10K+ products
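Scalability: A flat index is exact but scans every vector on each query; for catalogs with millions of products, an approximate FAISS index (such as IVF or HNSW) is worth evaluating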
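
The reason an inner-product index yields cosine similarity is that, once both vectors have unit length, their dot product equals the cosine of the angle between them. The pure-Python sketch below mirrors what a flat inner-product index computes; it is a stand-in for illustration, not the FAISS API, and the function names are hypothetical. In FAISS itself the corresponding pieces are `faiss.normalize_L2` and `faiss.IndexFlatIP`.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k_inner_product(index_vectors, query, k=3):
    """Brute-force top-k by inner product.

    index_vectors are assumed to be L2-normalized already, as they would be
    before being added to the index. Returns [(position, score), ...].
    """
    q = l2_normalize(query)
    scored = [
        (sum(a * b for a, b in zip(vec, q)), position)
        for position, vec in enumerate(index_vectors)
    ]
    # Highest score first; ties are broken by the lower position.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(position, score) for score, position in scored[:k]]


if __name__ == "__main__":
    index = [l2_normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])]
    for position, score in top_k_inner_product(index, [2.0, 0.0], k=2):
        print(position, round(score, 4))
```

Because every query compares against every stored vector, cost grows linearly with catalog size, which is the trade-off an exact flat index makes.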

Clustering Method

Algorithm: HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise)
Advantages: Finds natural clusters, handles noise, and does not need the number of clusters specified in advance
Visualization: UMAP dimensionality reduction for 2D plotting

🚀 Advanced Usage

Custom Product Data

Replace the sample data in embedding_generator.py with your own product catalog:

```
def create_custom_product_data(self):
    return [
        {
            'url': '/your/product/url',
            'title': 'Your Product Title',
            'content': 'Your product description',
            'facets': {
                'category': 'your_category',
                'price': 'your_price_range',
                # ... other facets
            }
        }
        # ... more products
    ]
```
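
Before embedding a custom catalog, it can help to check that each record has the fields used in the template above and to decide which text is sent to the embedding model. The sketch below is one way to do that; `validate_product`, `embedding_text`, and the way title, content, and facets are combined are assumptions, not the project's actual behavior.

```python
REQUIRED_FIELDS = ("url", "title", "content", "facets")


def validate_product(product):
    """Return a list of problems found in one product record (empty if valid)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in product]
    if "facets" in product and not isinstance(product["facets"], dict):
        problems.append("facets must be a dict")
    for name in ("url", "title", "content"):
        if name in product and not str(product[name]).strip():
            problems.append(f"empty field: {name}")
    return problems


def embedding_text(product):
    """Combine title, content, and facets into one string to embed (assumed layout)."""
    facets = ", ".join(f"{key}: {value}" for key, value in sorted(product["facets"].items()))
    return f"{product['title']}\n{product['content']}\nFacets: {facets}"


if __name__ == "__main__":
    product = {
        "url": "/shoes/running/men",
        "title": "Men's Running Shoes",
        "content": "Lightweight shoes for daily training.",
        "facets": {"category": "shoes", "gender": "men"},
    }
    print(validate_product(product) or "ok")
    print(embedding_text(product))
```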

Production Deployment

For production use, consider:

Batch Processing: Process embeddings offline for large catalogs
Caching: Cache embeddings and search results
Database Integration: Store embeddings in vector databases (Pinecone, Weaviate, Chroma)
API Wrapper: Create REST API endpoints for search functionality
Monitoring: Implement logging and analytics for search behavior
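
As an illustration of the caching point, embeddings can be keyed by a hash of the text so that unchanged product descriptions are never embedded twice. This is a minimal in-memory sketch; the class name `EmbeddingCache` is hypothetical, and a production setup would more likely persist the cache to disk or a database.

```python
import hashlib


class EmbeddingCache:
    """In-memory cache that calls embed_fn only for texts it has not seen."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store = {}
        self.misses = 0  # number of times embed_fn was actually called

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


if __name__ == "__main__":
    # A toy embed function standing in for a real (billable) embedding call.
    cache = EmbeddingCache(lambda text: [float(len(text))])
    for text in ["running shoes", "laptop", "running shoes"]:
        cache.get(text)
    print(f"embed calls: {cache.misses}")
```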

Performance Optimization

Batch Size: Optimize embedding generation batch sizes
Index Type: Consider different FAISS index types for your use case
Dimensionality: Experiment with different embedding dimensions
Filtering: Implement pre-filtering for large catalogs
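
Two of these points lend themselves to a short sketch: splitting texts into fixed-size batches before sending them to the embedding model, and narrowing the candidate set by facet before running similarity search. The helper names `batched` and `prefilter` are hypothetical, and the batch size is something to tune rather than a recommended value.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def prefilter(products, **required_facets):
    """Keep only products whose facets match every required key/value pair."""
    return [
        product
        for product in products
        if all(product.get("facets", {}).get(key) == value
               for key, value in required_facets.items())
    ]


if __name__ == "__main__":
    products = [
        {"title": "Men's Running Shoes", "facets": {"category": "shoes", "gender": "men"}},
        {"title": "Women's Running Shoes", "facets": {"category": "shoes", "gender": "women"}},
        {"title": "Laptop", "facets": {"category": "electronics"}},
    ]
    candidates = prefilter(products, category="shoes")
    for batch in batched([p["title"] for p in candidates], batch_size=1):
        print(batch)
```

Pre-filtering keeps the similarity search small, at the cost of maintaining or rebuilding an index per filtered subset when the subset changes often.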

🤝 Contributing

Fork the repository
Create a feature branch: git checkout -b feature-name
Make your changes and add tests
Commit your changes: git commit -am 'Add some feature'
Push to the branch: git push origin feature-name
Submit a pull request

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

⚠️ Disclaimer

This is a demonstration project showcasing semantic search concepts. For production use:

Implement proper error handling and logging
Add comprehensive testing
Consider scalability and performance requirements
Review security implications
Monitor API usage and costs

🙋‍♀️ Support

Issues: Report bugs and request features via GitHub Issues
Documentation: Check the inline code documentation
Questions: Feel free to open a discussion or issue

🔗 Related Resources

Made with ❤️ wolfnuker
