Semantic Vector Search for E-commerce

A comprehensive implementation of semantic vector search for e-commerce faceted navigation, demonstrating how to replace traditional keyword-based filtering with AI-powered semantic clustering and search.

Author: Tejaswi Suresh https://www.linkedin.com/in/pseo/

🎯 Overview

This project showcases a modern approach to e-commerce search that goes beyond keyword matching. Instead of rigid faceted navigation, it uses:

  • Semantic embeddings to understand product meaning
  • Vector similarity search for intelligent product discovery
  • Clustering analysis to automatically group similar products
  • Performance monitoring to track search quality

🚀 Features

  • Embedding Generation: Convert product pages to semantic vectors using OpenAI's text-embedding models
  • Vector Search Engine: Fast similarity search using FAISS (Facebook AI Similarity Search)
  • Semantic Clustering: Automatically group products by semantic similarity using HDBSCAN
  • Performance Analytics: Monitor and analyze search performance with detailed metrics
  • Comparison Tools: Side-by-side comparison of semantic vs. keyword search
  • Interactive Visualizations: Plots showing cluster relationships and search results

📋 Prerequisites

  • Python 3.8+
  • OpenAI API key (optional - demo mode available without it)

🛠️ Installation

Option 1: Automated Setup (Recommended)

# Clone the repository
git clone https://github.com/frostyhand/vector-search-seo
cd vector-search-seo

# Run the automated setup script
./setup.sh

The setup script will:

  • ✅ Check Python installation
  • 📦 Create virtual environment
  • 📚 Install all dependencies
  • 📝 Create .env file from template
  • 🎯 Provide next steps

Option 2: Manual Setup

  1. Clone the repository

    git clone https://github.com/frostyhand/vector-search-seo
    cd vector-search-seo
  2. Create a virtual environment

    python -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Set up environment variables (Optional)

    Copy the example environment file:

    cp env.example .env

    Edit .env and add your OpenAI API key:

    OPENAI_API_KEY=your_openai_api_key_here

    Get your API key: Visit OpenAI Platform to create an API key.

    Note: The system works without an API key using mock embeddings for demonstration purposes.
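As a rough illustration of the fallback described in the note above, the sketch below decides between real and mock embeddings based on `OPENAI_API_KEY`. The function name `resolve_embedding_mode` is hypothetical and is not taken from the repository; the placeholder string is the one shown in the `.env` example.

```python
import os

# Placeholder value from the .env example; treated the same as "no key set".
PLACEHOLDER = "your_openai_api_key_here"


def resolve_embedding_mode(env=None):
    """Return ("openai", key) when a usable key is present, else ("mock", None).

    Hypothetical helper: the repository may implement this check differently.
    """
    env = os.environ if env is None else env
    key = (env.get("OPENAI_API_KEY") or "").strip()
    if not key or key == PLACEHOLDER:
        return ("mock", None)
    return ("openai", key)


if __name__ == "__main__":
    mode, _ = resolve_embedding_mode()
    print(f"Embedding mode: {mode}")
```

Passing a plain dict instead of `os.environ` makes the check easy to exercise without touching the real environment.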

🎮 Quick Start

Interactive Demo (Recommended)

# Make sure your virtual environment is activated
source .venv/bin/activate  # Skip if the environment is already active in this shell

# Launch the interactive demo
python demo.py

This will launch an interactive menu where you can:

  • Generate embeddings for sample product data
  • Run semantic searches
  • Compare semantic vs keyword search
  • Perform clustering analysis
  • View performance metrics

Individual Components

Generate embeddings:

python embedding_generator.py

Run semantic clustering:

python semantic_clustering.py

Test the search engine:

python vector_search_engine.py

Monitor performance:

python performance_monitor.py

🔧 API Key Setup

Getting Your OpenAI API Key

  1. Visit OpenAI Platform
  2. Sign in or create an account
  3. Click "Create new secret key"
  4. Copy the key and add it to your .env file

Cost Considerations

  • text-embedding-3-small: ~$0.02 per 1M tokens
  • Sample dataset: ~$0.01 to process
  • Production usage: Monitor usage on OpenAI dashboard
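The per-token rate above turns into a simple estimate. The sketch below is only arithmetic on the quoted rate; the function name and the 500,000-token figure are illustrative assumptions, not measurements from the sample dataset.

```python
# Approximate text-embedding-3-small rate quoted above (USD per 1M tokens).
PRICE_PER_MILLION_TOKENS_USD = 0.02


def estimate_embedding_cost(token_count, price_per_million=PRICE_PER_MILLION_TOKENS_USD):
    """Estimate embedding cost in USD for a given number of input tokens."""
    if token_count < 0:
        raise ValueError("token_count must be non-negative")
    return token_count / 1_000_000 * price_per_million


if __name__ == "__main__":
    # Illustrative workload of 500,000 tokens.
    print(f"${estimate_embedding_cost(500_000):.2f}")  # prints $0.01
```

Actual billing depends on current pricing, so treat the result as a rough planning number and check the OpenAI dashboard for real usage.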

Demo Mode (No API Key Required)

The system includes a comprehensive demo mode that works without an API key:

  • Uses semantically meaningful mock embeddings
  • Demonstrates all features and concepts
  • Perfect for learning and experimentation
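The repository's actual mock embedding scheme is not reproduced here. As one possible approach, the sketch below uses feature hashing: each word is hashed into a fixed bucket, so texts that share words end up with overlapping vectors. The names `mock_embedding` and `cosine`, and the default dimension of 256, are assumptions made for illustration.

```python
import hashlib
import math


def mock_embedding(text, dim=256):
    """Deterministic bag-of-words embedding: hash each token into a bucket, then L2-normalize."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(a, b):
    """Cosine similarity of two already-normalized vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    a = mock_embedding("nike running shoes")
    b = mock_embedding("adidas running shoes")
    print(f"similarity: {cosine(a, b):.2f}")
```

Unlike a real embedding model, this only captures word overlap, so synonyms such as "sneakers" and "trainers" would not be matched.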

📊 Understanding the Output

Semantic Search Results

{
  "rank": 1,
  "similarity_score": 0.89,
  "title": "Men's Nike Running Shoes Under $100",
  "url": "/shoes/running/men/nike/under-100",
  "facets": {
    "category": "shoes",
    "subcategory": "running",
    "gender": "men",
    "brand": "nike",
    "price_range": "under-100"
  }
}

Clustering Analysis

  • Cluster 0: Electronics (laptops, headphones)
  • Cluster 1: Athletic footwear and apparel
  • Cluster 2: Outdoor clothing and gear
  • Noise: Miscellaneous items that don't fit clear patterns
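HDBSCAN marks noise points with the label `-1`, which is how the "Noise" group above can be separated from numbered clusters. The sketch below groups product titles by label; the helper name and the sample labels are made up for illustration and are not output from the repository.

```python
from collections import defaultdict


def group_by_cluster(titles, labels):
    """Group titles by cluster label, mapping the HDBSCAN noise label (-1) to "noise"."""
    if len(titles) != len(labels):
        raise ValueError("titles and labels must have the same length")
    groups = defaultdict(list)
    for title, label in zip(titles, labels):
        key = "noise" if label == -1 else f"cluster_{label}"
        groups[key].append(title)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical labels, as a clustering run might assign them.
    titles = ["Laptop", "Headphones", "Trail shoes", "Gift card"]
    labels = [0, 0, 1, -1]
    print(group_by_cluster(titles, labels))
```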

Performance Metrics

  • Search Accuracy: How well results match user intent
  • Facet Coverage: Percentage of product facets found through search
  • Response Time: Search latency and performance stats
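The exact metric definitions used by `performance_monitor.py` are not spelled out here, so the sketch below fixes one plausible reading: facet coverage as the percentage of distinct (facet, value) pairs in the catalog that appear in at least one returned result, and response time as wall-clock latency per query. All function names are hypothetical.

```python
import statistics
import time


def facet_pairs(products):
    """Collect the distinct (facet name, facet value) pairs across products."""
    return {(name, value) for p in products for name, value in p["facets"].items()}


def facet_coverage(catalog, returned):
    """Percentage of catalog facet pairs that show up in the returned results."""
    all_pairs = facet_pairs(catalog)
    if not all_pairs:
        return 0.0
    return 100.0 * len(facet_pairs(returned) & all_pairs) / len(all_pairs)


def time_search(search_fn, queries):
    """Run search_fn on each query and report mean and max latency in milliseconds."""
    if not queries:
        raise ValueError("queries must not be empty")
    latencies_ms = []
    for query in queries:
        start = time.perf_counter()
        search_fn(query)
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return {"mean_ms": statistics.mean(latencies_ms), "max_ms": max(latencies_ms)}
```

Search accuracy is left out on purpose, since it needs labeled relevance judgments that this sketch does not assume.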

🏗️ Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│  Product Pages  │───▶│  Embedding       │───▶│  Vector Store   │
│  (text + facets)│    │  Generation      │    │  (FAISS index)  │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                                         │
┌─────────────────┐    ┌──────────────────┐             │
│  Search Results │◀───│  Similarity      │◀────────────┘
│  (ranked)       │    │  Search          │
└─────────────────┘    └──────────────────┘
                                ▲
                       ┌──────────────────┐
                       │  Query Embedding │
                       │  (real-time)     │
                       └──────────────────┘
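The flow in the diagram can be sketched end to end with a small in-memory store standing in for the FAISS index. This is a simplified illustration, not the repository's implementation: `InMemoryVectorStore` is a hypothetical class, and the three-dimensional vectors are toy values rather than real embeddings.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


class InMemoryVectorStore:
    """Stand-in for the vector store: keeps normalized vectors next to product records."""

    def __init__(self):
        self._vectors = []
        self._products = []

    def add(self, product, vector):
        self._products.append(product)
        self._vectors.append(l2_normalize(vector))

    def search(self, query_vector, k=3):
        """Return the top-k products ranked by inner product with the normalized query."""
        q = l2_normalize(query_vector)
        scored = [(sum(a * b for a, b in zip(q, v)), i) for i, v in enumerate(self._vectors)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [
            {"rank": rank, "similarity_score": round(score, 4), **self._products[i]}
            for rank, (score, i) in enumerate(scored[:k], start=1)
        ]


if __name__ == "__main__":
    store = InMemoryVectorStore()
    store.add({"title": "Running shoes", "url": "/shoes/running"}, [1.0, 0.0, 0.0])
    store.add({"title": "Laptop", "url": "/electronics/laptops"}, [0.0, 1.0, 0.0])
    for hit in store.search([0.9, 0.1, 0.0], k=2):
        print(hit)
```

The returned records carry `rank` and `similarity_score` alongside the product fields, mirroring the result shape shown under "Semantic Search Results".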

📁 Project Structure

├── README.md                          # This file
├── LICENSE                           # MIT License
├── requirements.txt                   # Python dependencies
├── setup.sh                          # Automated setup script
├── env.example                       # Environment variables template
├── .gitignore                        # Git ignore rules
├── demo.py                          # Interactive demonstration
├── embedding_generator.py           # Generate product embeddings
├── vector_search_engine.py          # Semantic search implementation
├── semantic_clustering.py           # Product clustering analysis
└── performance_monitor.py           # Search performance metrics

🎯 Use Cases

E-commerce Applications

  • Product Discovery: Find similar products across categories
  • Recommendation Systems: "Customers who viewed this also liked..."
  • Search Autocomplete: Intelligent query suggestions
  • Inventory Analysis: Identify product gaps and opportunities

Beyond E-commerce

  • Content Management: Organize articles, documents, and media
  • Customer Support: Intelligent help article suggestions
  • Knowledge Bases: Semantic documentation search
  • Data Analysis: Cluster and analyze large text datasets

🔬 Technical Details

Embedding Model

  • Model: text-embedding-3-small (OpenAI)
  • Dimensions: 1536
  • Context Length: 8191 tokens
  • Performance: Fast, cost-effective, high-quality

Search Algorithm

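  • Index: FAISS IndexFlatIP (exact, brute-force inner-product search)
  • Similarity: Cosine similarity, obtained by L2-normalizing vectors before taking the inner product
  • Speed: Sub-millisecond search on 10K+ products
  • Scalability: IndexFlatIP compares the query against every stored vector, so search time grows linearly with catalog size; for millions of products, an approximate FAISS index (for example IVF or HNSW) is usually a better fit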
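The reason an inner-product index can return cosine similarity is that the two agree once vectors have unit length. The sketch below checks this in plain Python with hypothetical helper names; with FAISS, the normalization step is typically done by calling `faiss.normalize_L2` on the float32 matrix before adding it to the index and on each query vector.

```python
import math


def norm(vec):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(sum(x * x for x in vec))


def cosine_similarity(a, b):
    """Textbook cosine similarity: dot product divided by the product of lengths."""
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))


def normalized_inner_product(a, b):
    """Inner product after scaling both vectors to unit length."""
    a_hat = [x / norm(a) for x in a]
    b_hat = [y / norm(b) for y in b]
    return sum(x * y for x, y in zip(a_hat, b_hat))


if __name__ == "__main__":
    a, b = [3.0, 4.0], [4.0, 3.0]
    # Both expressions evaluate to 24/25 = 0.96 (up to floating-point rounding).
    print(cosine_similarity(a, b), normalized_inner_product(a, b))
```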

Clustering Method

  • Algorithm: HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise)
  • Advantages: Finds natural clusters, handles noise, and does not require the number of clusters in advance
  • Visualization: UMAP dimensionality reduction for 2D plotting

🚀 Advanced Usage

Custom Product Data

Replace the sample data in embedding_generator.py with your own product catalog:

def create_custom_product_data(self):
    return [
        {
            'url': '/your/product/url',
            'title': 'Your Product Title',
            'content': 'Your product description',
            'facets': {
                'category': 'your_category',
                'price': 'your_price_range',
                # ... other facets
            }
        }
        # ... more products
    ]
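Before generating embeddings for a custom catalog, it can help to check that each record has the fields used in the template above. The validator below is a hypothetical helper, not part of the repository; it assumes `url`, `title`, `content`, and `facets` are all required.

```python
# Fields taken from the product template above; treating all four as required is an assumption.
REQUIRED_FIELDS = ("url", "title", "content", "facets")


def validate_product(product):
    """Return a list of problems; an empty list means the record looks usable."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in product]
    if "facets" in product and not isinstance(product["facets"], dict):
        problems.append("facets must be a dict")
    for name in ("url", "title", "content"):
        if name in product and not str(product[name]).strip():
            problems.append(f"empty field: {name}")
    return problems


if __name__ == "__main__":
    record = {"url": "/your/product/url", "title": "Your Product Title", "facets": {}}
    print(validate_product(record))  # prints ['missing field: content']
```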

Production Deployment

For production use, consider:

  • Batch Processing: Process embeddings offline for large catalogs
  • Caching: Cache embeddings and search results
  • Database Integration: Store embeddings in vector databases (Pinecone, Weaviate, Chroma)
  • API Wrapper: Create REST API endpoints for search functionality
  • Monitoring: Implement logging and analytics for search behavior
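As an example of the caching point, the sketch below memoizes embeddings by a hash of the input text, so unchanged product descriptions are not re-embedded. `EmbeddingCache` and the `embed_fn` parameter are hypothetical; a production version would likely persist the store to disk or a database instead of keeping it in memory.

```python
import hashlib


class EmbeddingCache:
    """In-memory cache keyed by the SHA-256 of the text being embedded."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # any callable mapping text -> vector
        self._store = {}
        self.misses = 0  # number of times embed_fn was actually called

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


if __name__ == "__main__":
    # Toy embedding function used only to demonstrate the cache.
    cache = EmbeddingCache(lambda text: [float(len(text))])
    cache.get("trail running shoes")
    cache.get("trail running shoes")
    print(cache.misses)  # prints 1
```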

Performance Optimization

  • Batch Size: Optimize embedding generation batch sizes
  • Index Type: Consider different FAISS index types for your use case
  • Dimensionality: Experiment with different embedding dimensions
  • Filtering: Implement pre-filtering for large catalogs
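Pre-filtering narrows the candidate set by exact facet match before any similarity is computed. The sketch below shows the idea on plain lists; the function names are hypothetical, and the vectors are assumed to be L2-normalized already so that the dot product is the cosine similarity.

```python
def prefilter(products, required_facets):
    """Return indices of products whose facets match every required (name, value) pair."""
    return [
        i
        for i, product in enumerate(products)
        if all(product["facets"].get(name) == value for name, value in required_facets.items())
    ]


def filtered_search(query_vec, vectors, products, required_facets, k=5):
    """Rank only the pre-filtered candidates by inner product with the query."""
    candidates = prefilter(products, required_facets)
    scored = [(sum(q * v for q, v in zip(query_vec, vectors[i])), i) for i in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(products[i]["title"], score) for score, i in scored[:k]]


if __name__ == "__main__":
    products = [
        {"title": "Men's running shoes", "facets": {"category": "shoes", "gender": "men"}},
        {"title": "Women's running shoes", "facets": {"category": "shoes", "gender": "women"}},
        {"title": "Laptop", "facets": {"category": "electronics"}},
    ]
    vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]  # toy unit vectors
    print(filtered_search([1.0, 0.0], vectors, products, {"category": "shoes"}, k=2))
```

With a FAISS index, the same effect is often achieved by keeping one index per facet value or by filtering the returned IDs after search; which option fits best depends on how selective the filters are.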

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Make your changes and add tests
  4. Commit your changes: git commit -am 'Add some feature'
  5. Push to the branch: git push origin feature-name
  6. Submit a pull request

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

⚠️ Disclaimer

This is a demonstration project showcasing semantic search concepts. For production use:

  • Implement proper error handling and logging
  • Add comprehensive testing
  • Consider scalability and performance requirements
  • Review security implications
  • Monitor API usage and costs

🙋‍♀️ Support

  • Issues: Report bugs and request features via GitHub Issues
  • Documentation: Check the inline code documentation
  • Questions: Feel free to open a discussion or issue

Made with ❤️ wolfnuker
