Semantic Search API

A self-hostable API for searching products semantically using vector embeddings.

Features two-stage retrieval (semantic search + reranking), distributed tracing, metrics, and configurable similarity thresholds.

Quick Start

  1. Clone the repository
    git clone git@github.com:NafiAsib/semantic-search-api.git
    cd semantic-search-api
  2. Dataset
  • Download the dataset into the dataset/ directory
curl -L -o $(pwd)/dataset/amazon-best-sellers-product-dataset.zip \
  https://www.kaggle.com/api/v1/datasets/download/nafiasib/amazon-best-sellers-product-dataset
  • Unzip & remove zip file
unzip dataset/amazon-best-sellers-product-dataset.zip -d dataset/
rm dataset/amazon-best-sellers-product-dataset.zip
  3. Configure environment

    cp .env.example .env # Edit .env with your settings
  4. Start services with Docker Compose

    make docker-up

    This starts:

    • PostgreSQL with pgvector extension
    • Jaeger for distributed tracing
    • Prometheus for metrics
    • Grafana for visualization
  5. Create database schema

    make db-reset
  6. Install Python dependencies

    uv sync
  7. Load your embedding and reranker models

    # Start llama.cpp server with Qwen3-Embedding-0.6B on port 8080
    # Make sure it's running at the URL specified in .env (default: http://192.168.0.110:8080)
    
    # Start llama.cpp server with Qwen3-Reranker-0.6B on port 8081
    # Make sure it's running at the URL specified in .env (default: http://192.168.0.110:8081)
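    #
    # For example (a sketch only; exact flags can vary between llama.cpp builds
    # and quantizations):
    # llama-server -m Qwen3-Embedding-0.6B.gguf --embedding --port 8080
    # llama-server -m Qwen3-Reranker-0.6B.gguf --reranking --port 8081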
  8. Generate embeddings for products

    python app/embedding.py

For initial testing, you may want to try it out with only 500 products, since generating vectors for all ~24K products will take some time. Uncomment df = df.head(500) in the main function of app/embedding.py.
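
For reference, the embedding step boils down to something like the sketch below. It assumes the llama.cpp server exposes the OpenAI-compatible /v1/embeddings endpoint; the helper name and the example HTTP client (httpx) are illustrative rather than what app/embedding.py actually uses.

    # Sketch of requesting embeddings from the llama.cpp server (illustrative;
    # see app/embedding.py for the actual implementation).
    import os

    import httpx

    EMBEDDING_API_URL = os.getenv("EMBEDDING_API_URL", "http://192.168.0.110:8080")
    EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "Qwen3-Embedding-0.6B")

    def embed(texts: list[str]) -> list[list[float]]:
        """Return one embedding vector per input text."""
        resp = httpx.post(
            f"{EMBEDDING_API_URL}/v1/embeddings",
            json={"model": EMBEDDING_MODEL, "input": texts},
            timeout=60.0,
        )
        resp.raise_for_status()
        return [item["embedding"] for item in resp.json()["data"]]

    if __name__ == "__main__":
        vectors = embed(["Wireless over-ear headphones with noise cancelling"])
        print(len(vectors[0]))  # should equal EMBEDDING_DIMENSION (e.g. 1024)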

  9. Start the API

    python app/main.py
  10. Test the search endpoint

    curl -X POST "http://localhost:8000/search" \
      -H "Content-Type: application/json" \
      -d '{"query": "wireless headphones"}'

Makefile Commands

  • make docker-up - Start all Docker services
  • make docker-down - Stop all Docker services
  • make db-create - Create database schema and extensions
  • make db-reset - Drop and recreate database (destroys data)
  • make db-status - Show database status and table info
  • make db-connect - Connect to database via psql
  • make help - Show all available commands

Services

Configuration

Key environment variables in .env:

Embedding Configuration:

  • EMBEDDING_API_URL - Your llama.cpp server endpoint (default: http://192.168.0.110:8080)
  • EMBEDDING_MODEL - Model name (e.g., Qwen3-Embedding-0.6B)
  • EMBEDDING_DIMENSION - Output dimensions (must match db.sql)

Reranker Configuration:

  • RERANKER_API_URL - Your llama.cpp reranker endpoint (default: http://192.168.0.110:8081)
  • RERANKER_MODEL - Reranker model name (default: Qwen/Qwen3-Reranker-0.6B)
  • RERANKER_CANDIDATES - Number of candidates to fetch for reranking (default: 20)

Search Configuration:

  • SEARCH_RESULT_LIMIT - Number of results to return (default: 3)
  • SEARCH_SIMILARITY_THRESHOLD - Minimum similarity score (default: 0.5)
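
As a rough sketch of how these variables can be read at runtime (the project's actual config loading may differ, e.g. it may use pydantic; the defaults below are the ones documented above):

    # Illustrative settings loader using the documented variable names/defaults.
    import os

    EMBEDDING_API_URL = os.getenv("EMBEDDING_API_URL", "http://192.168.0.110:8080")
    EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "Qwen3-Embedding-0.6B")
    EMBEDDING_DIMENSION = int(os.getenv("EMBEDDING_DIMENSION", "1024"))  # must match the schema

    RERANKER_API_URL = os.getenv("RERANKER_API_URL", "http://192.168.0.110:8081")
    RERANKER_MODEL = os.getenv("RERANKER_MODEL", "Qwen/Qwen3-Reranker-0.6B")
    RERANKER_CANDIDATES = int(os.getenv("RERANKER_CANDIDATES", "20"))

    SEARCH_RESULT_LIMIT = int(os.getenv("SEARCH_RESULT_LIMIT", "3"))
    SEARCH_SIMILARITY_THRESHOLD = float(os.getenv("SEARCH_SIMILARITY_THRESHOLD", "0.5"))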

Features

  • Two-Stage Retrieval: Semantic search with vector similarity + reranking for improved relevance
  • Structured Logging (structlog)
  • Distributed Tracing (OpenTelemetry + Jaeger)
  • Metrics (Prometheus + Grafana)
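
For illustration, structured logging means log records are emitted as key-value events rather than free-form strings. A generic structlog sketch, not the project's exact processor chain:

    # Generic structlog configuration sketch; the project's setup may differ.
    import structlog

    structlog.configure(
        processors=[
            structlog.processors.add_log_level,
            structlog.processors.TimeStamper(fmt="iso"),
            structlog.processors.JSONRenderer(),
        ]
    )

    log = structlog.get_logger()
    log.info("search_completed", query="wireless headphones", candidates=20, returned=3)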

FAQ

Q1. Why only 1024 dimensions for embedding?

I'm using HNSW for vector indexing. With pgvector, you can't build an HNSW index on vectors with more than 2,000 dimensions; it's a limitation of the index implementation in PostgreSQL. Read this PR for more details.

Q2. HNSW vs IVFFlat

HNSW (Hierarchical Navigable Small World): Constructs a multi-level graph for faster and highly accurate approximate nearest neighbor searches, generally offering better performance than IVFFlat but using more memory.

IVFFlat (Inverted File with Flat Compression): Partitions the vector space into clusters and only searches the most relevant clusters, trading some accuracy for speed.

In short, HNSW queries are comparatively faster and more accurate than IVFFlat, at the cost of higher memory use.
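
The difference shows up mainly in the index DDL. An illustrative sketch using psycopg (the connection string, table, and column names are placeholders; the project's real index is defined in its SQL schema):

    # Illustrative pgvector index DDL for a hypothetical "products" table with
    # an "embedding vector(1024)" column; create whichever index type you prefer.
    import psycopg

    with psycopg.connect("postgresql://postgres:postgres@localhost:5432/postgres") as conn:
        # HNSW: graph-based, faster and more accurate queries, more memory.
        conn.execute(
            "CREATE INDEX IF NOT EXISTS products_embedding_hnsw "
            "ON products USING hnsw (embedding vector_cosine_ops)"
        )
        # IVFFlat alternative: cluster-based, lighter on memory; `lists` must be
        # tuned to the dataset size.
        # conn.execute(
        #     "CREATE INDEX products_embedding_ivfflat "
        #     "ON products USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100)"
        # )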

Q3. Which embedding should I use?

Visit the MTEB Leaderboard and choose one that fits your hardware.

Don't forget to update EMBEDDING_DIMENSION in .env and vector(N) in db/schema.sql, then run make db-reset.

Q4. Why use a reranker?

Reranking significantly improves search quality. The two-stage approach:

  1. Semantic Search: Fast vector similarity search retrieves the top 20 candidates (RERANKER_CANDIDATES)
  2. Reranking: Qwen3-Reranker-0.6B analyzes query-document pairs for better relevance scoring

This gives you speed (from vector search) + accuracy (from reranking).
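
A condensed sketch of the two stages, assuming a products table with an embedding vector column, psycopg for database access, and llama.cpp's rerank endpoint; the table/column names and the rerank response shape are assumptions, not the project's exact code:

    # Two-stage retrieval sketch: pgvector ANN search, then reranking via the
    # llama.cpp server. Table/column names and response fields are illustrative.
    import httpx
    import psycopg

    def search(query: str, query_vec: list[float], conn: psycopg.Connection,
               candidates: int = 20, limit: int = 3) -> list[tuple[str, float]]:
        # Stage 1: approximate nearest-neighbour search by cosine distance.
        vec_literal = "[" + ",".join(str(x) for x in query_vec) + "]"
        rows = conn.execute(
            "SELECT id, title FROM products "
            "ORDER BY embedding <=> %s::vector LIMIT %s",
            (vec_literal, candidates),
        ).fetchall()

        # Stage 2: rerank the candidate titles against the raw query text.
        resp = httpx.post(
            "http://192.168.0.110:8081/v1/rerank",
            json={
                "query": query,
                "documents": [title for _, title in rows],
                "top_n": limit,
            },
            timeout=60.0,
        )
        resp.raise_for_status()
        results = resp.json()["results"]  # assumed: [{"index": i, "relevance_score": s}, ...]
        return [(rows[r["index"]][1], r["relevance_score"]) for r in results]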

Q5. Which reranker should I use?

Qwen3-Reranker-0.6B is a great lightweight option. For other choices, check the MTEB Retrieval Leaderboard. Note: llama.cpp can only serve models that support reranking.

Q6. Do I need separate servers for embedding and reranking?

Yes, since llama.cpp runs one model per server instance. Run them on different ports (8080 for the embedding model, 8081 for the reranker). You can use llama-swap if you want to save memory by swapping models in and out dynamically.
