NinoRisteski/vectordb-100m
Vector DB 100M

A demonstration that Product Quantization (PQ) makes it possible to store and search 100M vectors on a single machine.

What This Proves

Each claim, and how the code demonstrates it:

  • Raw float32 at 100M ≈ 307 GB (286 GiB): calculate and print the theoretical size
  • PQ compresses this to ~10 GB: build an IVF-PQ index and measure the actual file size
  • Recall is acceptable: compare IVF-PQ results against an exact brute-force search
  • Latency is fast: time queries on both indexes
  • Extrapolation is valid: run at 1M vectors and show linear scaling to 100M
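The headline numbers follow from simple arithmetic: a 768-dim float32 vector takes 3,072 bytes, while PQ with m=96 sub-quantizers at 8 bits each stores 96 bytes. A quick sketch (sizes reported in GiB, which is what the demo's "GB" figures correspond to):

```python
# Back-of-envelope check of the storage claims above.
# Assumes 768-dim float32 vectors and PQ with m=96 one-byte sub-quantizer codes.
GIB = 1024**3

def raw_bytes(n, dims=768):
    """float32 storage: 4 bytes per dimension."""
    return n * dims * 4

def pq_bytes(n, m=96):
    """PQ storage: one byte per sub-quantizer code."""
    return n * m

for n in (1_000_000, 100_000_000):
    print(f"{n:>11,} vectors: raw {raw_bytes(n) / GIB:7.2f} GiB, "
          f"PQ {pq_bytes(n) / GIB:7.4f} GiB, {raw_bytes(n) / pq_bytes(n):.0f}x smaller")
```

At 1M this reproduces the 2.86 / 0.0894 figures from the demo output, and at 100M the 286.1 / 8.9 figures; the 32x ratio is just 3,072 / 96.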

Quick Start

# Install dependencies
pip install -r requirements.txt

# Run the demo
python scripts/run_demo.py

Expected Output

============================================================
VECTOR DB MEMORY DEMO
============================================================

[1] THEORETICAL STORAGE (before running anything)
    Database size: 1,000,000 vectors x 768 dims
    Float32 raw:   2.86 GB
    PQ codes:      0.0894 GB  (m=96)
    Compression:   32.0x

[2] GENERATING SYNTHETIC VECTORS...
    Train: (200000, 768), DB: (1000000, 768), Query: (10000, 768)

[3] BUILDING EXACT INDEX (FlatL2)...
    Build: ~0.5s
    Search 10,000 queries: ~15s

[4] BUILDING IVF-PQ INDEX...
    Build: ~80s
    Search 10,000 queries: ~2s

[5] RECALL EVALUATION
    Recall@10: ~0.81 (81%)

[6] INDEX SIZES ON DISK
    Flat index:   3.07 GB
    IVF-PQ index: ~0.12 GB
    Reduction:    ~26x

[7] EXTRAPOLATION TO 100M VECTORS
    Raw float32:     286.1 GB
    PQ codes only:   8.9 GB
    IVF-PQ index:    ~12 GB (estimated)
    Fits in 64GB RAM? YES

============================================================
CONCLUSION: PQ enables 100M vectors on a single machine
============================================================

About the Synthetic Data

The demo generates structured synthetic vectors that mimic real embeddings:

  • Low-rank structure: real embeddings lie near a lower-dimensional manifold
  • Clustered: Vectors group around semantic topics
  • Correlated dimensions: Not random noise across all 768 dimensions

This achieves 80-90% recall with IVF-PQ, matching what you'd see with real embeddings from OpenAI, Cohere, or similar models.
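A minimal sketch of how such vectors can be generated (this is not the repo's data.py; the function and parameter names are illustrative):

```python
import numpy as np

def make_vectors(n, dims=768, rank=64, n_clusters=100, seed=0):
    """Synthetic embeddings: clustered, low-rank, correlated across dims."""
    rng = np.random.default_rng(seed)
    basis = rng.standard_normal((rank, dims))           # shared low-rank basis
    centers = rng.standard_normal((n_clusters, rank))   # topic centroids (latent space)
    labels = rng.integers(0, n_clusters, size=n)        # assign each vector a topic
    latent = centers[labels] + 0.3 * rng.standard_normal((n, rank))
    x = (latent @ basis).astype("float32")              # project up: dims are correlated
    return x / np.linalg.norm(x, axis=1, keepdims=True) # unit norm, like real embeddings

vecs = make_vectors(10_000)
print(vecs.shape)  # (10000, 768)
```

Because every vector is a combination of the same rank-64 basis, the 768 dimensions are correlated, which is exactly what lets PQ codebooks compress them well.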

Parameter Tuning

Edit src/config.py to adjust:

  • n_db: Database size (default 1M, try 5M for more confidence)
  • m: PQ subspaces (fewer = more compression, lower recall)
  • nprobe: Partitions to search (higher = better recall, slower)
  • nlist: IVF partitions (more = faster search, slower build)
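For reference, a config with these knobs might look like the following hypothetical sketch; n_db, m, nprobe, and nlist come from the list above and dims/nbits from the demo output, but the nlist and nprobe values are illustrative guesses, not the repo's defaults:

```python
# Hypothetical sketch of src/config.py.
n_db = 1_000_000      # database size (try 5_000_000 for more confidence)
dims = 768            # embedding dimensionality
m = 96                # PQ sub-quantizers; dims must divide evenly by m
nbits = 8             # bits per PQ code (256 centroids per subspace)
nlist = 4096          # IVF partitions (illustrative value)
nprobe = 32           # partitions searched per query (illustrative value)

assert dims % m == 0, "each sub-quantizer must get a whole slice of dims"
```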

Project Structure

vectordb-100m/
├── README.md
├── requirements.txt
├── src/
│   ├── config.py      # All parameters
│   ├── data.py        # Synthetic vector generation
│   ├── index_build.py # Build Flat and IVF-PQ indexes
│   ├── eval.py        # Recall@k calculation
│   └── utils.py       # Memory helpers
├── scripts/
│   └── run_demo.py    # One-command proof
└── notebooks/
    └── walkthrough.ipynb

About

How Vector DBs Store 100M Embeddings on One Machine (and Still Search Fast)
