Vector DB 100M

Proof that Product Quantization (PQ) lets 100M 768-dimensional vectors fit in memory on a single machine.

What This Proves

Claim	How Code Demonstrates It
Raw float32 at 100M = 307 GB (286 GiB)	Calculate and print theoretical size
PQ compresses to ~10 GB	Build IVF-PQ index, measure actual file size
Recall is acceptable	Compare IVF-PQ results to exact brute-force
Latency is fast	Time queries on both indexes
Extrapolation is valid	Run at 1M, show linear scaling to 100M
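
The storage figures in the first two rows follow from simple arithmetic. The sketch below is not taken from the repository; the function names are hypothetical, and it assumes 8 bits per PQ sub-quantizer code, which is consistent with the m=96 figure in the expected output. It also shows why the same raw size can be quoted as 307 GB (decimal) or 286.1 (binary GiB).

```python
GIB = 2**30  # binary gigabyte, the unit that yields 286.1 for the raw size


def raw_float32_bytes(n_vectors: int, dim: int) -> int:
    """Bytes needed to store n_vectors uncompressed float32 vectors."""
    return n_vectors * dim * 4


def pq_code_bytes(n_vectors: int, m: int, nbits: int = 8) -> int:
    """Bytes for PQ codes alone: m sub-quantizer codes of nbits each per vector.

    nbits=8 is an assumption; codebooks, IVF centroids and vector ids are excluded.
    """
    bytes_per_vector = (m * nbits + 7) // 8  # round up to whole bytes
    return n_vectors * bytes_per_vector


n, dim, m = 100_000_000, 768, 96
raw = raw_float32_bytes(n, dim)
pq = pq_code_bytes(n, m)

print(f"raw float32: {raw / 1e9:.1f} GB ({raw / GIB:.1f} GiB)")  # 307.2 GB (286.1 GiB)
print(f"PQ codes:    {pq / 1e9:.1f} GB ({pq / GIB:.2f} GiB)")    # 9.6 GB (8.94 GiB)
print(f"compression: {raw / pq:.1f}x")                           # 32.0x
```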

Quick Start

# Install dependencies
pip install -r requirements.txt

# Run the demo
python scripts/run_demo.py

Expected Output

============================================================
VECTOR DB MEMORY DEMO
============================================================

[1] THEORETICAL STORAGE (before running anything)
    Database size: 1,000,000 vectors x 768 dims
    Float32 raw:   2.86 GB
    PQ codes:      0.0894 GB  (m=96)
    Compression:   32.0x

[2] GENERATING SYNTHETIC VECTORS...
    Train: (200000, 768), DB: (1000000, 768), Query: (10000, 768)

[3] BUILDING EXACT INDEX (FlatL2)...
    Build: ~0.5s
    Search 10,000 queries: ~15s

[4] BUILDING IVF-PQ INDEX...
    Build: ~80s
    Search 10,000 queries: ~2s

[5] RECALL EVALUATION
    Recall@10: ~0.81 (81%)

[6] INDEX SIZES ON DISK
    Flat index:   3.07 GB
    IVF-PQ index: ~0.12 GB
    Reduction:    ~26x

[7] EXTRAPOLATION TO 100M VECTORS
    Raw float32:     286.1 GB
    PQ codes only:   8.9 GB
    IVF-PQ index:    ~12 GB (estimated)
    Fits in 64GB RAM? YES

============================================================
CONCLUSION: PQ enables 100M vectors on a single machine
============================================================
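
Step [7] of the output above amounts to scaling the measured 1M index size by the ratio of vector counts. A minimal sketch of that step follows, assuming index size grows linearly with the number of vectors. The function names and the `headroom` parameter are hypothetical and not taken from `scripts/run_demo.py`.

```python
GB = 1e9


def extrapolate_index_bytes(measured_bytes: float, measured_n: int, target_n: int) -> float:
    """Scale a measured index size linearly with the number of vectors."""
    return measured_bytes * (target_n / measured_n)


def fits_in_ram(index_bytes: float, ram_bytes: float, headroom: float = 0.5) -> bool:
    """True if the index uses at most `headroom` of RAM.

    headroom is a hypothetical safety margin for the OS and query buffers.
    """
    return index_bytes <= headroom * ram_bytes


# Measured IVF-PQ index size at 1M vectors, taken from step [6] of the output.
estimate = extrapolate_index_bytes(0.12 * GB, 1_000_000, 100_000_000)
print(f"estimated IVF-PQ index at 100M: {estimate / GB:.0f} GB")
print("fits in 64 GB RAM:", fits_in_ram(estimate, 64 * GB))
```

A linear estimate treats fixed-size parts of the index (centroids, codebooks) as if they grew with the data, so it is a rough upper-level guide rather than an exact prediction.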

About the Synthetic Data

The demo generates structured synthetic vectors that mimic real embeddings:

Low-rank structure: Real embeddings live in a lower-dimensional manifold
Clustered: Vectors group around semantic topics
Correlated dimensions: Not random noise across all 768 dimensions

On this data, IVF-PQ achieves 80-90% recall@10, in line with what you'd see with real embeddings from OpenAI, Cohere, or similar models.
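
One way to obtain the three properties listed above is to sample clustered points in a small latent space and project them to 768 dimensions. The sketch below is an illustration only, not the code in `src/data.py`. The function name, the `rank`, `n_clusters` and `noise` parameters, and the final L2 normalization are all assumptions.

```python
import numpy as np


def make_structured_vectors(n, dim=768, rank=64, n_clusters=100, noise=0.1, seed=0):
    """Generate clustered, low-rank, correlated float32 vectors (hypothetical helper)."""
    rng = np.random.default_rng(seed)

    # Clustered: each vector is a cluster center plus small noise in latent space.
    centers = rng.standard_normal((n_clusters, rank))
    labels = rng.integers(0, n_clusters, size=n)
    latent = centers[labels] + noise * rng.standard_normal((n, rank))

    # Low-rank and correlated: one shared projection maps the latent space to
    # `dim` dimensions, so the output has rank at most `rank`.
    projection = rng.standard_normal((rank, dim)) / np.sqrt(rank)
    vectors = (latent @ projection).astype(np.float32)

    # Assumption: unit-normalize rows, as many embedding models do.
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors, labels


vectors, labels = make_structured_vectors(1_000)
print(vectors.shape, vectors.dtype)
```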

Parameter Tuning

Edit src/config.py to adjust:

n_db: Database size (default 1M, try 5M for more confidence)
m: PQ subspaces (higher = less compression, better recall)
nprobe: Partitions to search (higher = better recall, slower)
nlist: IVF partitions (more = faster search at a fixed nprobe but lower recall, slower build)
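
The effect of these parameters can be estimated before building anything. The sketch below is a rough model, not the contents of `src/config.py`. The class name is hypothetical, the `nlist` and `nprobe` defaults are placeholder values, `nbits=8` is assumed, and the scan fraction assumes evenly sized partitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IvfPqParams:
    dim: int = 768
    m: int = 96          # PQ subspaces
    nbits: int = 8       # bits per sub-quantizer code (assumed)
    nlist: int = 4096    # IVF partitions (placeholder value)
    nprobe: int = 32     # partitions searched per query (placeholder value)

    def __post_init__(self):
        # PQ splits each vector into m equal sub-vectors, so dim must divide evenly.
        if self.dim % self.m != 0:
            raise ValueError("dim must be divisible by m")

    def code_bytes(self) -> int:
        """Bytes stored per vector for the PQ code."""
        return (self.m * self.nbits + 7) // 8

    def compression_ratio(self) -> float:
        """Raw float32 bytes per vector divided by PQ code bytes."""
        return (self.dim * 4) / self.code_bytes()

    def scan_fraction(self) -> float:
        """Approximate share of the database scanned per query."""
        return min(self.nprobe, self.nlist) / self.nlist


params = IvfPqParams()
print(params.code_bytes(), params.compression_ratio(), params.scan_fraction())
print(IvfPqParams(m=48).compression_ratio())
```

In this model, halving m from 96 to 48 halves the code size and doubles the compression ratio, while raising nprobe increases the share of the database each query scans.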

Project Structure

vectordb-100m/
├── README.md
├── requirements.txt
├── src/
│   ├── config.py      # All parameters
│   ├── data.py        # Synthetic vector generation
│   ├── index_build.py # Build Flat and IVF-PQ indexes
│   ├── eval.py        # Recall@k calculation
│   └── utils.py       # Memory helpers
├── scripts/
│   └── run_demo.py    # One-command proof
└── notebooks/
    └── walkthrough.ipynb
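
The tree lists `eval.py` as the Recall@k calculation. A common definition, sketched here as an assumption rather than a copy of that file, is the fraction of exact top-k neighbors that also appear in the approximate top-k, averaged over all queries. The function name and signature are hypothetical.

```python
def recall_at_k(approx_ids, exact_ids, k=10):
    """Average overlap between approximate and exact top-k neighbor ids.

    approx_ids and exact_ids are sequences of per-query id lists, e.g. the id
    arrays returned by an IVF-PQ search and a brute-force search.
    """
    hits = 0
    for approx_row, exact_row in zip(approx_ids, exact_ids):
        # FAISS pads missing results with -1, so drop that sentinel.
        found = {int(i) for i in approx_row[:k]} - {-1}
        truth = {int(i) for i in exact_row[:k]}
        hits += len(found & truth)
    return hits / (len(exact_ids) * k)


exact = [[1, 2, 9], [4, 5, 6]]
approx = [[1, 2, 3], [6, 5, 4]]
print(recall_at_k(approx, exact, k=3))  # 5 of 6 true neighbors found
```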
