Hippo 🦛

pip install hippo-llm | Chinese documentation | Examples

Search your documents locally. BM25 works in 30 seconds; upgrade to hybrid search when you need it.

No ChromaDB. No cloud API. No jieba. One pip install.

30-second search

from hippo.embedding import VectorStore

# sparse mode: BM25 only, zero extra dependencies
store = VectorStore("docs.db")  # default mode="sparse"

store.add_batch([
    {"text": "Pipeline parallelism splits layers across devices"},
    {"text": "BM25 handles exact keyword matches"},
    {"text": "Speculative decoding improves latency by 2-3x"},
])

results = store.search("how to run big models on small GPUs", top_k=5)
for doc in results:
    print(f"[{doc.score:.3f}] {doc.text}")

No external vector DB. No embedding model download. SQLite for persistence. Works offline immediately.
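To make the sparse mode less of a black box, here is a minimal sketch of Okapi BM25 scoring in plain Python. It is not Hippo's implementation: the whitespace tokenizer, the `k1` and `b` defaults, and the Lucene-style non-negative IDF are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Lucene-style IDF, which never goes negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "Pipeline parallelism splits layers across devices",
    "BM25 handles exact keyword matches",
    "Speculative decoding improves latency by 2-3x",
]
keyword_scores = bm25_scores("keyword matches", docs)
paraphrase_scores = bm25_scores("big models", docs)
```

In this sketch a query that shares no terms with any document scores zero everywhere, which is the gap that dense embeddings are meant to close.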

→ Need semantic search? pip install hippo-llm[embedding] and switch to mode="hybrid" — same API, adds dense vectors + RRF fusion. See hybrid example ↓

Chinese-optimized: Built-in tokenizer with stop words. No jieba dependency.

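store.add_batch([
    # "Pipeline parallelism splits model layers across multiple devices"
    {"text": "管道并行将模型层拆分到多台设备上"},
    # "Hybrid search combines keyword matching and semantic similarity"
    {"text": "混合搜索结合了关键词匹配和语义相似度"},
])
# Query: "How do I run a large model on a low-end GPU?"
results = store.search("怎么在低端显卡上跑大模型", top_k=3)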
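For readers wondering how Chinese text can be indexed without a dictionary segmenter such as jieba, one common approach is overlapping character bigrams plus a stop-word filter. The sketch below illustrates that general technique only. Hippo's actual tokenizer may work differently, and `STOP_WORDS` and `tokenize` are hypothetical names.

```python
import re

# Tiny illustrative stop-word set; a real list would be much longer.
STOP_WORDS = {"的", "了", "在", "和", "是", "the", "a", "of"}

# Runs of CJK ideographs, or runs of ASCII letters and digits.
_TOKEN_RE = re.compile(r"[\u4e00-\u9fff]+|[a-z0-9]+")


def tokenize(text):
    """Split mixed Chinese/English text into terms without a dictionary."""
    terms = []
    for run in _TOKEN_RE.findall(text.lower()):
        is_cjk = "\u4e00" <= run[0] <= "\u9fff"
        if is_cjk and len(run) > 1:
            # Overlapping character bigrams stand in for word segmentation.
            grams = [run[i:i + 2] for i in range(len(run) - 1)]
        else:
            grams = [run]
        terms.extend(g for g in grams if g not in STOP_WORDS)
    return terms


doc_terms = tokenize("管道并行将模型层拆分到多台设备上")
query_terms = tokenize("怎么在低端显卡上跑大模型")
shared = set(doc_terms) & set(query_terms)
```

Here the query and the document share the bigram 模型 ("model"), so a BM25 index built on these terms would match them even though no segmenter was involved.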

Hybrid mode: BM25 + dense embedding with RRF fusion

pip install hippo-llm[embedding]

from hippo.embedding import EmbeddingEngine, VectorStore

engine = EmbeddingEngine(model="nomic-embed-text")  # local, no API key
store = VectorStore("docs.db", mode="hybrid", embedding_engine=engine)

store.add_batch([
    {"text": "Pipeline parallelism splits layers across devices"},
    {"text": "BM25 handles exact keyword matches"},
], engine=engine)

# RRF fusion: BM25 exact match + semantic similarity
results = store.search("how to run big models on small GPUs", engine=engine, top_k=5)
for doc in results:
    print(f"[{doc.score:.3f}] {doc.text}")
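Reciprocal Rank Fusion (RRF) merges the BM25 ranking and the dense ranking by summing `1 / (k + rank)` for each document across both lists. The sketch below shows the standard formula. The function name `rrf_fuse`, the document ids, and the constant `k=60` (the value commonly used in the literature) are assumptions, not Hippo internals.

```python
def rrf_fuse(rankings, k=60, top_k=5):
    """Fuse several ranked lists of document ids with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_k]


bm25_ranking = ["doc_b", "doc_a", "doc_c"]   # best keyword match first
dense_ranking = ["doc_a", "doc_c", "doc_b"]  # best semantic match first
fused = rrf_fuse([bm25_ranking, dense_ranking])
for doc_id, score in fused:
    print(f"[{score:.4f}] {doc_id}")
```

Because RRF uses ranks rather than raw scores, the BM25 and cosine-similarity scales never have to be normalized against each other.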

Full RAG example with local LLM (hybrid mode)

from hippo.embedding import EmbeddingEngine, VectorStore
import openai

# 1. Index documents (one-time)
engine = EmbeddingEngine(model="nomic-embed-text")
store = VectorStore("knowledge.db", mode="hybrid", embedding_engine=engine)

documents = [
    "Hippo splits model layers across multiple devices using TCP.",
    "Each device only loads its shard of layers, reducing memory per device.",
    "The loop detector catches semantic repetition using Jaccard similarity.",
    "BM25 hybrid search combines keyword matching with semantic similarity.",
]
store.add_batch([{"text": d} for d in documents], engine=engine)

# 2. RAG query
query = "how does hippo handle memory?"
results = store.search(query, engine=engine, top_k=2)
context = "\n".join(doc.text for doc in results)

# 3. Generate answer with local LLM
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="qwen3-30b-a3b-q3",
    messages=[
        {"role": "system", "content": f"Answer based on this context:\n{context}"},
        {"role": "user", "content": query}
    ]
)
print(response.choices[0].message.content)

Why Hippo for embedding?

Every RAG pipeline, semantic router, and agent memory layer needs embeddings. Most people call cloud APIs (OpenAI, Cohere) and pay per token. Hippo runs everything locally.

Problem	Hippo's answer
Cloud embedding APIs: latency + cost + privacy	Local embedding, zero network calls
Installing ChromaDB + connector + embedding model separately	`VectorStore(mode="hybrid")` — one class, SQLite-backed
Chinese search needs jieba + extra config	Built-in tokenizer, zero config
BM25 vs dense — which to pick?	RRF fusion combines both, no choosing needed
>10K documents?	ANN index, sub-ms queries
Pipeline parallelism for big models	Split any GGUF across machines (Mac + PC mixed)

Real numbers from production use:

Metric	Value
BM25 query latency	<1ms
Dense search (bge-small-zh, 512d)	5ms
Hybrid RRF fusion	<1ms overhead
ANN index build (10K docs)	~2s
OOD accuracy (110 queries)	85.5% top-1 (bge-small-zh), 92.7% (bge-m3)
Keyword + embedding fusion	91.8% top-1 (the two signals are complementary, TOOLS #94)
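Latency figures like these depend on hardware and corpus size, so it can be worth measuring on your own setup. The harness below is a generic sketch: `measure_latency_ms` is a hypothetical helper, and it times any callable you pass in rather than relying on a specific Hippo API.

```python
import statistics
import time


def measure_latency_ms(search_fn, queries, warmup=3, repeats=20):
    """Return median and p95 latency in milliseconds for search_fn(query)."""
    for query in queries[:warmup]:
        search_fn(query)  # warm caches before timing
    samples = []
    for _ in range(repeats):
        for query in queries:
            start = time.perf_counter()
            search_fn(query)
            samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Nearest-rank style p95, clamped to the last sample.
    p95 = samples[min(len(samples) - 1, int(0.95 * len(samples)))]
    return {"median_ms": statistics.median(samples), "p95_ms": p95}
```

With a store from the examples above, a call might look like `measure_latency_ms(lambda q: store.search(q, top_k=5), my_queries)`, where `my_queries` is your own list of query strings.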

Inference: run big models on cheap hardware

hippo-pipeline serve --model qwen3-30b-a3b-q3 --mode standalone
# → OpenAI-compatible API at localhost:8000/v1/chat/completions
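Since the server exposes an OpenAI-style endpoint, it can also be called without the `openai` package. The sketch below uses only the standard library and assumes the standard chat completions request and response schema. `build_chat_request` and `chat` are hypothetical helper names, and the default model name is taken from the command above.

```python
import json
import urllib.request


def build_chat_request(prompt, model="qwen3-30b-a3b-q3",
                       base_url="http://localhost:8000/v1"):
    """Build a POST request for the OpenAI-style chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as resp:
        body = json.load(resp)
    # Assumes the OpenAI response shape: choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

Calling `chat("Summarize pipeline parallelism in one sentence.")` requires the server to be running on localhost:8000.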

Two-machine pipeline parallelism

# Machine 1
hippo-pipeline serve --model gemma-3-12b --mode pipeline --rank 0

# Machine 2
hippo-pipeline serve --model gemma-3-12b --mode pipeline --rank 1 \
  --coordinator http://192.168.1.10:9000

Split the model across machines. Plain TCP, no MPI. Mac + PC mixed.
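Conceptually, pipeline parallelism gives each rank a contiguous block of layers, and activations flow from one rank to the next. The sketch below shows one plausible way to size those blocks in proportion to each device's memory. It is an illustration only: `split_layers` is a hypothetical function, it assumes all layers are the same size, and Hippo's real memory budgeting may differ.

```python
def split_layers(num_layers, memory_gb):
    """Assign contiguous [start, end) layer ranges proportional to memory."""
    if not memory_gb or num_layers < len(memory_gb):
        raise ValueError("need at least one layer per device")
    total = float(sum(memory_gb))
    shards, start, seen = [], 0, 0.0
    for rank, mem in enumerate(memory_gb):
        seen += mem
        if rank == len(memory_gb) - 1:
            end = num_layers  # last rank takes whatever remains
        else:
            end = round(num_layers * seen / total)
        shards.append((start, end))
        start = end
    return shards


# Two equal machines versus a 16 GB machine paired with a 32 GB machine.
equal_split = split_layers(48, [16, 16])
uneven_split = split_layers(48, [16, 32])
```

Rank 0 would load only the layers in its first range, run them, and send the resulting activations over the network to rank 1.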

What's inside

Feature	Details
Embedding + Hybrid Search	Dense + BM25 + RRF fusion. SQLite-backed, sub-ms queries.
Chinese-optimized BM25	Built-in tokenizer with stop words. No jieba needed.
ANN Index	Approximate nearest neighbor for large collections (>10K docs).
Pipeline Parallelism	Split any GGUF model across N machines. Mac + PC mixed.
Loop Detection	Jaccard-similarity detector catches semantic repetition.
OpenAI-Compatible API	Drop-in `/v1/chat/completions`. Works with LangChain, LlamaIndex.
Auto Memory Budget	Calculates shard splits from available VRAM automatically.
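The loop detection row refers to Jaccard similarity, which is the size of the intersection of two sets divided by the size of their union. A minimal sketch of how that could flag repeated output follows. The window size, the threshold, and the function names are assumptions rather than Hippo's actual detector.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |A & B| / |A | B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_looping(text, window=8, threshold=0.8):
    """Flag output whose last `window` tokens nearly repeat the window before."""
    tokens = text.lower().split()
    if len(tokens) < 2 * window:
        return False  # not enough output to compare two windows
    recent = set(tokens[-window:])
    previous = set(tokens[-2 * window:-window])
    return jaccard(recent, previous) >= threshold


repeated = "the model splits layers across devices over tcp " * 2
varied = "a b c d e f g h i j k l m n o p"
```

Comparing token sets rather than exact strings means the check still fires when the model repeats itself with small reorderings.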

When to use Hippo

You want...	Use this
Search documents in 30 seconds	`VectorStore("docs.db")` — BM25, zero config
Search Chinese documents	Built-in tokenizer, zero config
Agent memory / semantic routing	`pip install hippo-llm[embedding]` → hybrid RRF
Local inference on one machine	`--mode standalone` with any GGUF model
Run a model too big for one device	`--mode pipeline` across 2+ machines

Install

pip install hippo-llm

Zero dependencies beyond numpy. BM25 search works immediately.

pip install hippo-llm[embedding]  # add dense vectors + hybrid RRF fusion

Requirements: Python 3.10+. Dense embedding needs a local model (via Ollama or sentence-transformers cache).

Roadmap

v0.3: ANN index + Chinese tokenizer + hybrid RRF + sparse default ✅
v0.4: Built-in embedding models (bge-small-zh 5ms sweet spot), reranker, real-time routing (<10ms)
v0.5: Agent memory layer (embedding-backed episodic memory)
v0.6: Multi-shard support (>2 devices), speculative decoding

Benchmarks

Setup	Model	Speed
Mac Mini M2 (16GB)	Qwen3-4B-Q4	41 tok/s
RTX 5060 Ti (16GB)	Qwen3-14B-Q4	41 tok/s
2× Mac Mini (16GB each)	Qwen3-30B-A3B-Q3	78 tok/s
Mac Mini M2 (16GB)	Qwen3-30B-A3B-Q3	24 tok/s

License

MIT

Author

lawcontinue — GitHub
