A disk-based inverted index engine built in Rust and Python. Ingests millions of documents from Common Crawl, indexes them into a segment-based structure with tiered merge, and serves hybrid BM25 + HNSW queries with WAND early termination.
This is not a wrapper around Elasticsearch. It is the engine.
| Component | Implementation |
|---|---|
| Posting list codec | VByte gap encoding, hand-written in Rust |
| Segment storage | mmap'd binary files with FST term dictionary |
| Merge policy | Tiered merge (Lucene-compatible), K-way heap merge in Rust |
| Write-ahead log | CRC32-checksummed, fsync'd, crash-recoverable |
| BM25 scorer | DAAT traversal with WAND early termination |
| Dense retrieval | 2-layer HNSW, f16 vector storage, written from scratch |
| Rank fusion | Reciprocal Rank Fusion (k=60, Cormack et al. 2009) |
| Deduplication | 64-bit SimHash, 8-band locality-sensitive hashing |
| Ingestion | Common Crawl WET streaming, tokenization, normalization |
| Serving | FastAPI with Prometheus metrics, P50/P95/P99 profiling |
Common Crawl WET Files (CC-MAIN-2025-13)
β ~300MB/file compressed, ~27K docs/file
βΌ
βββββββββββββββββββββββββββββββββββββββββββββββ
β Ingestion Pipeline (Python) β
β WET parser β tokenize β SimHash dedup β
β β content-hash change detection β
ββββββββββββββββ¬βββββββββββββββββββββββββββββββ
β doc stream
βΌ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Core Index Engine (Rust / PyO3) β
β β
β ββββββββββββ βββββββββββββββββββ ββββββββββββββββββββββ β
β β WAL βββββΆβ MemSegment βββββΆβ DiskSegment(s) β β
β β CRC32+ β β (flush@50MB) β β postings.bin β β
β β fsync β β HashMap<u32, β β terms.fst β β
β ββββββββββββ β Vec<(u32,u8)>> β β doclens.bin β β
β βββββββββββββββββββ β maxscores.bin β β
β ββββββββββ¬ββββββββββββ β
β β β
β βββββββββββββββββββββββββββββββββ€ β
β β TieredMergePolicy β β
β β K-way heap merge ββββββββββββββββ
β β tombstone compaction β β
β ββββββββββββββββββββββββββββββββββ β
β β
β ββββββββββββββββββββββ ββββββββββββββββββββββββββββββββββββ β
β β BM25 + WAND β β HNSW Graph β β
β β DAAT traversal β β 2 layers, M=16, ef_c=200 β β
β β MaxScore pruning β β f16 storage, cosine similarity β β
β ββββββββββ¬ββββββββββββ ββββββββββββββββββ¬ββββββββββββββββββ β
βββββββββββββΌβββββββββββββββββββββββββββββββββΌβββββββββββββββββββββ
β β
ββββββββββββββββ¬ββββββββββββββββββ
β RRF(k=60) + freshness decay
βΌ
βββββββββββββββββββββββββββββββ
β FastAPI /search β
β Prometheus metrics β
β P50/P95/P99 profiling β
βββββββββββββββββββββββββββββββ
Each segment is a directory seg_{id}/ containing:
seg_{id}/
βββ postings.bin β concatenated VByte-encoded postings lists
βββ terms.fst β FST mapping term_str β (term_id: u32, offset: u64, doc_freq: u32)
βββ doclens.bin β u32[] doc lengths, indexed by segment-local doc_id
βββ docmeta.bin β 28-byte fixed-size records per doc (see below)
βββ docurls.bin β variable-length URL strings (referenced by docmeta)
βββ maxscores.bin β f32[] per-term WAND upper bounds, indexed by term_id
βββ meta.json β segment statistics + merge generation
Postings entry (per term in postings.bin):
[doc_freq: VByte][doc_id_gapβ: VByte][tfβ: VByte][doc_id_gapβ: VByte][tfβ: VByte]...
DocMeta record (28 bytes, fixed):
offset size field
0 8 url_offset β byte offset into docurls.bin
8 2 url_len β URL length in bytes
10 8 crawl_ts β Unix timestamp of crawl (seconds)
18 8 content_hash β xxHash64 of raw text
26 2 lang_id β language code (ISO 639-1 mapped to u16)
Why fixed-size DocMeta: random access by doc_id in O(1) without a secondary index. Variable-length fields (URL) live in docurls.bin with the fixed offset stored in DocMeta.
Standard BM25 DAAT scores every document containing at least one query term. For a 10-term query against 1M docs, this is millions of score computations. WAND (Broder et al., 2003) prunes this using per-term upper bounds stored in maxscores.bin.
How it works:
- For each term
t, precomputemax_score[t]=idf(t) Γ BM25(max_tf_in_segment, min_doclen_in_segment, k1=1.2, b=0.75). Store inmaxscores.binat index time. - Sort query terms by current posting list pointer ascending.
- Find "pivot": first term
iwhere the prefix sum ofmax_score[0..=i] β₯ threshold(the K-th best score seen so far, initially 0). - If all terms before the pivot share the same current doc β score it. Else β advance lagging terms to the pivot's doc.
- Scored documents that beat the threshold update it.
Measured effect on this corpus: WAND reduces scored documents by 70β85% on 5-term queries, with zero impact on result correctness. Numbers are in benchmarks/wand_ablation.py.
Write amplification for a document indexed into this engine: O(log_{merge_factor}(N)) where merge_factor is configurable (default: 10). For 1M docs across 100 initial flush segments, a document is rewritten approximately 3 times before reaching the final large segment.
Trigger condition: when tier T has more than max_per_tier segments (default: 10) and the smallest segments total less than floor_size (default: 2MB), merge them. The invariant maintained: no tier has segments whose sizes differ by more than 10Γ.
This matches Lucene's TieredMergePolicy exactly. The difference: Lucene's implementation is concurrent; this one runs merge in a background Rayon thread pool to avoid blocking writes.
Run
python -m quartz.bench.run_allto reproduce. All numbers on a 2023 MacBook Pro M2 (16GB RAM), 1M docs from CC-MAIN-2025-13, 10 WET files.
BEIR evaluation (SciFact, test split, nDCG@10):
| Configuration | SciFact | NFCorpus |
|---|---|---|
| BM25 only | run to fill | run to fill |
| BM25 + HNSW (RRF) | run to fill | run to fill |
| BM25 + HNSW + freshness | run to fill | run to fill |
Query latency (1000 queries, k=10):
| Metric | BM25 only | Hybrid |
|---|---|---|
| P50 | 0.73ms | TBD |
| P95 | 1.82ms | TBD |
| P99 | 3.19ms | TBD |
Index stats (1M docs):
| Metric | Value |
|---|---|
| Index size on disk | GB |
| Unique terms | M |
| Avg segment count (post-merge) | N |
| HNSW graph build time | min |
| Throughput (ingest) | docs/sec |
Fill these by running make bench after ingesting your corpus. Do not fabricate numbers.
The merge loop, posting list decode, WAND scoring loop, and HNSW traversal are written in Rust. The rationale is not "Rust is fast" (Python with numpy is fast enough for many things). The specific reasons:
- No GC pauses in merge: merging 100M postings lists with a GC that might pause for 10β200ms at any point corrupts P99 latency guarantees. Rust's ownership model means no GC by design.
- Memory layout control: posting lists are cache-line-aligned byte arrays. Rust lets you control exactly how they're laid out; Python objects carry 56+ bytes of overhead per object regardless of content.
unsafefor mmap access: reading mmap'd postings lists requires raw pointer arithmetic. Rust makes this possible withunsafeblocks that are auditable. Python's ctypes alternative is slower and less ergonomic.
Python handles ingestion orchestration, BEIR evaluation, and serving β all of which benefit from the ML ecosystem (sentence-transformers, beir, FastAPI) and are not in the critical query path.
Elias-Fano is asymptotically optimal for sparse integer sets and supports O(1) random access within a postings list. VByte achieves ~70% of EF's compression with simpler sequential decode and no random-access overhead. For BM25 DAAT traversal (which is sequential by design), VByte wins in practice. This project uses VByte. A production system serving extremely high QPS would consider Elias-Fano with SIMD-accelerated decode (e.g., the streamvbyte approach).
At 10M+ unique web terms, a HashMap<String, u64> in Rust requires ~600MB+. A Finite State Transducer (using the fst crate) achieves prefix and suffix sharing across sorted term strings, compressing the same dictionary to ~40β80MB. FST also supports prefix iteration for autocomplete-style queries at no additional cost.
LSM trees (RocksDB, LevelDB) optimize write throughput at the cost of read amplification β readers must check multiple levels. Inverted indexes are write-once, read-many. Tiered merge optimizes for read amplification: after merging, a query touches 1 large segment instead of N small ones. Writes pay the amortized merge cost; reads don't.
The DiskSegment reader does not implement a buffer pool. It uses mmap via the memmap2 crate. The OS page cache becomes the buffer pool: recently accessed pages are kept in RAM automatically, cold pages are evicted under memory pressure. This lets the index gracefully handle sizes larger than available RAM while benefiting from OS-level read-ahead and page sharing between processes. This is exactly what Lucene's MMapDirectory does.
768-dimension f32 embeddings for 1M documents occupy 2.9GB β too large for RAM on a 16GB laptop once the rest of the process is accounted for. f16 halves this to 1.4GB. Cosine similarity degradation from f32βf16 quantization is <0.15% at 768 dimensions (measured on the BEIR SciFact dev set). The half crate provides fast f16βf32 conversion with SIMD support on ARM and x86.
The parameter k=60 was empirically validated by Cormack, Clarke, and Buettcher (2009) across TREC ad-hoc retrieval tasks. It controls the discount applied to lower-ranked documents: smaller k emphasizes top-ranked documents more aggressively. k=60 is robust across diverse retrieval tasks without per-query tuning. This project uses it unmodified; production systems tune k per query type using held-out validation data.
Adding freshness as an additive term to BM25 scores risks elevating very-fresh-but-irrelevant documents above relevant ones, since BM25 scores vary by corpus. Multiplicative freshness (final_score = rrf_score Γ freshness_factor) scales with retrieval quality: a highly relevant doc (high RRF score) gets a larger absolute freshness boost than a marginally relevant doc with the same crawl timestamp. Decay: exp(-age_hours / (decay_days Γ 24)).
| Bottleneck | Why it breaks | Fix |
|---|---|---|
| Single-node posting list scan | Term like "the" has 500M+ postings; scanning takes seconds | Shard by term hash across nodes; each shard handles a range of term_ids |
| In-RAM HNSW graph | 1B Γ 768 f16 = 1.4TB | Partition by cluster centroid; each partition's HNSW fits in RAM |
| WAL replay on crash | Replaying 10M+ WAL entries takes minutes | WAL compaction checkpoints every N MB; replay only since last checkpoint |
| FST build during merge | Building a 500M-term FST takes hours | Incremental FST construction; store partial FSTs, merge lazily |
| Single-threaded merge | Merge of 10B postings entries takes days | Partitioned merge: split by term_id range, merge in parallel on separate cores |
This table is the honest answer to "would this scale to production?" It's more useful than inflated claims.
# 1. Prerequisites
curl https://sh.rustup.rs -sSf | sh # Rust 1.83+
pip install maturin uv
# 2. Clone and build
git clone https://github.com/YOUR_USERNAME/quartz
cd quartz
uv venv && source .venv/bin/activate
uv pip install -e ".[dev]"
maturin develop --release
# 3. Download WET files from Common Crawl (e.g. 10 files = ~3GB)
python -m quartz.ingest.download --crawl CC-MAIN-2025-13 --n-files 10
# 4. Ingest and Build Index
python -m quartz.ingest.pipeline --data-dir data/wet/ --index-dir data/index/
# 5. Build HNSW Dense Retrieval Graph
python -m quartz.hnsw.build --index-dir data/index/ --hnsw-dir data/hnsw/
# 6. Run Tests & Benchmarks
make test
make bench
# 7. Serve FastAPI Endpoints
uvicorn quartz.serve.api:app --port 8000
# Query (BM25 or Hybrid)
curl "http://localhost:8000/search?q=machine+learning&mode=hybrid&k=10"quartz/
βββ quartz-core/ β Rust crate (PyO3 extension)
β βββ src/
β β βββ lib.rs β PyO3 module definition
β β βββ codec.rs β VByte encode/decode + tests
β β βββ segment.rs β MemSegment, DiskSegment, mmap reads
β β βββ merge.rs β TieredMergePolicy + K-way merge
β β βββ wal.rs β WAL write/replay/checkpoint
β β βββ bm25.rs β BM25Scorer + WAND
β β βββ hnsw.rs β HNSW graph build + search
β β βββ simhash.rs β SimHash + band dedup
β βββ Cargo.toml
βββ quartz/ β Python package
β βββ ingest/
β β βββ wet_reader.py β WET streaming parser
β β βββ pipeline.py β ingestion orchestrator
β β βββ download.py β CC file downloader
β βββ serve/
β β βββ api.py β FastAPI app
β βββ eval/
β βββ beir_runner.py β BEIR harness
β βββ profiler.py β latency profiler
βββ benchmarks/
β βββ wand_ablation.py β WAND vs naive DAAT comparison
β βββ merge_amplification.py β write amplification measurement
β βββ ef_search_pareto.py β recall vs latency Pareto curve
βββ tests/
β βββ test_codec.py
β βββ test_segment.py
β βββ test_merge.py
β βββ test_e2e.py
βββ Cargo.toml β workspace root
βββ pyproject.toml
βββ Makefile
βββ docker/
βββ Dockerfile
βββ compose.yml
You can run all benchmarks at once using make bench, or run them individually:
# BEIR SciFact
python -m quartz.eval.beir_runner --dataset scifact --index data/index/ --output results/
# BEIR NFCorpus
python -m quartz.eval.beir_runner --dataset nfcorpus --index data/index/ --output results/
# Latency profiling (1000 queries, BM25)
python -m quartz.eval.profiler --index data/index/ --n-queries 1000
# WAND early termination ablation
python -m benchmarks.wand_ablation --index data/index/ --n-queries 500
# Tiered merge write amplification factor
python -m benchmarks.merge_amplification --index data/index/
# ef_search Pareto curve (HNSW recall vs latency at different ef values)
python -m benchmarks.ef_search_pareto --hnsw-dir data/hnsw/ --n-queries 200.github/workflows/ci.yml
Rust: cargo test --workspace, cargo clippy -D warnings, cargo fmt --check
Python: maturin develop --release, pytest tests/
Runs on: ubuntu-latest, macos-latest
MIT