A correctness-first Rust reimplementation of the BCR-memory v1 prefix cache, with regression tests for every bug we found in the original.
One-sentence summary: the flat-array DFA idea is sound, but the v1 implementation had four concrete bugs that made it unsafe under sustained churn. This repo fixes them, proves each fix with a dedicated test, and documents what we learned comparing the result to production prefix caches. See TECHNICAL_NOTE.md for the long version.
| Bug in v1 | Fix in v2 | Proof |
|---|---|---|
| `evict` only tombstoned the value; the state vector grew unboundedly | Proper state + edge reclamation with cascade-evict via parent-edge splicing | `scripts/compare_v1_v2.py`: v1 grows to 10 001 state slots over 10 k cycles; v2 stays at 2 |
| 64-bit block symbol with no collision check (upper 64 bits of Murmur3 were discarded) | Full `Symbol128 { lo, hi }`; both halves participate in equality | `test_128bit_hash_distinguishes_across_many_blocks` — 100 k distinct blocks, 0 collisions |
| Docs claimed O(1) per block; the code did an O(E) linear scan | Honest complexity docs + per-state HashMap index above `EDGE_SCAN_THRESHOLD = 8` | `hash_index_kicks_in_above_threshold` test |
| "Lock-free" claimed; no atomics, no single-writer enforcement | Rust `writer_guard` + Python `threading.Lock` wrapper | `test_concurrent_writers_under_lock_do_not_corrupt` (4 threads × 200 cycles) |
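The `Symbol128` fix above boils down to: if you hash blocks to 128 bits but compare only the low 64, two distinct blocks that agree on those low bits become indistinguishable. A minimal Python sketch of the idea (illustrative names only — the repo's actual type is the Rust `Symbol128`, and it uses Murmur3 rather than the stdlib `blake2b` used here):

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class Symbol128:
    """128-bit block symbol; equality uses BOTH halves (Python mirror of the
    Rust `Symbol128 { lo, hi }` sketch — hypothetical, not the repo's code)."""
    lo: int  # low 64 bits
    hi: int  # high 64 bits

def symbol_for_block(token_block: tuple) -> Symbol128:
    # blake2b with digest_size=16 stands in for the 128-bit Murmur3 output.
    digest = hashlib.blake2b(repr(token_block).encode(), digest_size=16).digest()
    return Symbol128(
        lo=int.from_bytes(digest[:8], "little"),
        hi=int.from_bytes(digest[8:], "little"),
    )

# Two blocks that happened to collide on the low 64 bits (the v1 comparison)
# would still be distinguished by `hi`, because the dataclass-generated
# equality compares both fields.
a = symbol_for_block((1, 2, 3))
b = symbol_for_block((1, 2, 4))
assert a != b
assert a == symbol_for_block((1, 2, 3))
```

The frozen dataclass gives structural equality and hashability for free, which is what a per-state edge index needs.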
| Experiment | Result |
|---|---|
| 14 Rust unit tests + 17 Python regression tests | 31 / 31 pass |
| v1 vs v2 churn (10 000 insert/evict cycles) | v1: 10 001 state slots (leak); v2: 2 (bounded) |
| Real-LLM correctness parity (Qwen2.5-7B tokenizer, 400 multi-turn prompts) | 0 divergences from reference across 537 ops |
| 30-min soak, 8 M cache ops | RSS 785.8 → 785.9 MB, state_liveness 1.0 throughout |
| vs. Python-dict reference, cache microbench | BCR match_prefix p50 3.2 µs (Python dict: 4.9 µs) |
Full numbers in TECHNICAL_NOTE.md §2–§3.
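The correctness-parity row above comes from differential testing: drive the cache under test and a trivially correct pure-Python reference with the same operation stream and assert identical answers. A minimal sketch of that pattern (toy classes with illustrative names — the repo's real reference lives in `experiments/reference_cache.py`, and the second side would be `BCRCache` rather than a second reference):

```python
import random

class ReferenceCache:
    """Trivially correct prefix cache: a dict from token-prefix tuples to
    values. Toy stand-in; not the repo's reference implementation."""
    def __init__(self, block_size):
        self.block_size = block_size
        self.blocks = {}  # tuple(prefix tokens) -> block value

    def insert(self, tokens, values):
        for i, v in enumerate(values):
            self.blocks[tuple(tokens[:(i + 1) * self.block_size])] = v

    def match_prefix(self, tokens):
        vals = []
        for i in range(len(tokens) // self.block_size):
            v = self.blocks.get(tuple(tokens[:(i + 1) * self.block_size]))
            if v is None:
                break
            vals.append(v)
        return len(vals) * self.block_size, vals

def differential_test(cache_a, cache_b, ops=500, seed=0):
    """Same random op stream into both caches; any divergence fails loudly."""
    rng = random.Random(seed)
    next_val = 0
    for _ in range(ops):
        tokens = [rng.randrange(50) for _ in range(16 * rng.randrange(1, 4))]
        if rng.random() < 0.5:
            values = list(range(next_val, next_val + len(tokens) // 16))
            next_val += len(values)
            cache_a.insert(tokens, values)
            cache_b.insert(tokens, values)
        else:
            assert cache_a.match_prefix(tokens) == cache_b.match_prefix(tokens)

differential_test(ReferenceCache(16), ReferenceCache(16))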
Requires Rust 1.75+, Python 3.9+, and maturin.
```bash
cd bcr_cache

# Rust core tests (no Python interpreter linking)
PYO3_USE_ABI3_FORWARD_COMPATIBILITY=1 cargo test --lib --no-default-features

# Build and install the Python wheel
PYO3_USE_ABI3_FORWARD_COMPATIBILITY=1 maturin build --release
pip install --force-reinstall target/wheels/bcr_cache-*.whl

# Regression tests
pytest tests/

# The headline empirical demonstration
python scripts/compare_v1_v2.py --cycles 10000
```

The `PYO3_USE_ABI3_FORWARD_COMPATIBILITY` flag is only needed on Python ≥ 3.13. We pinned pyo3 0.20 for ABI parity with v1.
```
bcr_cache/
├── Cargo.toml
├── pyproject.toml
├── src/
│   ├── core.rs          # pure-Rust BCRGraph (no PyO3; unit-testable standalone)
│   └── lib.rs           # thin PyO3 facade over core
├── python/bcr_cache/    # threading.Lock wrapper + SGLangBCRCache adapter
├── tests/               # Python regression suite (incl. v1-bug reproductions)
└── scripts/
    └── compare_v1_v2.py # side-by-side churn comparison
experiments/
├── prompts.py           # synthetic chat corpus with prefix reuse
├── reference_cache.py   # pure-Python reference cache for differential testing
├── exp1_correctness_parity.py
├── exp2_soak.py
├── exp3_throughput.py
├── plot_soak.py
├── serving_bench.sh     # GCP/L4 real-LLM harness for SGLang + vLLM
├── results/             # all JSON + CSV from the runs described in TECHNICAL_NOTE.md
└── README.md            # reproduction recipe
TECHNICAL_NOTE.md        # honest write-up with the state-of-the-art comparison
CHANGELOG.md
LICENSE                  # dual MIT / Apache-2.0
```
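The `threading.Lock` wrapper in `python/bcr_cache/` enforces the single-writer discipline the core needs. The pattern can be sketched as follows (hypothetical class and inner object — the repo's actual wrapper class name and method set may differ):

```python
import threading

class LockedCache:
    """Serialize every call into the non-thread-safe core through one lock.
    Sketch of the python/bcr_cache pattern; names are illustrative."""
    def __init__(self, inner):
        self._inner = inner          # e.g. the PyO3-backed core object
        self._lock = threading.Lock()

    def insert(self, tokens, values):
        with self._lock:
            return self._inner.insert(tokens, values)

    def match_prefix(self, tokens):
        with self._lock:
            return self._inner.match_prefix(tokens)

    def evict(self, value):
        with self._lock:
            return self._inner.evict(value)
```

This is deliberately a lock, not lock-freedom: v1 claimed the latter without atomics, and the honest fix is to serialize writers and say so.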
```python
from bcr_cache import BCRCache, SGLangBCRCache

# Direct use
c = BCRCache(block_size=16)
c.insert(token_ids, block_values)   # tokens already tokenized
matched_len, matched_vals = c.match_prefix(token_ids)
c.evict(some_block_value)
c.assert_invariants()               # runtime structural check

# SGLang BasePrefixCache adapter
cache = SGLangBCRCache(block_size=16)
cache.insert(token_ids, block_values)
cache.acquire([v1, v2]); cache.release([v1, v2])
```

- Not a replacement for SGLang's RadixCache in production. On the state-of-the-art comparison we ran (Qwen2.5-7B, L4, SGLang 0.5.10), RadixCache delivers a 21× TTFT win on cache-friendly traffic and does not leak under sustained load. BCR-memory-2 would need an actual SGLang scheduler integration and multi-model validation before it could claim to compete. See TECHNICAL_NOTE.md §3.
- Not a paper. See §4 of the technical note for why we did not pursue that direction.
Dual MIT / Apache-2.0 — see LICENSE.
```bibtex
@software{bcr_memory_2_2026,
  author = {Povilionis, Armanas},
  title  = {BCR-memory-2: a correctness-fixed block-compressed prefix cache},
  year   = {2026}
}
```