Give each conversation topic its own KV cache slot in llama.cpp. No topic mixing, no context overflow, flat TTFT regardless of how many topics are active.
RTX 3090, UnslopNemo 12B Q6_K, 40 parallel topics:
TTFT (cache hit): 32–36 ms — constant across all topics, all cycles
Context overflows: 0 — each topic isolated in its own 4096-token slot
VRAM: 24011 MB / 24576 MB (roughly 40 × 335 MB/slot + 10823 MB model weights)
Local LLMs fail in multi-topic sessions in two ways:
Naive (shared context): all topics accumulate in one slot. With -c 4096, the context overflows at message ~55 → HTTP 400, the message is lost, and the session needs a full reset. This repeats about every 56 messages.
llama.cpp prefix cache (single slot, per-topic contexts): smarter than expected — stores all topic histories as a radix tree until budget is exceeded. But once total history exceeds the slot budget, LRU evictions cause unpredictable latency spikes:
msg 68 fitness 57 ms prompt_n 124 ← evicted from cache
msg 80 python 80 ms prompt_n 189 ← evicted
msg 108 fitness 90 ms prompt_n 203 ← evicted again
Multislot (TSCM): pin each topic to a fixed slot_id. Each topic gets dedicated GPU memory: no evictions, no shared LRU, no cross-topic contamination.
All measurements on RTX 3090 / llama.cpp b9682 CUDA 12.4 / UnslopNemo 12B Q6_K (Mistral NeMo) / 40 topics × 160 messages.
| Approach | TTFT typical | Crisis behavior | Topic isolation |
|---|---|---|---|
| Multislot | 32–36 ms flat | none | ✅ dedicated slot |
| Naive_pertopic (shared slot) | 14–37 ms | 57–106 ms LRU spikes | ❌ |
| Naive (shared context) | 33–45 ms | HTTP 400, msg lost | ❌ |
Multislot is not faster on average; it is deterministic. In these runs TTFT stayed in the 32–36 ms band regardless of access pattern, topic count, or conversation history length.
| Model | ctx/slot | KV dtype | MB/slot | Max slots (RTX 3090) |
|---|---|---|---|---|
| UnslopNemo 12B Q6_K | 4096 | q8_0 | 335 | ~41 |
| gemma4-fast (3.19 GB) | 4096 | q8_0 | ~27 | ~480 |
Formula: max_slots = (VRAM_total - model_weights_MB) / MB_per_slot
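As a minimal sketch of the formula above, the slot budget and the matching `-c` value can be computed as follows. The function names `max_slots` and `total_ctx` are illustrative and not part of the repository. The sketch assumes a constant per-slot cost and ignores compute buffers and other overhead, so the result is best read as an upper bound.

```python
def max_slots(vram_total_mb: int, model_weights_mb: int, mb_per_slot: int) -> int:
    """Upper bound on slot count: VRAM left after weights, divided by per-slot KV cost."""
    free_mb = vram_total_mb - model_weights_mb
    if free_mb <= 0 or mb_per_slot <= 0:
        return 0
    return free_mb // mb_per_slot


def total_ctx(per_slot_ctx: int, parallel: int) -> int:
    """Value to pass as -c so that each of `parallel` slots gets `per_slot_ctx` tokens."""
    return per_slot_ctx * parallel


if __name__ == "__main__":
    # RTX 3090 (24576 MB) with the UnslopNemo 12B Q6_K row from the table
    print(max_slots(24576, 10823, 335))
    print(total_ctx(4096, 40))
```

With the UnslopNemo numbers this gives 41 slots, and 40 slots of 4096 tokens need `-c 163840`.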
Critical flag: -c in llama-server is the total context divided across slots, not the per-slot context:

    # WRONG: -c 4096 --parallel 20 → 256 tokens/slot (b9682: 4096 ÷ 20 = 204, rounded up to 256)
    # RIGHT: -c 81920 --parallel 20 → 4096 tokens/slot
    llama-server -c 163840 --parallel 40  # 40 slots × 4096 tokens each

Each topic gets a fixed slot_id at its first message. All KV caches live in GPU memory simultaneously: no save/restore, no SSD I/O, no alignment issues.
    topic_slots: dict[str, int] = {}  # topic -> slot_id
    next_slot = 0

    for message in conversation:
        topic = route(message)  # or manual tagging
        if topic not in topic_slots:
            # first message of a new topic: pin it to the next free slot
            topic_slots[topic] = next_slot
            next_slot += 1
        response = llama_server.completion(
            prompt=build_prompt(topic_history[topic], message),
            slot_id=topic_slots[topic],  # the key line
            cache_prompt=True,
        )

The prefix cache inside each slot works as intended: prompt_n counts only the new tokens, not the full history.
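At the HTTP level, slot pinning comes down to two request fields. The sketch below assumes a recent llama-server build, where the `/completion` endpoint takes the slot as `id_slot` together with `cache_prompt`. The `SlotTable` class, the helper names, and the `http://localhost:8081` URL in the usage note are illustrative and not taken from `core/backend.py`. It also adds a capacity check, since assigning more topics than `--parallel` slots would otherwise hand out slot ids that do not exist.

```python
import json
import urllib.request


class SlotTable:
    """Maps each topic to a fixed slot id, up to `parallel` slots."""

    def __init__(self, parallel: int) -> None:
        self.parallel = parallel
        self.slots: dict[str, int] = {}

    def slot_for(self, topic: str) -> int:
        if topic not in self.slots:
            if len(self.slots) >= self.parallel:
                raise RuntimeError(f"no free slot for topic {topic!r}")
            self.slots[topic] = len(self.slots)  # next unused slot id
        return self.slots[topic]


def build_payload(prompt: str, slot: int, n_predict: int = 128) -> dict:
    """JSON body for POST /completion (field names assumed, see lead-in)."""
    return {
        "prompt": prompt,
        "id_slot": slot,        # pin the request to this slot
        "cache_prompt": True,   # reuse the slot's cached prefix
        "n_predict": n_predict,
    }


def complete(base_url: str, prompt: str, slot: int) -> dict:
    """Send one completion request; requires a running llama-server."""
    req = urllib.request.Request(
        f"{base_url}/completion",
        data=json.dumps(build_payload(prompt, slot)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

A call would look like `complete("http://localhost:8081", prompt, table.slot_for("fitness"))`. Raising on a full table is one possible choice here; what to do when no slot is free is still an open design question.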
core/
shard.py — project(), prompt_for_completion(), ChatTemplate
event_log.py — append-only JSONL event log
cache_manager.py — KV save/restore for SSD fallback path
backend.py — llama-server HTTP client
bench/
bench_shards.py — naive / naive_pertopic / multislot comparison
bench_stress.py — VRAM sweep (how many slots fit?)
gen_fixtures.py — fixture generator, supports --interleaved (round-robin)
fixtures/ — f3_stress_20.jsonl, f3_stress_40.jsonl, *_rr.jsonl
| Template | Model family | SWA issue |
|---|---|---|
| mistral_nemo | Mistral NeMo, UnslopNemo | No: perfect prefix cache |
| gemma | Gemma 4 | Yes: ~70 token re-prefill per cycle (checkpoint gap in b9682) |
Recommendation: use Mistral-architecture models for multislot. Gemma 4 has a hidden re-prefill overhead in b9682: an internal checkpoint at ~position 124 causes ~70 extra tokens to be recomputed on each new cycle. Gemma 4 also uses SWA (n_swa=512), but that's a separate mechanism — the checkpoint behavior is llama.cpp-internal.
    # 1. dependencies
    uv sync

    # 2. edit config.yaml: set model path and server flags
    # IMPORTANT: ctx = per_slot_ctx × parallel

    # 3. start llama-server (example: 20 topics × 4096 ctx on RTX 3090)
    llama-server.exe \
      -m <model.gguf> \
      -c 81920 -ngl 99 --flash-attn on \
      --cache-type-k q8_0 --cache-type-v q8_0 \
      --parallel 20 --port 8081

    # 4. run stress test
    uv run python -m bench.gen_fixtures 20
    uv run python -m bench.bench_stress --topics 20

    # 5. run naive vs multislot comparison
    uv run python -m bench.bench_shards --mode multislot --fixture f3_stress_20.jsonl
    uv run python -m bench.bench_shards --mode naive --fixture f3_stress_20.jsonl

1. llama.cpp prefix cache is already smart, until it isn't.
With cache_prompt=true on a single slot, llama-server stores all topic histories as a radix tree. With 40 topics × ~70 tokens/exchange (2845 tokens total), everything fits in a 4096-token slot and prompt_n = 1 (14 ms). The problem is eviction once total history exceeds the budget: LRU makes latency unpredictable.
2. Multislot changes "shared LRU cache" to "dedicated per-topic memory." Not faster on average. Deterministic by design.
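3. Model architecture matters more than model size. Gemma 4 produces ~70 extra re-prefill tokens per cycle in b9682 (internal checkpoint at ~position 124). Mistral NeMo: cycle 2 = cycle 1 in TTFT. For multislot, prefer models that show no such re-prefill gap; as noted in the recommendation above, the gap comes from llama.cpp's internal checkpointing rather than from sliding window attention itself.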
4. SSD restore breakeven is non-obvious. On RTX 3090 (~2000 tok/s prefill): restore from NVMe = 260–965 ms, re-prefill 300 tokens = ~150 ms. SSD path only pays off for conversations with >600–1000 tokens of accumulated history.
5. The -c flag is total, not per-slot.
-c 4096 --parallel 20 = 256 tokens/slot, not 4096. This cost an hour to debug. Always: -c = per_slot_ctx × parallel.
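The breakeven in finding 4 is simple arithmetic once a prefill rate is fixed. A minimal sketch, assuming prefill time scales linearly with token count at the ~2000 tok/s quoted there (the function names are illustrative):

```python
def reprefill_ms(tokens: int, prefill_tok_per_s: float = 2000.0) -> float:
    """Time to recompute `tokens` prompt tokens at a constant prefill rate."""
    return tokens * 1000.0 / prefill_tok_per_s


def breakeven_tokens(restore_ms: float, prefill_tok_per_s: float = 2000.0) -> float:
    """History length at which re-prefill takes as long as an SSD restore."""
    return restore_ms * prefill_tok_per_s / 1000.0


if __name__ == "__main__":
    print(reprefill_ms(300))              # the 300-token case from finding 4
    for restore in (260.0, 965.0):        # measured NVMe restore range, in ms
        print(restore, breakeven_tokens(restore))
```

Under this linear model the 260–965 ms restore range corresponds to breakeven histories of 520 to 1930 tokens, which brackets the 600–1000 token figure in finding 4. Restore time in practice depends on the size of the saved cache, so a single rate is only a rough guide.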
- LRU eviction policy — what happens when topic N+1 arrives and VRAM is full? Evict by recency, by context length (shorter = cheaper to re-prefill), or hybrid? Not yet implemented.
- Router — automatic topic detection from message content. Phase 2.
- Persistence across server restarts — multislot loses all KV on restart. SSD path survives restarts.
Hardware tested: MacBook Air M1 16 GB (Metal), RTX 3090 24 GB (CUDA 12.4). Server: llama.cpp b9682. Model: UnslopNemo-12B-v4.1 Q6_K (Mistral NeMo).