goodbaes/tscm

TSCM — Topic-Sharded Context Manager

Give each conversation topic its own KV cache slot in llama.cpp. No topic mixing, no context overflow, and a flat time to first token (TTFT) regardless of how many topics are active.

RTX 3090, UnslopNemo 12B Q6_K, 40 parallel topics:

TTFT (cache hit): 32–36 ms     (constant across all topics, all cycles)
Context overflows: 0           (each topic isolated in its own 4096-token slot)
VRAM: 24011 MB / 24576 MB      (≈ 40 × 335 MB/slot + 10823 MB model weights)

The problem

Local LLMs fail in multi-topic sessions in two ways:

Naive (shared context): all topics accumulate in one slot. With -c 4096, the context overflows at around message 55: the server returns HTTP 400, the message is lost, and the session needs a full reset. The cycle then repeats roughly every 56 messages.

llama.cpp prefix cache (single slot, per-topic contexts): smarter than expected — stores all topic histories as a radix tree until budget is exceeded. But once total history exceeds the slot budget, LRU evictions cause unpredictable latency spikes:

msg 68  fitness    57 ms   prompt_n 124   ← evicted from cache
msg 80  python     80 ms   prompt_n 189   ← evicted
msg 108 fitness    90 ms   prompt_n 203   ← evicted again

Multislot (TSCM): pin each topic to a fixed slot_id. Each topic gets dedicated GPU memory, so there are no evictions, no shared LRU, and no cross-topic contamination.


Results

All measurements on RTX 3090 / llama.cpp b9682 CUDA 12.4 / UnslopNemo 12B Q6_K (Mistral NeMo) / 40 topics × 160 messages.

| Approach | TTFT typical | Crisis behavior | Topic isolation |
|---|---|---|---|
| Multislot | 32–36 ms flat | none | ✅ dedicated slot |
| Naive_pertopic (shared slot) | 14–37 ms | 57–106 ms LRU spikes | |
| Naive (shared context) | 33–45 ms | HTTP 400, msg lost | |

Multislot is not faster on average — it is deterministic. TTFT is guaranteed regardless of access pattern, topic count, or conversation history length.

VRAM per slot (measured)

| Model | ctx/slot (tokens) | KV dtype | MB/slot | Max slots (RTX 3090) |
|---|---|---|---|---|
| UnslopNemo 12B Q6_K | 4096 | q8_0 | 335 | ~41 |
| gemma4-fast (3.19 GB) | 4096 | q8_0 | ~27 | ~480 |

Formula: max_slots = (VRAM_total - model_weights_MB) / MB_per_slot
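
As a rough worked example of the formula, here it is applied to the RTX 3090 figures from this README (24576 MB total, 10823 MB of weights, 335 MB per slot). The function name and parameters are illustrative rather than part of the repo, and the estimate ignores compute buffers and any other VRAM overhead:

```python
def max_slots(vram_total_mb: float, model_weights_mb: float, mb_per_slot: float) -> int:
    """Estimate how many KV slots fit in the VRAM left after loading weights."""
    free_mb = vram_total_mb - model_weights_mb
    if free_mb <= 0 or mb_per_slot <= 0:
        return 0
    # Round down: a partially allocated slot is unusable.
    return int(free_mb // mb_per_slot)


# RTX 3090 with UnslopNemo 12B Q6_K, figures taken from the table above.
print(max_slots(24576, 10823, 335))
```

In practice it is probably wise to leave a slot or two of headroom below this estimate, since the measured run at 40 slots already used 24011 MB of the 24576 MB available.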

Critical flag: -c in llama-server is total context divided across slots, not per-slot:

# WRONG: -c 4096 --parallel 20  →  256 tokens/slot (b9682: 4096 ÷ 20 = 204, rounded up to 256)
# RIGHT: -c 81920 --parallel 20  →  4096 tokens/slot
llama-server -c 163840 --parallel 40  # 40 slots × 4096 tokens each
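
The relationship between -c and --parallel can be checked with a few lines of arithmetic. This sketch assumes a plain even split of the total context; the helper names are hypothetical, and the rounding up to 256 observed on b9682 is deliberately not modeled:

```python
def total_ctx(per_slot_ctx: int, parallel: int) -> int:
    """Value to pass to -c so that each slot gets per_slot_ctx tokens."""
    return per_slot_ctx * parallel


def even_split(total: int, parallel: int) -> int:
    """Tokens per slot under a plain even split of -c across slots."""
    return total // parallel


print(total_ctx(4096, 20))   # -c for 20 slots of 4096 tokens
print(total_ctx(4096, 40))   # -c for 40 slots of 4096 tokens
print(even_split(4096, 20))  # what -c 4096 --parallel 20 leaves per slot
```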

How it works

Each topic gets a fixed slot_id at first message. All KV caches live in GPU RAM simultaneously — no save/restore, no SSD I/O, no alignment issues.

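topic_slots: dict[str, int] = {}  # topic → slot_id
next_slot = 0

for message in conversation:
    topic = route(message)  # or manual tagging
    if topic not in topic_slots:
        # First message on this topic: claim the next unused slot.
        # next_slot must stay below --parallel (see Open problems).
        topic_slots[topic] = next_slot
        next_slot += 1

    # route, build_prompt, topic_history and llama_server are the caller's own helpers.
    response = llama_server.completion(
        prompt=build_prompt(topic_history[topic], message),
        slot_id=topic_slots[topic],   # ← the key line
        cache_prompt=True,
    )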

Within each slot the prefix cache reuses the topic's entire prior history, so prompt_n covers only the new tokens rather than the full history.
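
For readers who want something closer to runnable, here is a minimal sketch of the slot-pinning idea against llama-server's HTTP API, using only the standard library. It assumes the server from the Quick start is listening on port 8081 and that the slot is selected with the `id_slot` field of POST /completion (the name documented by current llama.cpp server builds; older builds used `slot_id`). `SlotTable`, `build_payload`, and `complete` are hypothetical names, not the repo's core/backend.py:

```python
import json
import urllib.request


class SlotTable:
    """Pins each topic to a fixed slot index the first time it is seen."""

    def __init__(self, n_slots: int) -> None:
        self.n_slots = n_slots          # should match --parallel
        self.slots: dict[str, int] = {}

    def slot_for(self, topic: str) -> int:
        if topic not in self.slots:
            if len(self.slots) >= self.n_slots:
                raise RuntimeError("no free slot; eviction is not handled here")
            self.slots[topic] = len(self.slots)
        return self.slots[topic]


def build_payload(prompt: str, slot: int, n_predict: int = 128) -> dict:
    """Request body for POST /completion (field names are assumptions, see above)."""
    return {
        "prompt": prompt,
        "id_slot": slot,        # pin the request to this topic's slot
        "cache_prompt": True,   # reuse the slot's cached prefix
        "n_predict": n_predict,
    }


def complete(base_url: str, payload: dict) -> dict:
    """Send the request to a running llama-server and return the parsed JSON."""
    req = urllib.request.Request(
        f"{base_url}/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


table = SlotTable(n_slots=20)
payload = build_payload("Hello", table.slot_for("fitness"))
# result = complete("http://localhost:8081", payload)  # needs a live server
```

Keeping the topic-to-slot mapping separate from the HTTP call makes the allocation logic testable without a GPU, and leaves one obvious place to add an eviction policy later.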

Architecture

core/
  shard.py          — project(), prompt_for_completion(), ChatTemplate
  event_log.py      — append-only JSONL event log
  cache_manager.py  — KV save/restore for SSD fallback path
  backend.py        — llama-server HTTP client

bench/
  bench_shards.py   — naive / naive_pertopic / multislot comparison
  bench_stress.py   — VRAM sweep (how many slots fit?)
  gen_fixtures.py   — fixture generator, supports --interleaved (round-robin)
  fixtures/         — f3_stress_20.jsonl, f3_stress_40.jsonl, *_rr.jsonl

Supported chat templates

| Template | Model family | SWA issue |
|---|---|---|
| mistral_nemo | Mistral NeMo, UnslopNemo | No: perfect prefix cache |
| gemma | Gemma 4 | Yes: ~70 token re-prefill per cycle (checkpoint gap in b9682) |

Recommendation: use Mistral-architecture models for multislot. Gemma 4 has a hidden re-prefill overhead in b9682: an internal checkpoint at around position 124 causes ~70 extra tokens to be recomputed on each new cycle. Gemma 4 also uses SWA (n_swa=512), but that is a separate mechanism; the checkpoint behavior is internal to llama.cpp.


Quick start

# 1. dependencies
uv sync

# 2. edit config.yaml — set model path and server flags
# IMPORTANT: ctx = per_slot_ctx × parallel

# 3. start llama-server (example: 20 topics × 4096 ctx on RTX 3090)
llama-server.exe \
  -m <model.gguf> \
  -c 81920 -ngl 99 --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --parallel 20 --port 8081

# 4. run stress test
uv run python -m bench.gen_fixtures 20
uv run python -m bench.bench_stress --topics 20

# 5. run naive vs multislot comparison
uv run python -m bench.bench_shards --mode multislot --fixture f3_stress_20.jsonl
uv run python -m bench.bench_shards --mode naive --fixture f3_stress_20.jsonl

Key findings

1. llama.cpp prefix cache is already smart, until it isn't. With cache_prompt=true on a single slot, llama-server stores all topic histories as a radix tree. With 40 topics at ~70 tokens/exchange (2845 tokens total), everything fits in a 4096-token slot and prompt_n = 1 (14 ms). The problem is eviction once total history exceeds the budget: LRU makes latency unpredictable.

2. Multislot changes "shared LRU cache" to "dedicated per-topic memory." Not faster on average. Deterministic by design.

3. Model architecture matters more than model size. Gemma 4 produces ~70 extra re-prefill tokens per cycle in b9682 (internal checkpoint at ~position 124). Mistral NeMo: cycle 2 = cycle 1 in TTFT. Choose models without sliding window attention for multislot.

4. SSD restore breakeven is non-obvious. On RTX 3090 (~2000 tok/s prefill): restore from NVMe = 260–965 ms, re-prefill 300 tokens = ~150 ms. SSD path only pays off for conversations with >600–1000 tokens of accumulated history.

5. The -c flag is total, not per-slot. -c 4096 --parallel 20 = 256 tokens/slot, not 4096. This cost an hour to debug. Always: -c = per_slot_ctx × parallel.
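
The arithmetic behind finding 4 can be made explicit. The sketch below uses the ~2000 tok/s prefill rate quoted there and treats the NVMe restore time as a fixed measured input, which is a simplification, since restore time presumably grows with the amount of saved KV state. The function names are illustrative:

```python
def reprefill_ms(n_tokens: int, prefill_tok_per_s: float) -> float:
    """Time to recompute n_tokens of history from scratch."""
    return n_tokens * 1000.0 / prefill_tok_per_s


def restore_is_cheaper(n_tokens: int, restore_ms: float, prefill_tok_per_s: float) -> bool:
    """True when loading the saved KV cache beats re-prefilling the history."""
    return restore_ms < reprefill_ms(n_tokens, prefill_tok_per_s)


print(reprefill_ms(300, 2000))              # 300 tokens at ~2000 tok/s
print(restore_is_cheaper(300, 260, 2000))   # fastest measured restore vs 300 tokens
print(restore_is_cheaper(1000, 260, 2000))  # same restore time vs 1000 tokens
```

Plugging in measured restore times for a given history length, rather than a single constant, would give a more faithful breakeven point.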


Open problems

  • LRU eviction policy — what happens when topic N+1 arrives and VRAM is full? Evict by recency, by context length (shorter = cheaper to re-prefill), or hybrid? Not yet implemented.
  • Router — automatic topic detection from message content. Phase 2.
  • Persistence across server restarts — multislot loses all KV on restart. SSD path survives restarts.
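
One possible shape for the first open problem, sketched as plain recency-based eviction over a fixed pool of slots. This is not implemented in the repo; `LruSlotPool` and its methods are hypothetical, and a real version would also have to deal with the evicted topic's KV cache on the server (for example by letting the new topic's prompt overwrite it):

```python
from collections import OrderedDict
from typing import Optional


class LruSlotPool:
    """Maps topics to a fixed number of slots, evicting the least recently used."""

    def __init__(self, n_slots: int) -> None:
        self.free = list(range(n_slots))   # slot ids not yet handed out
        self.by_topic = OrderedDict()      # topic -> slot id, oldest first

    def acquire(self, topic: str) -> tuple[int, Optional[str]]:
        """Return (slot_id, evicted_topic); evicted_topic is None if nothing was evicted."""
        if topic in self.by_topic:
            self.by_topic.move_to_end(topic)   # mark as most recently used
            return self.by_topic[topic], None
        evicted = None
        if self.free:
            slot = self.free.pop(0)
        else:
            # Pool is full: drop the least recently used topic and reuse its slot.
            evicted, slot = self.by_topic.popitem(last=False)
        self.by_topic[topic] = slot
        return slot, evicted


pool = LruSlotPool(n_slots=2)
print(pool.acquire("fitness"))
print(pool.acquire("python"))
print(pool.acquire("fitness"))   # refreshes recency of "fitness"
print(pool.acquire("cooking"))   # pool is full, so "python" is evicted
```

A hybrid policy could replace the `popitem` call with a scoring step that also weighs each topic's context length, since shorter histories are cheaper to re-prefill.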

Hardware tested: MacBook Air M1 16 GB (Metal), RTX 3090 24 GB (CUDA 12.4). Server: llama.cpp b9682. Model: UnslopNemo-12B-v4.1 Q6_K (Mistral NeMo).
