feat(loop): prefix-cache warmup — boost KV cache hit rate from ~58% to 95-99%#2326
Open
HUQIANTAO wants to merge 6 commits into
…rtup MCP server startup calls addTool() in a loop — N tools → N sorts + N cache invalidations. All intermediates are discarded before any API call; the real waste is the repeated sort/fingerprint churn. Add ImmutablePrefix.addTools() — a batch variant that deduplicates, sorts once, and invalidates once. For a 20-tool MCP server this saves 19 redundant sorts.
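The batch behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual implementation: the class name ImmutablePrefixSketch, the ToolSpec shape, the sortCount counter, and the "later definition wins" dedup policy are all assumptions made for the example.

```typescript
// Hypothetical sketch of a batch addTools(); only the method names addTool/addTools
// come from the PR text, everything else is illustrative.
interface ToolSpec {
  name: string;
  description: string;
}

class ImmutablePrefixSketch {
  private tools: ToolSpec[] = [];
  private cachedFingerprint: string | null = null;
  /** Instrumentation for the example: how many sorts have run. */
  sortCount = 0;

  /** Single-tool path: one sort and one invalidation per call. */
  addTool(tool: ToolSpec): void {
    this.addTools([tool]);
  }

  /** Batch path: deduplicate by name, sort once, invalidate once. */
  addTools(incoming: ToolSpec[]): void {
    const byName = new Map<string, ToolSpec>();
    for (const t of this.tools) byName.set(t.name, t);
    for (const t of incoming) byName.set(t.name, t); // assumed policy: later definition wins
    // Plain code-unit comparison keeps the order independent of locale settings.
    this.tools = Array.from(byName.values()).sort((a, b) =>
      a.name < b.name ? -1 : a.name > b.name ? 1 : 0
    );
    this.sortCount += 1;
    this.cachedFingerprint = null; // single invalidation for the whole batch
  }

  toolNames(): string[] {
    return this.tools.map((t) => t.name);
  }
}
```

With this shape, registering 20 tools through one addTools() call performs one sort, whereas 20 addTool() calls would perform 20.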
Before the first API call each turn, send a minimal warmup request with the same system prompt + tools + conversation history but:
- tool_choice=none (prevents tool output from ballooning the tail)
- max_tokens=16 (minimal reply)
- temperature=0 (deterministic)
- no streaming
This pre-establishes a DeepSeek KV cache unit covering the entire stable prefix. The real request that follows incurs only a tiny uncached suffix (the new user message), yielding ~99% cache hit rate compared to the current ~58%. The warmup is best-effort: failure is silently swallowed so a dead cache unit never blocks the real turn.
Summary
Add a lightweight "cache warmup" request before each real API call. The warmup sends the same system prompt + tools + conversation history as the real turn, but with tool_choice=none and max_tokens=16, which limits the model to a very short text reply (typically a single token). This pre-establishes a DeepSeek KV cache unit covering the full stable prefix, so the real request that follows enjoys a ~99% cache hit rate instead of the current ~58%.
Background: How DeepSeek KV Cache Works
DeepSeek's disk cache stores prefix units from each request. A subsequent request gets a cache hit when its leading tokens completely match an existing cache unit. Cache units are created at three points:
The critical constraint: a cache unit is a complete, independent unit. A request only matches a unit when its prefix exactly matches the entire unit. Partial overlap does not count.
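The whole-unit matching rule can be illustrated with a small model. This is a simplified sketch based only on the description above, not DeepSeek's actual implementation: tokens are modelled as plain numbers, and the function names cachedPrefixLength and hitRate are made up for the example.

```typescript
type Tokens = readonly number[];

/**
 * Number of leading request tokens served from cache under the whole-unit rule:
 * a cache unit counts only if the request starts with the ENTIRE unit.
 * Units that merely overlap the request partially contribute nothing.
 */
function cachedPrefixLength(request: Tokens, units: readonly Tokens[]): number {
  let best = 0;
  for (const unit of units) {
    // A unit longer than the request cannot be fully matched; shorter than best adds nothing.
    if (unit.length > request.length || unit.length <= best) continue;
    let matchesWholeUnit = true;
    for (let i = 0; i < unit.length; i++) {
      if (unit[i] !== request[i]) {
        matchesWholeUnit = false;
        break;
      }
    }
    if (matchesWholeUnit) best = unit.length;
  }
  return best;
}

/** Fraction of the request's tokens that are cache hits in this model. */
function hitRate(request: Tokens, units: readonly Tokens[]): number {
  return request.length === 0 ? 0 : cachedPrefixLength(request, units) / request.length;
}
```

In this model a request [1, 2, 3, 4, 9, 9] checked against units [1, 2, 3] and [1, 2, 3, 4, 5] hits only the first unit: the second shares four leading tokens but is not matched in full, so it does not count.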
Multi-turn example from the docs
The cached portion is everything shared with a prior request. The uncached portion is everything new.
Why Reasonix Gets ~58% Today
In a typical two-turn conversation:
The problem is not the system prompt size or tool count — those are cached. The problem is that the assistant response contains tool calls on Turn 1, and the tool results inflate Turn 2's uncached tail to ~3200 tokens.
The Fix: Cache Warmup
Before the real API call, send a warmup request with identical structure but minimal output constraints:
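As a rough illustration of what such a warmup request could look like, the sketch below derives a warmup payload from the payload of the real turn. The ChatPayload and ChatMessage types and the buildWarmupPayload name are assumptions for this example and are not taken from the PR's code; only the four overridden settings (tool_choice=none, max_tokens=16, temperature=0, no streaming) come from the description.

```typescript
// Hypothetical payload shape, loosely following OpenAI-compatible chat requests.
interface ChatMessage {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

interface ChatPayload {
  model: string;
  messages: ChatMessage[];
  tools?: unknown[];
  tool_choice?: "none" | "auto" | "required";
  max_tokens?: number;
  temperature?: number;
  stream?: boolean;
}

/**
 * Copy the payload and override only the output constraints.
 * System prompt, tools, and history are left untouched so the
 * prefix stays identical to the one the real request will send.
 */
function buildWarmupPayload(real: ChatPayload): ChatPayload {
  return {
    ...real,
    messages: real.messages.slice(), // shallow copy so the caller's array is not shared
    tool_choice: "none", // no tool calls in the warmup reply
    max_tokens: 16, // minimal reply
    temperature: 0, // deterministic
    stream: false, // non-streaming request
  };
}
```

Keeping the override list this small is the point of the design: anything else that differed between warmup and real request would risk changing the prefix.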
The key insight: tool_choice=none prevents the warmup model from making tool calls. The warmup reply is literally one character ("."), so it does not create a large uncached tail for the real request. The only byte-level difference between the warmup and real request is the final user message, a few dozen tokens at most.
Test Results
Real multi-turn conversation on deepseek-v4-pro with this PR:
Prefix fingerprint e61c6697 is stable across all turns, so the warmup does not introduce prefix drift. All miss reasons are unknown (DeepSeek's provider-side cache state), not system-prompt-changed or tool-list-changed, confirming the prefix is byte-stable.
Code Changes
- src/types.ts
- src/client.ts: tool_choice through in buildPayload()
- src/loop.ts: warmupCache() private method + cacheWarmup option + invocation in step()
- src/cli/commands/desktop.ts
- src/cli/ui/App.tsx
Core logic (src/loop.ts)
Invocation in step()
Where warmup is enabled (and where it is not)
- reasonix code (TUI)
- reasonix desktop (Electron)
- reasonix run (one-shot)
- acp protocol
Benefits
Potential Downsides — Analyzed
1. "Every turn costs an extra API call"
True, but net beneficial. The warmup costs ~$0.00015 per turn (300 input tokens + 3 output tokens on deepseek-v4-pro). A single miss token on the real request costs ~$0.00055. The warmup saves thousands of miss tokens on turn 2+. Break-even is turn 2. For a 10-turn conversation, net savings exceed $0.15.
2. "The warmup adds latency to every message"
True, 1-3 seconds. The warmup sends a non-streaming request with max_tokens=16. The model responds with a single token (typically "."). Round-trip time is dominated by network latency + TTFT, typically 1-3 seconds. The user sees "⟳ warming cache…" during this time. For users on slow connections, this overhead may be noticeable.
Mitigation: the warmup could be made streaming in the future (stream the "." response) to reduce perceived latency, though the practical difference is negligible since the model generates only 1-3 tokens.
3. "What if the warmup fails?"
Gracefully handled. The warmup is wrapped in try/catch. Any failure (network error, API 4xx/5xx, timeout) is silently swallowed. The real turn proceeds normally — it just gets the current ~58% hit rate instead of ~99%. The warmup is strictly additive; it never degrades below the baseline.
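The best-effort behaviour amounts to a wrapper like the one below. This is a minimal sketch under stated assumptions: the function name warmupBestEffort and the boolean return value are invented for the example, and the PR's own warmupCache() method may be structured differently.

```typescript
/**
 * Run a warmup request and never let its failure escape.
 * `send` is a hypothetical callback that performs the warmup API call.
 * Resolves to true if the warmup succeeded and false if it failed for any reason.
 */
async function warmupBestEffort(send: () => Promise<unknown>): Promise<boolean> {
  try {
    await send();
    return true;
  } catch {
    // Network error, API 4xx/5xx, timeout: swallow it and fall back to the baseline.
    return false;
  }
}
```

A caller would await this before the real request and continue regardless of the result; the return value is only useful for logging or metrics.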
4. "Does the warmup itself pollute the cache?"
No. DeepSeek's cache has a TTL of "hours to days" per the docs. A warmup cache unit (system + tools + history + ".") is immediately overwritten by the real request's cache unit (system + tools + history + real_msg) on the next API call. Both units share the same prefix, so they do not compete for cache space.
5. "What about the first turn — is there a warmup then too?"
Yes, but it is a cold start. The first warmup request has no prior cache to match, so it incurs a full cache miss in addition to the real turn's miss. This means Turn 1 costs ~2× more than before. However, this cost is amortized over the entire conversation. Even a 2-turn conversation (warmup → real → warmup → real) breaks even because Turn 2's real request enjoys near-100% hit.
6. "Could tool_choice=none break some models?"
No. tool_choice is a standard OpenAI-compatible parameter supported by DeepSeek since their API v1. Setting it to "none" tells the model not to call any tools and to respond with text only. This is the intended mechanism for cache warming. All DeepSeek models (v3, r1, v4-flash, v4-pro) support this parameter.
How to test
- /cache-miss-report → inspect per-turn cache hit rates
- reason: no-miss
To compare against main:
git checkout main
npx reasonix code
# Same conversation flow → /cache-miss-report → expect ~55-65% hit rate
To disable warmup on this branch, set cacheWarmup: false in any CacheFirstLoop constructor.
Verification
- npx tsc --noEmit: passes
- npx vitest run tests/cache-shape.test.ts tests/lru.test.ts "tests/loop*.test.ts": 190 passed, 1 skipped