
v0.4 — KV cache tiering (hot RAM + cold SSD)#26

Merged
magicnight merged 6 commits into main from feat/v0.4-kv-cache-tiering
Apr 18, 2026


@magicnight
Owner

Summary

First of three v0.4.0 engine-parity sub-features (with ModelPool + MCP server to follow as separate PRs).

  • PromptCacheKey — sha256 hash over (modelID, tokens) with 16-way shard layout
  • PromptCacheStore actor — hot dict + cold safetensors LRU, Sendable snapshot wrapper
  • MLXSwiftEngine.generate() wired to check store + save updated cache post-generation — HIT/MISS debug logs under `engine` category
  • Settings → "KV Cache" section with hot/cold budget steppers + Clear All button
  • New tests: 5 PromptCacheKey cases + 4 PromptCacheStore cases (3 require Metal; skip gracefully on SPM)
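
As a rough illustration of the key described above, here is a minimal sketch of a SHA-256 key over (modelID, tokens). The type name `SketchCacheKey`, the byte encoding of the tokens, and the choice of the first hex digit as the shard (which yields 16 shards) are all assumptions made for this example; the PR's actual `PromptCacheKey` may differ.

```swift
import CryptoKit
import Foundation

/// Hypothetical stand-in for the PR's PromptCacheKey; names and encoding are assumptions.
struct SketchCacheKey: Hashable, Sendable {
    /// Lowercase hex SHA-256 digest (64 characters).
    let hash: String

    init(modelID: String, tokens: [Int]) {
        var hasher = SHA256()
        hasher.update(data: Data(modelID.utf8))
        // Separator byte so the model ID cannot run into the token bytes.
        hasher.update(data: Data([0]))
        for token in tokens {
            // Fixed-width little-endian encoding keeps the digest platform independent.
            let value = Int64(token).littleEndian
            withUnsafeBytes(of: value) { hasher.update(bufferPointer: $0) }
        }
        hash = hasher.finalize().map { String(format: "%02x", $0) }.joined()
    }

    /// Assumed shard rule: the first hex digit, giving 16 possible shards.
    var shard: String { String(hash.prefix(1)) }

    /// Mirrors the `root/<shard>/<hash>.safetensors` layout from the commit message.
    func fileURL(root: URL) -> URL {
        root.appendingPathComponent(shard)
            .appendingPathComponent("\(hash).safetensors")
    }
}
```

Two keys built from the same model ID and token list compare equal and map to the same file, while changing a single token yields a different digest.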

MVP scope

  • Full-prompt hash (new prompt must START with old prompt as prefix to benefit) — vLLM-style block-level chained hashing for longest-common-prefix match is v0.4.0.1
  • Hot LRU is entry-count-based (8 entries). The MB slider value is persisted now so a future byte-accurate budget can use it
  • Cold tier has no automatic pruning yet; users clear it manually via "Clear All"
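
To make the entry-count policy concrete, the following is a small sketch of an LRU bounded by number of entries rather than bytes. `EntryCountLRU` and its method names are hypothetical; only the default capacity of 8 comes from the description above.

```swift
/// Entry-count LRU sketch; the type and its API are assumptions, not the PR's code.
struct EntryCountLRU<Key: Hashable, Value> {
    let capacity: Int
    private var values: [Key: Value] = [:]
    private var order: [Key] = []   // least recently used first

    init(capacity: Int = 8) {
        precondition(capacity > 0, "capacity must be positive")
        self.capacity = capacity
    }

    var count: Int { values.count }

    /// Returns the value and marks the key as most recently used.
    mutating func get(_ key: Key) -> Value? {
        guard let value = values[key] else { return nil }
        touch(key)
        return value
    }

    /// Inserts or updates an entry; returns the evicted entry when the
    /// entry count would otherwise exceed `capacity`.
    @discardableResult
    mutating func put(_ key: Key, _ value: Value) -> (key: Key, value: Value)? {
        values[key] = value
        touch(key)
        guard order.count > capacity else { return nil }
        let victim = order.removeFirst()
        guard let evicted = values.removeValue(forKey: victim) else { return nil }
        return (victim, evicted)
    }

    private mutating func touch(_ key: Key) {
        order.removeAll { $0 == key }
        order.append(key)
    }
}
```

Returning the evicted entry from `put` lets a caller hand it to a slower tier, which is one way the hot-to-cold handoff could be wired.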

Test plan

  • MacMLXCore: `swift build` + `swift test` (93/93, 3 skipped for metallib)
  • Xcode app: `xcodebuild -scheme macMLX -configuration Debug build`
  • Manual: load a small model, send same prompt twice → second turn shows `Prompt cache HIT` in Logs

🤖 Generated with Claude Code

5 tasks: PromptCacheKey, PromptCacheStore (hot+cold LRU), engine
wiring, Settings UI, CHANGELOG. MVP uses full-prompt hash;
block-level longest-common-prefix matching deferred to v0.4.0.1.

Actor-based two-tier prompt cache. Hot tier is an in-memory LRU dict
keyed by PromptCacheKey. Cold tier is safetensors files under
`root/<shard>/<hash>.safetensors` round-tripped via mlx-swift-lm's
savePromptCache / loadPromptCache. Eviction from hot persists to
cold; cold hits promote back into hot.
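
The eviction and promotion flow can be sketched roughly as follows. This is not the PR's `PromptCacheStore`: `Data` stands in for a KV cache snapshot, the cold tier writes raw `.bin` files instead of safetensors, keys are plain strings, and keeping the cold file after a promotion is an assumption.

```swift
import Foundation

/// Two-tier store sketch (hypothetical names; raw bytes instead of safetensors).
actor TwoTierStoreSketch {
    private let root: URL
    private let hotCapacity: Int
    private var hot: [String: Data] = [:]
    private var order: [String] = []   // least recently used first

    init(root: URL, hotCapacity: Int = 8) {
        self.root = root
        self.hotCapacity = hotCapacity
    }

    /// Stores in the hot tier; entries pushed out of hot are persisted to cold.
    func put(_ key: String, _ value: Data) throws {
        hot[key] = value
        touch(key)
        while order.count > hotCapacity {
            let victim = order.removeFirst()
            if let data = hot.removeValue(forKey: victim) {
                try writeCold(victim, data)
            }
        }
    }

    /// Hot hit returns directly; a cold hit is promoted back into hot.
    func get(_ key: String) throws -> Data? {
        if let value = hot[key] {
            touch(key)
            return value
        }
        let url = coldURL(key)
        guard FileManager.default.fileExists(atPath: url.path) else { return nil }
        let value = try Data(contentsOf: url)
        try put(key, value)   // promotion may in turn evict another entry
        return value
    }

    func isHot(_ key: String) -> Bool { hot[key] != nil }

    private func touch(_ key: String) {
        order.removeAll { $0 == key }
        order.append(key)
    }

    /// Assumed layout: root/<first character of key>/<key>.bin
    private func coldURL(_ key: String) -> URL {
        root.appendingPathComponent(String(key.prefix(1)))
            .appendingPathComponent("\(key).bin")
    }

    private func writeCold(_ key: String, _ data: Data) throws {
        let url = coldURL(key)
        try FileManager.default.createDirectory(
            at: url.deletingLastPathComponent(),
            withIntermediateDirectories: true)
        try data.write(to: url)
    }
}
```

Because all state lives inside the actor, callers reach it with `await` and never see a half-updated hot tier.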

Introduces PromptCacheSnapshot — an @unchecked Sendable wrapper for
[any KVCache] so snapshots can cross the actor isolation boundary
(mlx-swift-lm's KVCache has no Sendable conformance).
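
A minimal sketch of the wrapper idea, using a hypothetical `KVCacheLike` protocol in place of mlx-swift-lm's `KVCache` so the example stands alone. The names `ToyCache`, `SnapshotSketch`, and `SnapshotHolder` are invented for illustration.

```swift
/// Hypothetical stand-in for a cache protocol that has no Sendable conformance.
protocol KVCacheLike: AnyObject {
    var offset: Int { get }
}

final class ToyCache: KVCacheLike {
    var offset: Int
    init(offset: Int) { self.offset = offset }
}

/// `@unchecked Sendable` disables the compiler's checking for this type, so
/// the caller takes on the obligation not to mutate the wrapped caches while
/// another task or actor still holds the snapshot.
struct SnapshotSketch: @unchecked Sendable {
    let caches: [any KVCacheLike]
}

/// Toy actor showing the snapshot crossing an isolation boundary in both directions.
actor SnapshotHolder {
    private var stored: SnapshotSketch?
    func store(_ snapshot: SnapshotSketch) { stored = snapshot }
    func load() -> SnapshotSketch? { stored }
}
```

The trade-off is that safety becomes a convention rather than a compiler guarantee, which is why the wrapper is best kept narrow and immutable.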

Tests cover put/get hot hit, hot->cold eviction, cold->hot restore,
and miss-returns-nil. The three MLX-dependent tests skip when
default.metallib is not in the test bundle (standard SPM test
binaries often lack it); the miss-path test runs unconditionally.

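
One way such a graceful skip could look with XCTest is sketched below. The helper `hasDefaultMetallib(in:)` and the directory it inspects are assumptions; where MLX actually searches for `default.metallib`, and how the PR's tests detect it, is not stated here.

```swift
import Foundation
import XCTest

/// Hypothetical helper: true when `default.metallib` sits directly in `directory`.
func hasDefaultMetallib(in directory: URL) -> Bool {
    let candidate = directory.appendingPathComponent("default.metallib")
    return FileManager.default.fileExists(atPath: candidate.path)
}

final class MetalDependentSketchTests: XCTestCase {
    func testRoundTripNeedsMetal() throws {
        // Assumption: the library would live in the test bundle's resources.
        let bundle = Bundle(for: type(of: self))
        let directory = bundle.resourceURL ?? bundle.bundleURL
        try XCTSkipUnless(
            hasDefaultMetallib(in: directory),
            "default.metallib not found; skipping Metal-dependent test")
        // Metal-backed put/get assertions would go here.
    }
}
```

A skipped test is reported as skipped rather than failed, which matches the "3 skipped" figure in the test plan.
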
On each generate call, hash the full input token sequence, look up
a prior cache snapshot, and pass it to the token iterator so the
shared prefix prefill is skipped. Save the extended snapshot after
generation completes so the next turn benefits.

MVP keys on exact-prefix match; vLLM-style block hashing with
longest-common-prefix matching is v0.4.1+.
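
The lookup, generate, save sequence can be sketched with toy types. Nothing below is the real `MLXSwiftEngine` code: `ToySnapshot` only counts cached tokens, `ToyStore` keys on the raw token array instead of a hash, the decode step is a caller-supplied stub, and saving under the input prompt's key is an assumption.

```swift
/// Hypothetical stand-in for a KV cache snapshot: records only how many tokens it covers.
struct ToySnapshot {
    var cachedTokens: Int
}

/// Hypothetical store keyed by the exact prompt tokens.
final class ToyStore {
    private var entries: [[Int]: ToySnapshot] = [:]
    func get(_ prompt: [Int]) -> ToySnapshot? { entries[prompt] }
    func put(_ prompt: [Int], _ snapshot: ToySnapshot) { entries[prompt] = snapshot }
}

struct ToyResult {
    let cacheHit: Bool
    let prefilled: Int   // prompt tokens that still had to be processed
    let output: [Int]
}

func generateSketch(prompt: [Int], store: ToyStore, decode: ([Int]) -> [Int]) -> ToyResult {
    // 1. Look up a prior snapshot for this exact prompt.
    let prior = store.get(prompt)
    // 2. Tokens already covered by the snapshot need no prefill.
    let reusable = min(prior?.cachedTokens ?? 0, prompt.count)
    let prefilled = prompt.count - reusable
    // 3. Run the stubbed decode loop.
    let output = decode(prompt)
    // 4. Save the extended snapshot (prompt plus generated tokens) for the next turn.
    store.put(prompt, ToySnapshot(cachedTokens: prompt.count + output.count))
    return ToyResult(cacheHit: prior != nil, prefilled: prefilled, output: output)
}
```

Calling `generateSketch` twice with the same prompt reports a miss with a full prefill on the first call and a hit with zero prefill on the second, mirroring the manual check in the test plan.
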

@magicnight magicnight merged commit ace00ba into main Apr 18, 2026
2 checks passed
@magicnight magicnight deleted the feat/v0.4-kv-cache-tiering branch April 18, 2026 14:59