fix(server): record TTFT for non-streaming completions + align /metrics/cache docs (#141) #142
Merged
…etrics/cache docs to actual shape

The TTFT rolling ring-buffer (ttft_store) was only written by the streaming generation paths. Non-streaming completions (generate_blocking in both OpenAI and Anthropic routes) measured TTFT and emitted it to SQLite, but never wrote to the in-memory ring that GET /metrics/cache reads. Result: the ttft array was always empty after non-streaming requests.

Fix: add ttft_store as a parameter to both generate_blocking functions and write the ring at first-token time, mirroring the streaming path exactly.

Also rewrites the GET /metrics/cache section in docs/SERVER.md to match the actual response shape: .models[] (not .prompt_cache), .itl (not .last_itl), per-model cache fields documented with types, ttft described as populated by both streaming and non-streaming paths, annotated example response included.

Tests: three new unit tests in generate_tests.rs covering ring population, no write on engine error, and ring capacity eviction.
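The recording gap and its fix can be sketched in a few lines of Rust. This is a minimal illustration, not the code from the PR: the internals of `TtftStore` (a mutex-guarded `VecDeque` with a fixed capacity), the token iterator, and the `generate_blocking` signature shown here are all assumptions made for the example. It shows the three behaviours the new tests are described as covering: the ring is written once at first-token time, it is left untouched when the engine fails before producing a token, and the oldest sample is evicted once capacity is reached.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the server's TTFT ring: a bounded FIFO of samples.
pub struct TtftStore {
    capacity: usize,
    samples: Mutex<VecDeque<Duration>>,
}

impl TtftStore {
    pub fn new(capacity: usize) -> Self {
        Self {
            capacity,
            samples: Mutex::new(VecDeque::with_capacity(capacity)),
        }
    }

    /// Push one sample, evicting the oldest one when the ring is full.
    pub fn push(&self, sample: Duration) {
        if self.capacity == 0 {
            return;
        }
        let mut ring = self.samples.lock().unwrap();
        if ring.len() == self.capacity {
            ring.pop_front();
        }
        ring.push_back(sample);
    }

    /// Copy out the current samples, oldest first.
    pub fn snapshot(&self) -> Vec<Duration> {
        self.samples.lock().unwrap().iter().copied().collect()
    }
}

/// Hypothetical blocking generation loop. `tokens` stands in for the engine's
/// output; the real function signature in the server is not shown in the PR text.
pub fn generate_blocking<I>(tokens: I, ttft_store: &TtftStore) -> Result<String, String>
where
    I: IntoIterator<Item = Result<String, String>>,
{
    let start = Instant::now();
    let mut out = String::new();
    let mut first_token_seen = false;

    for token in tokens {
        // An engine error before the first token returns here, so the ring
        // is never written on that path.
        let token = token?;
        if !first_token_seen {
            first_token_seen = true;
            // Record TTFT exactly once, at first-token time.
            ttft_store.push(start.elapsed());
        }
        out.push_str(&token);
    }

    // An empty stream falls through without a ring write.
    Ok(out)
}
```

With a store built as `TtftStore::new(2)`, one successful call leaves a single sample in `snapshot()`, while a call whose first item is an `Err` or whose stream is empty leaves the ring unchanged.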
Summary
Fixes #141.
`GET /metrics/cache` had two problems:

1. `.ttft` ring empty for non-streaming completions. Root cause was a recording gap: both non-streaming paths (`generate_blocking`, OpenAI + Anthropic) measured TTFT and wrote it to SQLite, but never pushed it into the in-memory `ttft_store` ring; only `generate_streaming` did. So the `/metrics/cache` `ttft` array stayed `[]` no matter how many non-streaming requests ran. Fix: thread `ttft_store: &TtftStore` into both blocking paths and push the first-token sample, mirroring the streaming path (gated on first-token only, never on the error/empty-stream path).
2. Docs diverged from the real response shape. `docs/SERVER.md` promised `prompt_cache` / `last_itl` keys that the endpoint never emitted. Rewrote the `GET /metrics/cache` section to document the actual shape: `models[]`, `itl`, `ttft`, `tokens_in`, `tokens_out`, `error_counts`.

Changes
- `crates/rmlx-server/src/openai/generate.rs`: `ttft_store` param + ring write at first-token in `generate_blocking`.
- `crates/rmlx-server/src/openai/chat.rs`: pass `&state.ttft_store` at call site.
- `crates/rmlx-server/src/anthropic/blocking.rs`: same ring write for Anthropic non-streaming.
- `crates/rmlx-server/src/anthropic/route.rs`: pass `&state.ttft_store` at call site.
- `crates/rmlx-server/src/openai/generate_tests.rs`: new sibling tests (ring populated on success, untouched on error, capacity eviction).
- `docs/SERVER.md`: `GET /metrics/cache` section rewritten to the real shape.

Verification
- `make lint`: clean (`--workspace --all-targets --all-features -D warnings`).
- `cargo test -p rmlx-server`: 551 passed (+3 new), 19 ignored.
- `/metrics/cache` `.ttft` length = 3 (was 0 for non-streaming before this fix); top-level keys are exactly `{models, itl, ttft, tokens_in, tokens_out, error_counts}`, matching the docs with no stray or missing keys.

🤖 Generated with Claude Code
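The "no stray/missing keys" check from the verification list could be automated along the following lines. This is a sketch under assumptions: the helper name `diff_top_level_keys` and the `KeyDiff` type are hypothetical, and extracting the top-level keys from the JSON response (for example with a JSON parser) is left out so the example stays standard-library only. Only the six expected key names come from the PR text.

```rust
use std::collections::BTreeSet;

/// Top-level keys the PR says GET /metrics/cache emits.
const EXPECTED_KEYS: [&str; 6] = [
    "models",
    "itl",
    "ttft",
    "tokens_in",
    "tokens_out",
    "error_counts",
];

/// Hypothetical result type: keys the docs promise but the response lacks,
/// and keys the response has but the docs do not mention.
#[derive(Debug, PartialEq, Eq)]
pub struct KeyDiff {
    pub missing: Vec<String>,
    pub stray: Vec<String>,
}

/// Compare the observed top-level keys against the documented set.
/// Both lists come back in sorted order because they are built from BTreeSets.
pub fn diff_top_level_keys<'a>(actual: impl IntoIterator<Item = &'a str>) -> KeyDiff {
    let expected: BTreeSet<&str> = EXPECTED_KEYS.iter().copied().collect();
    let actual: BTreeSet<&str> = actual.into_iter().collect();
    KeyDiff {
        missing: expected.difference(&actual).map(|k| k.to_string()).collect(),
        stray: actual.difference(&expected).map(|k| k.to_string()).collect(),
    }
}
```

A response that still used the old documented names would report `itl` and `models` as missing and `last_itl` and `prompt_cache` as stray, which is the divergence the docs rewrite removes.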