fix(server): record TTFT for non-streaming completions + align /metrics/cache docs (#141) #142

Merged
Pushkinist merged 1 commit into main from fix/141-metrics-cache-docs
Jun 18, 2026

@Pushkinist

Owner

Summary

Fixes #141.

GET /metrics/cache had two problems:

  1. .ttft ring empty for non-streaming completions. Root cause was a recording gap: both non-streaming paths (generate_blocking, OpenAI + Anthropic) measured TTFT and wrote it to SQLite, but never pushed it into the in-memory ttft_store ring — only generate_streaming did. So the /metrics/cache ttft array stayed [] no matter how many non-streaming requests ran. Fix: thread ttft_store: &TtftStore into both blocking paths and push the first-token sample, mirroring the streaming path (gated on first-token only, never on the error/empty-stream path).

  2. Docs diverged from the real response shape. docs/SERVER.md promised prompt_cache / last_itl keys that the endpoint never emitted. Rewrote the GET /metrics/cache section to document the actual shape: models[], itl, ttft, tokens_in, tokens_out, error_counts.
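
A minimal sketch of the first fix, under stated assumptions: the `TtftStore` below is a hypothetical stand-in (a mutex-guarded list with no capacity handling), and this `generate_blocking` takes a plain iterator of token results rather than the server's real engine types. It only illustrates the gating described above: the sample is pushed once, at the first successful token, and never on the error or empty-stream path.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the server's in-memory TTFT store.
/// Interior mutability is assumed because the PR passes it as `&TtftStore`.
pub struct TtftStore {
    samples: Mutex<Vec<Duration>>,
}

impl TtftStore {
    pub fn new() -> Self {
        Self { samples: Mutex::new(Vec::new()) }
    }

    /// Record one time-to-first-token sample.
    pub fn push(&self, ttft: Duration) {
        self.samples.lock().unwrap().push(ttft);
    }

    /// Number of samples currently held.
    pub fn len(&self) -> usize {
        self.samples.lock().unwrap().len()
    }
}

/// Illustrative blocking path: drains all tokens, then returns the full text.
/// The signature is an assumption, not the crate's actual one.
pub fn generate_blocking<I>(tokens: I, ttft_store: &TtftStore) -> Result<String, String>
where
    I: IntoIterator<Item = Result<String, String>>,
{
    let started = Instant::now();
    let mut text = String::new();
    let mut first_token_seen = false;

    for token in tokens {
        // An engine error returns here; if no token has arrived yet,
        // nothing has been written to the store.
        let token = token?;
        if !first_token_seen {
            first_token_seen = true;
            // Written exactly once, at first-token time.
            ttft_store.push(started.elapsed());
        }
        text.push_str(&token);
    }

    // An empty stream reaches this point without ever pushing a sample.
    Ok(text)
}
```

In this sketch a failed or empty request leaves the store untouched, so the number of samples equals the number of requests that produced at least one token.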

Changes

  • crates/rmlx-server/src/openai/generate.rs: ttft_store param + ring write at first-token in generate_blocking.
  • crates/rmlx-server/src/openai/chat.rs: pass &state.ttft_store at call site.
  • crates/rmlx-server/src/anthropic/blocking.rs: same ring write for Anthropic non-streaming.
  • crates/rmlx-server/src/anthropic/route.rs: pass &state.ttft_store at call site.
  • crates/rmlx-server/src/openai/generate_tests.rs: new sibling tests (ring populated on success, untouched on error, capacity eviction).
  • docs/SERVER.md: GET /metrics/cache section rewritten to the real shape.

Verification

  • make lint — clean (--workspace --all-targets --all-features -D warnings).
  • cargo test -p rmlx-server — 551 passed (+3 new), 19 ignored.
  • Real-model proof (gemma-4-e2b-it-mxfp8, live serve): after 2 non-streaming + 1 streaming completion, /metrics/cache .ttft length = 3 (was 0 for non-streaming before this fix); top-level keys exactly {models, itl, ttft, tokens_in, tokens_out, error_counts} — match docs, no stray/missing keys.
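
The "no stray/missing keys" check can be pictured as a set comparison against the documented top-level keys. The PR does not say how the check was actually performed; `diff_keys` below is a hypothetical helper, and only the six key names come from the description above.

```rust
use std::collections::BTreeSet;

/// Top-level keys the rewritten docs promise for GET /metrics/cache.
const DOCUMENTED_KEYS: [&str; 6] =
    ["models", "itl", "ttft", "tokens_in", "tokens_out", "error_counts"];

/// Hypothetical helper: returns `(missing, stray)` key names, each sorted.
/// `missing` = documented but absent; `stray` = present but undocumented.
fn diff_keys<'a>(actual: &[&'a str]) -> (Vec<&'static str>, Vec<&'a str>) {
    let documented: BTreeSet<&'static str> = DOCUMENTED_KEYS.iter().copied().collect();
    let seen: BTreeSet<&'a str> = actual.iter().copied().collect();

    let missing = documented
        .iter()
        .copied()
        .filter(|k| !seen.contains(*k))
        .collect();
    let stray = seen
        .iter()
        .copied()
        .filter(|k| !documented.contains(*k))
        .collect();

    (missing, stray)
}
```

Both lists come back empty only when the response keys match the documented set exactly. A response using the old documented names (`prompt_cache`, `last_itl`) would show up in `stray`.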

🤖 Generated with Claude Code

…etrics/cache docs to actual shape

The TTFT rolling ring-buffer (ttft_store) was only written by the
streaming generation paths. Non-streaming completions (generate_blocking
in both OpenAI and Anthropic routes) measured TTFT and emitted it to
SQLite, but never wrote to the in-memory ring that GET /metrics/cache reads.
Result: ttft array was always empty after non-streaming requests.

Fix: add ttft_store as a parameter to both generate_blocking functions and
write the ring at first-token time, mirroring the streaming path exactly.

Also rewrites the GET /metrics/cache section in docs/SERVER.md to match
the actual response shape: .models[] (not .prompt_cache), .itl (not
.last_itl), per-model cache fields documented with types, ttft described
as populated by both streaming and non-streaming paths, annotated example
response included.

Tests: three new unit tests in generate_tests.rs covering ring population,
no write on engine error, and ring capacity eviction.
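
The capacity-eviction behavior covered by the third test can be sketched as follows. This is an illustration only: `push_bounded`, the millisecond `u64` samples, and the capacity value are assumptions, since the commit message does not show the ring's real type or size.

```rust
use std::collections::VecDeque;

/// Hypothetical bounded push: once `capacity` samples are held,
/// the oldest one is dropped to make room for the new one.
fn push_bounded(ring: &mut VecDeque<u64>, capacity: usize, sample_ms: u64) {
    if capacity == 0 {
        // A zero-capacity ring holds nothing.
        return;
    }
    while ring.len() >= capacity {
        ring.pop_front();
    }
    ring.push_back(sample_ms);
}
```

With a capacity of 3, pushing five samples leaves only the three most recent, oldest first.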
@Pushkinist Pushkinist self-assigned this Jun 18, 2026
@Pushkinist Pushkinist merged commit 943f999 into main Jun 18, 2026
2 checks passed
@Pushkinist Pushkinist deleted the fix/141-metrics-cache-docs branch June 18, 2026 14:09

Development

Successfully merging this pull request may close these issues.

/metrics/cache: response shape diverges from docs (.models[] not .prompt_cache; itl not last_itl); .ttft ring empty for non-streaming

1 participant