fix(server): record TTFT for non-streaming completions + align /metrics/cache docs (#141) by Pushkinist · Pull Request #142 · Pushkinist/rMLX

Pushkinist · 2026-06-18T14:04:23Z

Summary

Fixes #141.

GET /metrics/cache had two problems:

.ttft ring empty for non-streaming completions. Root cause was a recording gap: both non-streaming paths (generate_blocking, OpenAI + Anthropic) measured TTFT and wrote it to SQLite, but never pushed it into the in-memory ttft_store ring — only generate_streaming did. So the /metrics/cache ttft array stayed [] no matter how many non-streaming requests ran. Fix: thread ttft_store: &TtftStore into both blocking paths and push the first-token sample, mirroring the streaming path (gated on first-token only, never on the error/empty-stream path).
Docs diverged from the real response shape. docs/SERVER.md promised prompt_cache / last_itl keys that the endpoint never emitted. Rewrote the GET /metrics/cache section to document the actual shape: models[], itl, ttft, tokens_in, tokens_out, error_counts.
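
To make the first-token gating concrete, here is a minimal sketch of the pattern described above. It is not the actual rmlx-server code: the `TtftStore` layout (a mutex-guarded `VecDeque` of millisecond samples), its `new`/`push`/`snapshot` methods, and the `generate_blocking_sketch` function with its token iterator are all assumptions made for illustration. Only the idea comes from this PR: pass `&TtftStore` into the blocking path and push exactly one sample when the first token arrives.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;
use std::time::Instant;

/// Hypothetical stand-in for the real `TtftStore`: a bounded ring of
/// TTFT samples in milliseconds. The real type's layout is not shown in this PR.
pub struct TtftStore {
    capacity: usize,
    samples: Mutex<VecDeque<f64>>,
}

impl TtftStore {
    pub fn new(capacity: usize) -> Self {
        Self {
            capacity,
            samples: Mutex::new(VecDeque::with_capacity(capacity)),
        }
    }

    /// Push one sample, evicting the oldest once the ring is full.
    pub fn push(&self, ttft_ms: f64) {
        if self.capacity == 0 {
            return;
        }
        let mut ring = self.samples.lock().unwrap();
        if ring.len() == self.capacity {
            ring.pop_front();
        }
        ring.push_back(ttft_ms);
    }

    /// Copy the current samples out, oldest first.
    pub fn snapshot(&self) -> Vec<f64> {
        self.samples.lock().unwrap().iter().copied().collect()
    }
}

/// Hypothetical blocking generation loop. `tokens` stands in for the engine:
/// each item is either a decoded token or an engine error.
pub fn generate_blocking_sketch<I>(tokens: I, ttft_store: &TtftStore) -> Result<String, String>
where
    I: IntoIterator<Item = Result<String, String>>,
{
    let started = Instant::now();
    let mut first_token_seen = false;
    let mut output = String::new();

    for token in tokens {
        // An error before the first token returns here, so the ring is untouched.
        let token = token?;
        if !first_token_seen {
            first_token_seen = true;
            // Written once per request, at first-token time only.
            ttft_store.push(started.elapsed().as_secs_f64() * 1000.0);
        }
        output.push_str(&token);
    }

    // An empty stream never enters the loop body, so it records no sample either.
    Ok(output)
}
```

In this sketch an engine error that occurs after the first token still leaves the already-recorded sample in the ring; whether the real implementation behaves the same way is not stated in this PR.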

Changes

crates/rmlx-server/src/openai/generate.rs — ttft_store param + ring write at first-token in generate_blocking.
crates/rmlx-server/src/openai/chat.rs — pass &state.ttft_store at call site.
crates/rmlx-server/src/anthropic/blocking.rs — same ring write for Anthropic non-streaming.
crates/rmlx-server/src/anthropic/route.rs — pass &state.ttft_store at call site.
crates/rmlx-server/src/openai/generate_tests.rs — new sibling tests (ring populated on success, untouched on error, capacity eviction).
docs/SERVER.md — GET /metrics/cache section rewritten to the real shape.

Verification

make lint — clean (--workspace --all-targets --all-features -D warnings).
cargo test -p rmlx-server — 551 passed (+3 new), 19 ignored.
Real-model proof (gemma-4-e2b-it-mxfp8, live serve): after 2 non-streaming + 1 streaming completion, /metrics/cache .ttft length = 3 (was 0 for non-streaming before this fix); top-level keys exactly {models, itl, ttft, tokens_in, tokens_out, error_counts} — match docs, no stray/missing keys.
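
The "no stray/missing keys" check can be expressed as a set comparison. The sketch below is illustrative only: `DOCUMENTED_KEYS`, `KeyDiff`, and `diff_top_level_keys` are hypothetical names, and the caller is assumed to have already extracted the top-level keys from the JSON response. The key list itself is the one documented in this PR.

```rust
use std::collections::BTreeSet;

/// Top-level keys of GET /metrics/cache as documented in this PR.
pub const DOCUMENTED_KEYS: [&str; 6] = [
    "models",
    "itl",
    "ttft",
    "tokens_in",
    "tokens_out",
    "error_counts",
];

/// Hypothetical result type: keys the docs promise but the response lacks,
/// and keys the response carries but the docs do not mention.
pub struct KeyDiff {
    pub missing: Vec<String>,
    pub stray: Vec<String>,
}

/// Compare the keys found in a response against the documented set.
/// Both result lists come out sorted because `BTreeSet` iterates in order.
pub fn diff_top_level_keys(actual: &[&str]) -> KeyDiff {
    let expected: BTreeSet<&str> = DOCUMENTED_KEYS.iter().copied().collect();
    let actual: BTreeSet<&str> = actual.iter().copied().collect();
    KeyDiff {
        missing: expected.difference(&actual).map(|k| k.to_string()).collect(),
        stray: actual.difference(&expected).map(|k| k.to_string()).collect(),
    }
}
```

A response matches the docs when both `missing` and `stray` are empty.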

🤖 Generated with Claude Code

…etrics/cache docs to actual shape

The TTFT rolling ring buffer (ttft_store) was only written by the streaming generation paths. Non-streaming completions (generate_blocking in both OpenAI and Anthropic routes) measured TTFT and emitted it to SQLite, but never wrote to the in-memory ring that GET /metrics/cache reads. Result: the ttft array was always empty after non-streaming requests.

Fix: add ttft_store as a parameter to both generate_blocking functions and write the ring at first-token time, mirroring the streaming path exactly.

Also rewrites the GET /metrics/cache section in docs/SERVER.md to match the actual response shape: .models[] (not .prompt_cache), .itl (not .last_itl), per-model cache fields documented with types, ttft described as populated by both streaming and non-streaming paths, annotated example response included.

Tests: three new unit tests in generate_tests.rs covering ring population, no write on engine error, and ring capacity eviction.

Pushkinist self-assigned this Jun 18, 2026

Pushkinist merged commit 943f999 into main Jun 18, 2026
2 checks passed

Pushkinist deleted the fix/141-metrics-cache-docs branch June 18, 2026 14:09

Pushkinist merged 1 commit into main from fix/141-metrics-cache-docs