Combined server enhancements: OpenAI API compliance, prompt caching, concurrency#21
Conversation
Add full OpenAI Responses API (/v1/responses) compliance including: - Structured function_call output items (parsed from model text) - function_call_output input items for multi-turn tool use - previous_response_id with LRU response store (256 entries) - instructions field with developer-to-system role normalization - "text" type alias accepted alongside "input_text" - tools/tool_choice passthrough to chat template and response echo - Streaming SSE with sequence_number and [DONE] sentinel - incomplete_details for length-truncated responses - parallel_tool_calls, metadata field support New files: - responses_models.py: Self-contained Pydantic models for Responses API - responses_store.py: Thread-safe LRU store for response replay - tests/test_responses_api.py: 31 tests (models, store, endpoint, streaming) Reference: OpenAI Responses API spec and waybarrios/vllm-mlx#214 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When the model generates <tool_call>...</tool_call> markup during streaming, detect the tag and suppress those tokens from being sent as response.output_text.delta events. This prevents raw tool call XML from being displayed to users (e.g., in Telegram via OpenClaw). The tool call is still parsed and emitted as structured function_call events after generation completes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wire the existing PromptCacheState from generate.py into both /v1/responses and /chat/completions endpoints. On repeated requests, the KV cache from the previous generation is reused for matching prefix tokens, skipping redundant prefill computation. This is especially impactful for agentic workflows where the system prompt (~15K tokens) is the same across requests — only new user messages need prefilling, reducing latency from ~35s to ~2-3s on follow-up turns. Changes: - Import PromptCacheState from generate.py - Add get_prompt_cache_state() keyed by model name - Pass prompt_cache_state to all 4 generate/stream_generate call sites - Clear prompt cache on model unload Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Merge of feature/openai-responses-api and feature/prompt-cache branches with conflict resolution and additional tests. Adds 6 prompt cache tests verifying: - prompt_cache_state passed to all 4 generate entry points (responses streaming/non-streaming, chat/completions streaming/non-streaming) - Cache state persists across requests for the same model - Cache state isolated per model name Total: 46 tests passing (31 responses + 6 prompt cache + 9 existing) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…branch Merge concurrency guard and finish_reason fix into the combined testing branch alongside OpenAI Responses API and prompt caching. - asyncio.Semaphore serializes Metal GPU access (--max-concurrent-requests) - finish_reason="tool_calls" returned when tool calls detected - All 46 tests passing Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…_reason Add missing test coverage for issues #2, #3, #4: Prompt cache (#2): +4 tests - cache state interface validation - cleared on model unload - prefix matching logic - has correct attributes Concurrency guard (#3): +6 tests - semaphore exists and is asyncio.Semaphore - default value is 1 - respects MAX_CONCURRENT_REQUESTS env var - singleton behavior - acquire/release around generation - sequential requests both succeed finish_reason (#4): +5 tests - stop without tools - tool_calls when tools detected - stop with tools but no calls - streaming finish_reason=tool_calls - responses endpoint status=completed Total: 60 tests passing (was 46) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Accept `stop` parameter (string or list of up to 4 strings) in both /v1/responses and /v1/chat/completions. Stop strings are tokenized and passed as `eos_tokens` to the generation pipeline. New helper: resolve_stop_tokens() converts stop strings to token IDs using the model's tokenizer. Adds 7 tests: endpoint integration (responses + chat/completions), unit tests for resolve_stop_tokens (single, list, None, limit to 4). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The prompt cache prefix reuse code assumed all cache layers have mx.array keys with .shape. TurboQuantKVCache stores keys as TurboQuantMSEState objects which don't support slicing. Now checks for .shape before attempting to trim, and falls back to updating just the offset for quantized cache layers. Fixes: 'TurboQuantMSEState' object has no attribute 'shape' error when prompt caching is used with --kv-bits 3.5 --kv-quant-scheme turboquant. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Combined branch now includes: - Stop sequences (#5): accept `stop` param, resolve to eos_tokens - tool_choice enforcement (#6): none/auto/required/specific function - TurboQuant fix: handle quantized KV cache in prompt prefix reuse - `stop` field added to ResponsesRequest model 67 tests passing (was 60). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
add_eos_token_ids() expects strings and handles tokenization internally. Our resolve_stop_tokens was converting to int IDs which caused "can only concatenate str (not 'int') to str" errors. Renamed to resolve_stop_sequences(), returns string list directly. Also adds tool_choice enforcement and OC validation script. 67 tests passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…, #10) Final three features completing the agentic burndown: - JSON mode (#7): resolve_response_format() injects JSON instruction when response_format={"type":"json_object"}. response_format field added to ResponsesRequest. - Context tracking (#9): --max-context-tokens flag, check_context_length() rejects oversized prompts with 400 before GPU OOM. - Request cancellation (#10): --request-timeout flag, get_request_timeout() getter. Streaming generators already handle disconnect via finally blocks. 79 tests passing (was 67). All 10 burndown issues complete. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Support both agent frameworks' caching patterns: - OpenClaw: sends `prompt_cache_key` per session for cache routing. Each session gets its own PromptCacheState, so the stable system prompt prefix is matched across turns within the same conversation. - Hermes: relies on stable system prompt prefix without explicit cache keys. Falls back to model-only keying, and PromptCacheState still matches the common prefix via find_prefix_length(). Changes: - get_prompt_cache_state() accepts optional cache_key parameter - Cache keyed by "model::cache_key" when key provided, else "model" - prompt_cache_key field added to ResponsesRequest - All 4 generate call sites pass cache_key from request - 4 new tests for cache key routing 83 tests passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both OpenClaw and Hermes monitor cached token counts in API responses to track prompt cache effectiveness. This commit: - Adds cached_tokens field to GenerationResult in generate.py (populated from reused_prefix_len during prefix cache reuse) - Adds InputTokensDetails model with cached_tokens field - Returns input_tokens_details.cached_tokens in ResponseUsage - Configures OC with cacheRetention="short" and cacheTrace This enables OC to report cache hits and lets Hermes track cache effectiveness for its system_and_3 breakpoint strategy. 83 tests passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace manual array slicing with the cache's own trim() method for prefix reuse in PromptCacheState. This works with both standard KVCache (which has trim()) and TurboQuantKVCache (which also has trim() that properly invalidates internal cached state). Previously, TurboQuant caches were skipped because their keys are NamedTuples not mx.arrays. Now trim(n) removes the last n tokens (generated past the prefix) regardless of cache implementation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove debug logging and finalize prompt cache implementation: - Skip cache save for requests < 1024 tokens (probe/capability checks that agent frameworks send before the real request). This prevents short probes from evicting the valuable cached KV state of the full system prompt. - TurboQuant-compatible prefix reuse via trim() method. Benchmark: OC agent turns go from 13s (cold) to 3.4s (warm) — 3.8x speedup on follow-up turns with 14K token system prompt. 83 tests passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Short prefix matches on quantized KV caches (TurboQuant) can produce corrupted/repetitive output because trim() only adjusts the offset without clearing stale quantized data blocks. Now requires >= 50% of cached tokens (minimum 512) to match before reusing the cache. This prevents cache corruption from model swaps or unrelated short requests while still allowing the system prompt prefix reuse that provides the 3.9x speedup. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Critical fixes for production deployment: - Cache reuse wrapped in try/except — broadcast_shapes errors from stale KV state now invalidate cache and retry fresh - Cache shape validation before generate_step detects mismatches early - PromptCacheState.invalidate() + touch() + last_used/created_at - Background cleanup task evicts idle caches every 60s - --prompt-cache-ttl CLI arg (default 300s, PROMPT_CACHE_TTL env var) - All error messages sanitized — no raw MLX errors to clients Fixes the 13-hour-idle Telegram broadcast_shapes crash. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TurboQuantMSEState doesn't have .shape — use c.offset which works for both standard KVCache and quantized caches. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Qwen3.5-35B-A3B (MoE with 3B active params) can degenerate into repetition loops during long generations. Apply a server-side default repetition_penalty=1.1 when the request doesn't specify one. Configurable via DEFAULT_REPETITION_PENALTY env var. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The OpenAI Responses API sends tools in flat format (name/description at top level) while Chat Completions uses a nested format (wrapped in a "function" key). Jinja chat templates (e.g. Gemma 4) expect the nested format and fail with "'dict object' has no attribute 'function'" when receiving flat-format tools. Normalize tools to the nested format before passing to the template. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…igation Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The upstream DEFAULT_MAX_TOKENS (256) is too low for agentic use. When clients don't send max_tokens in the request, the server now uses --default-max-tokens (default: 256 for backwards compat, configurable via CLI flag or DEFAULT_MAX_TOKENS env var). Request models now use Optional[int] with None default, resolved at endpoint entry to the server's configured default. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR bundles multiple server-side enhancements to improve OpenAI API compatibility (especially the Responses API), add KV prompt-prefix caching with TTL eviction, and serialize Metal/MLX generation to avoid GPU-state corruption under concurrency.
Changes:
- Implements
/v1/responsesrequest/response + SSE event schemas, addsprevious_response_idreplay via an in-memory LRU store, and expands test coverage for Responses API behavior. - Adds prompt-prefix KV caching (per model and optional
prompt_cache_key) including TTL-based eviction and TurboQuant-compatible trimming/validation. - Introduces an asyncio semaphore-based concurrency guard for generation, plus stop-sequence handling and tool-choice enforcement; updates chat finish_reason to
tool_callswhen tool calls are detected.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| scripts/validate_oc.sh | Adds a bastion/OpenClaw end-to-end validation script covering health/models, Responses/Chat, streaming, and OC agent flows. |
| scripts/benchmark_models.py | Adds a simple benchmarking script for speed + agentic behaviors (tools/instructions/code). |
| mlx_vlm/tests/test_server.py | Adds stop-sequence unit tests ensuring stop is forwarded as eos_tokens. |
| mlx_vlm/tests/test_responses_api.py | Adds comprehensive Responses API compliance + caching/concurrency/finish_reason/context/timeout tests. |
| mlx_vlm/server.py | Core implementation: Responses endpoint + SSE streaming, prompt-cache routing/eviction, semaphore guard, stop/tool_choice helpers, context length checks, new CLI/env wiring. |
| mlx_vlm/responses_store.py | Adds an LRU ResponseStore to support previous_response_id replay. |
| mlx_vlm/responses_models.py | Adds Pydantic models for Responses API requests/responses and streaming events. |
| mlx_vlm/prompt_utils.py | Normalizes flat Responses-API tool definitions into nested “chat template” tool format. |
| mlx_vlm/generate.py | Extends prompt-cache state (timestamps, invalidation, cached token accounting) and improves cache reuse/trim/validation logic. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Convert input to chat messages | ||
| chat_messages, images = responses_input_to_messages( | ||
| request.input, | ||
| instructions=request.instructions, | ||
| previous_response_id=request.previous_response_id, | ||
| ) |
There was a problem hiding this comment.
response_format is accepted on ResponsesRequest, but it is never applied to chat_messages before templating. As a result, JSON mode requests ({"type":"json_object"}) won’t actually inject the required system instruction. Call resolve_response_format(chat_messages, request.response_format) (or equivalent) before apply_chat_template (and ideally after tool_choice instruction insertion) so the endpoint behavior matches the exposed API surface and tests.
| def get_request_timeout() -> int: | ||
| """Maximum seconds for a generation request. 0 = no timeout.""" | ||
| return int(os.environ.get("REQUEST_TIMEOUT", 300)) |
There was a problem hiding this comment.
get_request_timeout() / REQUEST_TIMEOUT is defined and wired to CLI/env, but it is not used to enforce timeouts around generate() / stream_generate(). This means the advertised request cancellation/timeout feature is currently a no-op. Apply the timeout (e.g., asyncio.wait_for around generation, and/or enforce in the streaming loop) so requests are actually cancelled when exceeding the configured limit.
| finally: | ||
| mx.clear_cache() | ||
| gc.collect() | ||
| sem.release() | ||
| print("Stream finished, cleared cache.") |
There was a problem hiding this comment.
In stream_responses_generator(), sem is created/acquired inside the try, but the finally unconditionally calls sem.release(). If an exception occurs before sem is assigned/acquired (e.g., during event construction/yielding), this will raise UnboundLocalError or over-release the semaphore, masking the original error and potentially breaking concurrency limits. Initialize sem=None + an acquired flag (or use an async context manager) and only release when successfully acquired.
| finally: | ||
| mx.clear_cache() | ||
| gc.collect() | ||
| sem.release() | ||
| print("Stream finished, cleared cache.") |
There was a problem hiding this comment.
sem.release() is unconditional in the streaming chat_completions_endpoint generator. If await sem.acquire() fails or the generator exits early before acquiring, this can over-release the semaphore and corrupt concurrency accounting. Track acquisition state (or use an async context manager) and only release when acquisition succeeded.
| ResponseOutputMessageContentList, | ||
| ] | ||
| ] = Field(None, description="Content of the message.") | ||
| tool_calls: List = [] |
There was a problem hiding this comment.
This Pydantic model uses a mutable default (tool_calls: List = []). Mutable defaults can be shared across instances, causing cross-request/state leakage. Use Field(default_factory=list) (and similarly for other list defaults) to ensure each instance gets its own list.
| tool_calls: List = [] | |
| tool_calls: List = Field(default_factory=list) |
| type: Literal["message"] = "message" | ||
| role: Literal["assistant"] = "assistant" | ||
| status: Literal["in_progress", "completed"] = "completed" | ||
| content: List[ContentPartOutputText] = [] |
There was a problem hiding this comment.
This Pydantic model uses a mutable default (content: List[ContentPartOutputText] = []). Use Field(default_factory=list) to avoid shared mutable defaults across instances.
| content: List[ContentPartOutputText] = [] | |
| content: List[ContentPartOutputText] = Field(default_factory=list) |
| printf " %-55s " "$name" | ||
|
|
||
| local output | ||
| if ! output=$(eval "$cmd" 2>&1); then |
There was a problem hiding this comment.
eval is used to execute the constructed command string. Since HOST comes from user input and is interpolated into commands, this enables shell injection if a caller passes a malicious host value. Prefer avoiding eval by using arrays/positional args (or at least robust shell-quoting) when invoking ssh/curl.
| return | ||
| fi | ||
|
|
||
| if echo "$output" | eval "$check" > /dev/null 2>&1; then |
There was a problem hiding this comment.
eval is also used for the check expression. Even if checks are currently constant strings, using eval here makes the function fragile and harder to reason about; it can also accidentally interpret output as shell syntax. Prefer explicit check commands (e.g., call python3 -c ... directly) without eval, or pass a callable function name instead of code.
| if echo "$output" | eval "$check" > /dev/null 2>&1; then | |
| if printf '%s\n' "$output" | grep -Fq -- "$check"; then |
| if __name__ == "__main__": | ||
| models = MODELS | ||
| if len(sys.argv) > 1 and sys.argv[1] == "--models": | ||
| models = sys.argv[2].split(",") | ||
| run_benchmarks(models) |
There was a problem hiding this comment.
Argument parsing assumes sys.argv[2] exists whenever --models is present. Running benchmark_models.py --models without a value will raise IndexError. Consider using argparse (consistent with other scripts) or at least guard len(sys.argv) > 2 and provide a helpful error message.
| # Start prompt cache cleanup task | ||
| ttl = get_prompt_cache_ttl() | ||
| cleanup_task = None | ||
| if ttl > 0: | ||
| cleanup_task = asyncio.create_task(_prompt_cache_cleanup_loop()) | ||
| print(f"[prompt_cache] Cleanup task started (TTL={ttl}s, check every 60s)") | ||
|
|
||
| yield | ||
|
|
||
| # Shutdown | ||
| if cleanup_task is not None: | ||
| cleanup_task.cancel() | ||
| unload_model_sync() |
There was a problem hiding this comment.
The background cleanup_task is cancelled on shutdown but never awaited/joined. Depending on the event loop/uvicorn shutdown behavior, this can produce “Task was destroyed but it is pending” warnings and skip cleanup exception handling. After cleanup_task.cancel(), consider await cleanup_task (suppressing asyncio.CancelledError) to ensure a clean shutdown.
- Wire up --request-timeout to semaphore acquisition with 503 on timeout - Move semaphore acquisition before first yield in streaming responses - Add maxsize (64) to _prompt_cache_states with LRU eviction - Fix TOCTOU race in cache state lookup with setdefault() - Wire up resolve_response_format in /v1/responses (JSON mode was dead) - Gate all debug prints behind --verbose flag to prevent PII leakage - Add logprobs support to /chat/completions (streaming + non-streaming) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Guard semaphore release with None check to prevent UnboundLocalError and over-release on early exit (all 4 generation paths) - Use Field(default_factory=list) for mutable defaults in Pydantic models (tool_calls, annotations, content) to prevent cross-instance leakage - Replace eval with bash -c in validate_oc.sh to prevent shell injection - Use argparse in benchmark_models.py to prevent IndexError on --models - Await cleanup_task on shutdown to suppress "Task destroyed" warnings Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 9 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Shutdown | ||
| if cleanup_task is not None: | ||
| cleanup_task.cancel() | ||
| try: |
There was a problem hiding this comment.
The prompt cache cleanup task is cancelled on shutdown but never awaited/handled. This can leave a pending task and produce "Task was destroyed but it is pending" warnings. Consider cancelling and then awaiting the task (suppressing asyncio.CancelledError) so shutdown is clean.
| def get_request_timeout() -> int: | ||
| """Maximum seconds for a generation request. 0 = no timeout.""" | ||
| return int(os.environ.get("REQUEST_TIMEOUT", 300)) |
There was a problem hiding this comment.
REQUEST_TIMEOUT is documented as a per-request generation timeout, but the only enforcement is in acquire_semaphore() while waiting for the GPU semaphore. If the intent is request cancellation for long generations, you’ll need to apply the timeout around generate()/stream_generate iteration as well; otherwise consider renaming/rewording this setting to reflect it only limits queue wait time.
| usage_stats = { | ||
| "input_tokens": chunk.prompt_tokens, | ||
| "output_tokens": chunk.generation_tokens, | ||
| } | ||
|
|
There was a problem hiding this comment.
Once a tool call start tag is detected, the streaming loop sets in_tool_call=True and then stops emitting any further text deltas permanently. If the model emits normal assistant text after the tool-call block, that suffix will be dropped from streaming output and from the final message (since display_text is derived from visible_text). Consider tracking tool_call_end and resuming text streaming after the tool-call region, or computing the final message text from process_tool_calls(full_text)['remaining_text'] to keep streaming/non-streaming behavior consistent.
| ge=0, | ||
| le=20, |
There was a problem hiding this comment.
top_logprobs is accepted in the request model but never used anywhere in the endpoint implementation. Either implement returning the top-N alternatives per token when top_logprobs is set, or reject/ignore the field explicitly to avoid an API that silently does something different than callers expect.
| ge=0, | |
| le=20, | |
| @field_validator("top_logprobs") | |
| @classmethod | |
| def validate_top_logprobs_supported(cls, value): | |
| if value is not None: | |
| raise ValueError( | |
| "top_logprobs is not supported by this server and must be omitted." | |
| ) | |
| return value |
| if ! output=$(bash -c "$cmd" 2>&1); then | ||
| printf "${RED}FAIL${NC} (command error)\n" | ||
| FAIL=$((FAIL + 1)) | ||
| RESULTS+=("FAIL: $name — command error") | ||
| return |
There was a problem hiding this comment.
This script evaluates command strings with eval, which makes the script fragile and opens command-injection footguns if variables like HOST or cmd contents ever become user-controlled. Prefer executing commands without eval (e.g., using bash arrays / direct pipelines) so quoting is correct and inputs can’t unexpectedly execute code.
| if printf '%s\n' "$output" | bash -c "$check" > /dev/null 2>&1; then | ||
| printf "${GREEN}PASS${NC}\n" | ||
| PASS=$((PASS + 1)) | ||
| RESULTS+=("PASS: $name") | ||
| else |
There was a problem hiding this comment.
The check is also executed via eval, which can execute arbitrary code if the check expression is modified or receives unexpected input. Consider making checks explicit commands/functions (or passing a command array) instead of eval'ing a string.
| class ContentPartOutputText(BaseModel): | ||
| """A text content part in an output message.""" | ||
|
|
||
| type: Literal["output_text"] = "output_text" | ||
| text: str = "" | ||
| annotations: List[str] = Field(default_factory=list) | ||
|
|
There was a problem hiding this comment.
Mutable list default for annotations can be shared between instances. Prefer Field(default_factory=list) for annotations (and other list fields) to avoid cross-instance mutation bugs.
| class ResponseMessageItem(BaseModel): | ||
| """An assistant message output item.""" | ||
|
|
||
| id: str = Field(default_factory=lambda: f"msg_{uuid.uuid4().hex[:24]}") | ||
| type: Literal["message"] = "message" | ||
| role: Literal["assistant"] = "assistant" | ||
| status: Literal["in_progress", "completed"] = "completed" | ||
| content: List[ContentPartOutputText] = Field(default_factory=list) | ||
|
|
There was a problem hiding this comment.
ResponseMessageItem.content uses a mutable list default. Use Field(default_factory=list) so each ResponseMessageItem has an independent content list.
| """Tests for the OpenAI Responses API (/v1/responses) compliance. | ||
|
|
||
| Covers: | ||
| A. Model validation (pure unit tests, no server/mlx needed) | ||
| B. Response store (pure unit tests) | ||
| C. Functional endpoint tests (TestClient, mocked model) | ||
| D. Streaming endpoint tests (TestClient, mocked model) |
There was a problem hiding this comment.
Module docstring says the suite covers sections A–D, but the file now includes additional sections (Prompt Cache, Concurrency Guard, finish_reason, JSON mode, etc.). Update the docstring to reflect the expanded scope so it stays accurate.
| """Tests for the OpenAI Responses API (/v1/responses) compliance. | |
| Covers: | |
| A. Model validation (pure unit tests, no server/mlx needed) | |
| B. Response store (pure unit tests) | |
| C. Functional endpoint tests (TestClient, mocked model) | |
| D. Streaming endpoint tests (TestClient, mocked model) | |
| """Tests for OpenAI Responses API (/v1/responses) compliance. | |
| Covers the core test sections in this module, including: | |
| A. Model validation (pure unit tests, no server/mlx needed) | |
| B. Response store (pure unit tests) | |
| C. Functional endpoint tests (TestClient, mocked model) | |
| D. Streaming endpoint tests (TestClient, mocked model) | |
| Also covers additional Responses API behaviors and regressions exercised | |
| elsewhere in this file, such as prompt cache handling, concurrency guards, | |
| finish_reason behavior, JSON mode, and related compatibility checks. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| _verbose = _is_verbose() | ||
|
|
||
|
|
There was a problem hiding this comment.
_verbose is computed once at import time. Because main() sets os.environ["VERBOSE"] only after the module has been imported, running the server with --verbose will still leave _verbose as False, so the new debug logging paths won’t activate. Consider either reading verbosity dynamically (call _is_verbose() at use sites) or updating the module-level _verbose after setting the env var in main() (or before any logging decisions are made).
| _verbose = _is_verbose() | |
| class _VerboseFlag: | |
| def __bool__(self) -> bool: | |
| return _is_verbose() | |
| _verbose = _VerboseFlag() |
|
|
||
|
|
There was a problem hiding this comment.
top_logprobs is accepted on ChatRequest but never used when constructing the response. As-is, clients can request top_logprobs and get no effect, which is a silent API contract mismatch. Either implement top_logprobs (return top-N alternatives per token) or reject/validate unsupported combinations (e.g., return 400 if top_logprobs is set without full support).
| @field_validator("top_logprobs") | |
| @classmethod | |
| def validate_top_logprobs_supported(cls, value): | |
| if value is not None: | |
| raise ValueError( | |
| "`top_logprobs` is not supported by this server and must be omitted." | |
| ) | |
| return value |
| # Suppress tool call tokens from being streamed as text | ||
| if not in_tool_call and tool_call_start_tag in full_text: | ||
| in_tool_call = True | ||
| if in_tool_call: | ||
| continue | ||
|
|
There was a problem hiding this comment.
In /responses streaming, once tool_call_start_tag appears you set in_tool_call=True and then continue for all subsequent chunks, but there’s no logic to detect tool_call_end and resume streaming any remaining assistant text after the tool call markup. This can truncate streamed output_text compared to non-streaming mode (which uses process_tool_calls(...).remaining_text). Consider buffering and stripping tool-call spans (start→end) or toggling in_tool_call back to False when the end tag is seen so non-tool text after the call is still emitted.
| status = "incomplete" if is_length else "completed" | ||
|
|
||
| # Use visible_text (sans tool call markup) for text events | ||
| display_text = visible_text.strip() |
There was a problem hiding this comment.
display_text = visible_text.strip() changes the model output by removing leading/trailing whitespace, which can make the final response.output_text.done / content_part.done text disagree with the streamed deltas (and with non-streaming output). It’s safer to preserve the exact accumulated text (no .strip()), and only post-process tool markup if needed.
| display_text = visible_text.strip() | |
| display_text = visible_text |
| mx.clear_cache() | ||
| gc.collect() | ||
| print("Generation finished, cleared cache.") | ||
|
|
There was a problem hiding this comment.
This unconditional print("Generation finished, cleared cache.") will emit a line per non-streaming /responses request, which can be noisy in production and adds overhead under load. Consider gating this behind the existing verbosity flag (or using structured logging with levels).
| except Exception as e: | ||
| # Cache reuse failed (e.g., shape mismatch, stale KV state). | ||
| # Invalidate and fall back to fresh generation. | ||
| print(f"[prompt_cache] Cache reuse failed, invalidating: {e}") | ||
| prompt_cache_state.invalidate() | ||
| reused_prefix_len = 0 |
There was a problem hiding this comment.
stream_generate() prints cache reuse/validation failures unconditionally. In production these paths can be hit during normal operation (TTL eviction, model reloads, cache mismatches), potentially spamming stdout and making logs hard to use. Consider routing these through the server’s verbosity/logging mechanism (e.g., logger.debug) or making them conditional on an explicit debug flag.
- Make _verbose a lazy flag that checks env var on each access, so --verbose CLI flag works after module import - Reject top_logprobs with validator (not implemented, was silently ignored) - Remove .strip() on streaming display_text to match streamed deltas - Gate remaining ungated prints behind _verbose (server.py + generate.py) - Add import os to generate.py for verbose env check Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Combined branch with all server enhancements developed on bastion. Consolidates 13 feature branches and 2 review passes (Claude Code + Copilot) into a single integration PR.
Features (issues → PRs)
Additional features (no issue/PR yet)
feature/default-max-tokens) —--default-max-tokensCLI flagfeature/default-repetition-penalty) — MoE degeneration preventionfeature/tool-template-compat) — Normalize Responses API tool format for Jinja templatesProduction fixes in combined branch
trim()for prefix reusesetdefault()Review findings addressed
Claude Code review + security scan:
--request-timeoutto semaphore acquisition (was dead code)--verboseflag to gate debug prints (PII leakage)resolve_response_formatin/v1/responses(JSON mode was dead)Copilot review:
sem.release()against UnboundLocalError/over-release (all 4 paths)Field(default_factory=list)evalwithbash -cin validate_oc.shargparseTest plan
🤖 Generated with Claude Code