forked from Blaizzy/mlx-vlm
Combined server enhancements: OpenAI API compliance, prompt caching, concurrency #21
Merged
32 commits
1407088 feat: OpenAI Responses API compliance with tool calling support (eloe)
0739cb7 fix: suppress tool call tokens from streaming text deltas (eloe)
f3fa553 feat: add prompt prefix caching to server endpoints (eloe)
a4048eb feat: combined OpenAI Responses + prompt prefix caching (eloe)
19a2860 feat: add concurrency guard and finish_reason=tool_calls to combined … (eloe)
5e66a84 test: comprehensive tests for prompt cache, concurrency guard, finish… (eloe)
0b954c9 feat: add stop sequences support for both endpoints (eloe)
7227505 fix: handle TurboQuant KV cache in prompt cache trimming (eloe)
f9e008e Merge feature/stop-sequences into combined-bastion (eloe)
a9bdcf1 feat: add stop sequences, tool_choice, and TurboQuant cache fix (eloe)
415da71 fix: stop sequences pass strings not token IDs to stopping criteria (eloe)
4ee6bf0 feat: add JSON mode, context tracking, and request cancellation (#7, … (eloe)
949bdf8 feat: prompt cache key routing for OpenClaw and Hermes compatibility (eloe)
318c962 feat: report cached_tokens in usage for OC/Hermes prompt caching (eloe)
12116f2 fix: use trim() for KV cache prefix reuse (TurboQuant compatible) (eloe)
ef8e0c3 debug: add prompt cache logging (eloe)
5712dd0 debug: log token mismatch details (eloe)
2853726 fix: skip cache save for short probe requests (<1024 tokens) (eloe)
873ad47 feat: production-ready prompt caching with probe request filter (eloe)
792bb8a fix: require substantial prefix match for KV cache reuse (eloe)
d49397f style: apply black, isort, autoflake formatting (eloe)
5250deb fix: stale KV cache recovery + TTL eviction + sanitized errors (eloe)
0396983 fix: use offset instead of keys.shape for TurboQuant cache validation (eloe)
617720c fix: add default repetition penalty to prevent MoE degeneration loops (eloe)
2391fca fix: normalize Responses API tool format for Jinja chat templates (eloe)
a398f44 debug: add request logging to responses endpoint for tool call invest… (eloe)
0a5bd0d debug: log formatted prompt tail for tool call investigation (eloe)
c471efc debug: log tool call detection in streaming responses (eloe)
cc2e09a feat: add --default-max-tokens CLI flag for server-side token limit (eloe)
c89ce53 fix: address code review and security findings (eloe)
a2befaf fix: address Copilot review findings (eloe)
eacb558 fix: address Copilot review round 3 (eloe)
stream_generate() prints cache reuse/validation failures unconditionally. In production these paths can be hit during normal operation (TTL eviction, model reloads, cache mismatches), which can flood stdout and bury the log lines that matter. Consider routing these messages through the server’s verbosity/logging mechanism (e.g., logger.debug) or making them conditional on an explicit debug flag.
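A minimal sketch of what this suggestion could look like, using only the standard library logging module. The logger name, the helper names report_cache_event and configure_logging, and the debug parameter are hypothetical and do not come from this PR; the actual call sites inside stream_generate() may differ.

```python
import logging

# Hypothetical logger name; the real module path in the server may differ.
logger = logging.getLogger("mlx_vlm.server.prompt_cache")


def report_cache_event(reason: str, cached_tokens: int, prompt_tokens: int) -> None:
    """Record a cache reuse/validation failure at DEBUG level instead of print()."""
    # Lazy %-style formatting: the string is only built if DEBUG is enabled.
    logger.debug(
        "prompt cache not reused: %s (cached=%d, prompt=%d)",
        reason,
        cached_tokens,
        prompt_tokens,
    )


def configure_logging(debug: bool = False) -> None:
    """Enable cache diagnostics only when an explicit debug flag is set."""
    logger.setLevel(logging.DEBUG if debug else logging.WARNING)
```

With this shape, routine events such as TTL eviction stay silent by default, and an operator who needs the diagnostics can opt in with a single flag rather than editing the code.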