fix(openai-client): salvage reasoning_content when message.content is empty #78
Open
rafaelreis-r wants to merge 3 commits into viperrcrypto:main from
Conversation
… empty
llama-server (and other OpenAI-compatible servers fronting reasoning
models such as Qwen3, GLM-4, gemma3, etc.) split the response into
`message.content` and `message.reasoning_content`. When the model
hits `max_tokens` before emitting the closing `</think>` token, the
client sees empty `content` and the actual answer sits in
`reasoning_content` instead.
When that happens, scan `reasoning_content` for the last JSON array or
object and use it as the response text. This matches the parsing that
callers (`categorizeBatch`, `enrichBatchSemanticTags`) already do on
the response (they look for the last `[...]`/`{...}` block), so
fishing it out a step earlier is the smallest unblocking change.
Effect: self-hosted setups behind llama-swap with thinking-prone models
no longer fail every batch with "No text content in AI response".
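A minimal sketch of the salvage step described above, not the PR's exact diff: the `ChatMessage` shape and `extractResponseText`/`findLastJsonBlock` names are hypothetical, and `findLastJsonBlock` is sketched under the next commit note below.

```ts
// Sketch: fall back to reasoning_content when message.content comes back empty.
interface ChatMessage {
  content?: string | null;
  // Non-standard field emitted by llama-server and similar servers for reasoning models.
  reasoning_content?: string | null;
}

function extractResponseText(message: ChatMessage): string {
  const content = message.content?.trim();
  if (content) return content; // normal path: behavior unchanged

  const reasoning = message.reasoning_content ?? "";
  // Look for the last JSON array/object in the reasoning trace, mirroring what
  // categorizeBatch / enrichBatchSemanticTags already do downstream.
  const salvaged = findLastJsonBlock(reasoning);
  if (salvaged) return salvaged;

  throw new Error("No text content in AI response");
}
```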
…g fallback
The first regex was too greedy and matched non-JSON brackets like "[link]" that appear in markdown placeholders inside reasoning text. Replace with a brace-balanced scan that finds all candidate substrings and tries JSON.parse on each in descending length order, returning the first one that parses.
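A sketch of that brace-balanced scan, under the same assumptions as above (the helper name is hypothetical; the PR's actual implementation may differ in detail):

```ts
// Collect every balanced [...] / {...} substring, then try JSON.parse on each
// candidate from longest to shortest and return the first one that parses.
function findLastJsonBlock(text: string): string | null {
  const candidates: string[] = [];
  for (const open of ["[", "{"] as const) {
    const close = open === "[" ? "]" : "}";
    for (let i = 0; i < text.length; i++) {
      if (text[i] !== open) continue;
      let depth = 0;
      for (let j = i; j < text.length; j++) {
        if (text[j] === open) depth++;
        else if (text[j] === close && --depth === 0) {
          candidates.push(text.slice(i, j + 1));
          break;
        }
      }
    }
  }
  candidates.sort((a, b) => b.length - a.length);
  for (const candidate of candidates) {
    try {
      JSON.parse(candidate);
      return candidate;
    } catch {
      // Not valid JSON (e.g. a "[link]" placeholder) -- keep trying shorter candidates.
    }
  }
  return null;
}
```

Unlike the first regex, invalid candidates such as `[link]` fail `JSON.parse` and are simply skipped.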
Self-hosted OpenAI-compat servers (llama-server, vLLM, ollama) often emit a <think>…</think> reasoning block before the answer. The caller's max_tokens is sized for the answer alone, so reasoning consumes the entire budget and content stays empty. When OPENAI_BASE_URL is set (proxy in use), auto-bump max_tokens to ≥ 8192. Tunable via OPENAI_MIN_MAX_TOKENS env var. No behavior change when calling the real OpenAI API (no OPENAI_BASE_URL env).
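A minimal sketch of that floor, using the env var names from the commit message (the function name and default constant are hypothetical):

```ts
// Only applies when OPENAI_BASE_URL points at a self-hosted proxy; calls to the
// real OpenAI API (no OPENAI_BASE_URL set) keep the caller's value untouched.
const DEFAULT_MIN_MAX_TOKENS = 8192;

function effectiveMaxTokens(requested: number): number {
  if (!process.env.OPENAI_BASE_URL) return requested; // real OpenAI API: unchanged
  const floor = Number(process.env.OPENAI_MIN_MAX_TOKENS) || DEFAULT_MIN_MAX_TOKENS;
  // Reasoning models spend tokens on <think>…</think> before the answer,
  // so the budget has to cover both the reasoning block and the answer.
  return Math.max(requested, floor);
}
```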
Summary
llama-server and other OpenAI-compatible servers fronting reasoning models (Qwen3, GLM-4, etc.) split responses into `message.content` and `message.reasoning_content`. When a model hits `max_tokens` before emitting the closing `</think>` token, `content` is empty and the actual answer is in `reasoning_content`. This currently surfaces as `No text content in AI response` failures for self-hosted users behind llama-swap, even though the model produced the requested JSON inside its reasoning trace.

When that happens, scan `reasoning_content` for the last JSON array/object and use it as the response text. Callers (`categorizeBatch`, `enrichBatchSemanticTags`) already extract the last `[...]`/`{...}` block from the response, so fishing it out a step earlier is the smallest unblocking change.

Test plan

- Normal responses (`content` populated): unchanged
- Empty `content` with JSON in `reasoning_content`: text is salvaged

🤖 Generated with Claude Code