
fix(openai-client): salvage reasoning_content when message.content is empty #78

Open

rafaelreis-r wants to merge 3 commits into viperrcrypto:main from rafaelreis-r:pr/fallback-reasoning-content

Conversation

@rafaelreis-r
Contributor

Summary

llama-server and other OpenAI-compatible servers fronting reasoning models (Qwen3, GLM-4, etc.) split responses into message.content and message.reasoning_content. When a model hits max_tokens before emitting the closing </think> token, content is empty and the actual answer is in reasoning_content.
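
For illustration, the response shape involved looks roughly like this (values invented, field names as above):

```typescript
// Invented example payload: what an OpenAI-compatible reasoning server
// returns when max_tokens runs out before the closing </think>.
const exampleResponse = {
  choices: [
    {
      finish_reason: "length", // budget exhausted mid-reasoning
      message: {
        content: "", // empty: triggers the failure described below
        reasoning_content:
          '<think>The user wants one category per transaction… ' +
          '[{"id": 1, "category": "groceries"}, {"id": 2, "category": "transport"}]',
      },
    },
  ],
};
```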

This currently surfaces as "No text content in AI response" failures for self-hosted users behind llama-swap, even though the model produced the requested JSON inside its reasoning trace.

When that happens, scan reasoning_content for the last JSON array/object and use it as the response text. Callers (categorizeBatch, enrichBatchSemanticTags) already extract the last [...]/{...} block from the response, so fishing it out a step earlier is the smallest unblocking change.
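
As a rough sketch of the shape of the fix (the helper names here are illustrative, not the actual code in this PR):

```typescript
// Sketch only: salvageText is a hypothetical name, and the extractor is
// injected so the snippet stands alone; a brace-balanced extractLastJson
// is sketched further down in the commit notes.
interface ChatMessage {
  content?: string | null;
  reasoning_content?: string | null;
}

function salvageText(
  message: ChatMessage,
  extractLastJson: (text: string) => string | null,
): string {
  // Normal path: content populated; behavior unchanged.
  if (message.content && message.content.trim() !== "") {
    return message.content;
  }
  // Fallback: the model hit max_tokens mid-<think>, so look for the
  // last JSON array/object inside the reasoning trace.
  const salvaged = message.reasoning_content
    ? extractLastJson(message.reasoning_content)
    : null;
  if (salvaged === null) {
    throw new Error("No text content in AI response");
  }
  return salvaged;
}
```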

Test plan

  • Standard OpenAI / non-thinking response (content populated): unchanged
  • llama-server response with empty content and JSON in reasoning_content: text is salvaged

🤖 Generated with Claude Code

… empty

llama-server (and other OpenAI-compatible servers fronting reasoning
models — Qwen3, GLM-4, gemma3, etc.) split the response into
`message.content` and `message.reasoning_content`. When the model
hits `max_tokens` before emitting the closing `</think>` token, the
client sees empty `content` and the actual answer sits in
`reasoning_content` instead.

When that happens, scan reasoning_content for the last JSON array or
object and use it as the response text. This matches the parsing that
callers (`categorizeBatch`, `enrichBatchSemanticTags`) already do on
the response — they look for the last `[...]`/`{...}` block — so
fishing it out a step earlier is the smallest unblocking change.

Effect: self-hosted setups behind llama-swap with thinking-prone models
no longer fail every batch with "No text content in AI response".
…g fallback

The first regex was too greedy and matched non-JSON brackets like
"[link]" that appear in markdown placeholders inside reasoning text.
Replace it with a brace-balanced scan that finds all candidate
substrings and tries JSON.parse on each in descending length order,
returning the first one that parses.
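
A minimal sketch of that scan, assuming the helper is named `extractLastJson` (string contents are not special-cased, which is fine for a sketch):

```typescript
// Collect every bracket/brace-balanced substring, then try JSON.parse on
// each candidate from longest to shortest, returning the first valid one.
function extractLastJson(text: string): string | null {
  const pairs = [
    ["[", "]"],
    ["{", "}"],
  ] as const;
  const candidates: string[] = [];
  for (const [open, close] of pairs) {
    for (let i = 0; i < text.length; i++) {
      if (text[i] !== open) continue;
      let depth = 0;
      for (let j = i; j < text.length; j++) {
        if (text[j] === open) depth++;
        else if (text[j] === close && --depth === 0) {
          candidates.push(text.slice(i, j + 1));
          break;
        }
      }
    }
  }
  // Longest first: a complete top-level array beats anything nested inside it.
  candidates.sort((a, b) => b.length - a.length);
  for (const candidate of candidates) {
    try {
      JSON.parse(candidate);
      return candidate;
    } catch {
      // Not valid JSON (e.g. a markdown "[link]" placeholder); keep scanning.
    }
  }
  return null;
}
```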
Self-hosted OpenAI-compat servers (llama-server, vLLM, ollama) often
emit a <think>…</think> reasoning block before the answer. The
caller's max_tokens is sized for the answer alone, so reasoning
consumes the entire budget and content stays empty.

When OPENAI_BASE_URL is set (proxy in use), auto-bump max_tokens to
≥ 8192. Tunable via OPENAI_MIN_MAX_TOKENS env var. No behavior change
when calling the real OpenAI API (no OPENAI_BASE_URL env).
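
A sketch of that guard, assuming the env var names above (the function name and call site are hypothetical):

```typescript
// Sketch: floor the requested max_tokens when a proxy/base URL is in use.
const DEFAULT_MIN_MAX_TOKENS = 8192;

function effectiveMaxTokens(requested: number): number {
  // No OPENAI_BASE_URL means the real OpenAI API: leave the request alone.
  if (!process.env.OPENAI_BASE_URL) return requested;
  // OPENAI_MIN_MAX_TOKENS overrides the 8192 floor; unset/invalid falls back.
  const min =
    Number(process.env.OPENAI_MIN_MAX_TOKENS) || DEFAULT_MIN_MAX_TOKENS;
  // Reasoning models spend budget on <think>…</think> before the answer,
  // so the floor leaves room for both.
  return Math.max(requested, min);
}
```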