fix(embedder): auto-retry with smaller chunks when input exceeds model limit#736
deepakdevp wants to merge 2 commits into volcengine:main
Conversation
…l limit

When the embedding API rejects input as "too large" (common with non-OpenAI models, where token estimation is inaccurate), retry with chunking at half the current max_tokens instead of crashing. Also logs a warning guiding users to set embedding.dense.max_tokens in ov.conf to match their model's actual limit.

Fixes volcengine#731.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
qin-ptr left a comment
Review Summary
This PR fixes a crash when embedding input exceeds model limits by adding automatic retry with smaller chunks. However, there is a critical concurrency bug that must be fixed before merging.
See inline comments for details.
Blocking issue: The implementation modifies instance attribute _max_tokens which is not thread-safe when embed() is called concurrently via asyncio.to_thread() (as seen in collection_schemas.py:251).
```python
reduced = max(self.max_tokens // 2, 128)
logger.warning(
    f"Embedding failed due to input length. "
    f"Retrying with chunk size {reduced} tokens. "
```
[Bug] (blocking)
Concurrency Safety Issue: Directly modifying the instance attribute self._max_tokens is not thread-safe.
From collection_schemas.py:251-252, embed() is called via asyncio.to_thread() which allows concurrent execution in a thread pool. If multiple threads call embed() simultaneously:
- Thread A sets `self._max_tokens = reduced`
- Thread B reads `self.max_tokens` (in `_chunk_and_embed()` or `_chunk_text()`) and gets the wrong value
- Thread A restores `self._max_tokens` in `finally`, potentially overwriting Thread B's modification
Suggested Fix: Pass max_tokens as a parameter instead of modifying the instance attribute.
Option 1: Add optional parameter to _chunk_and_embed():
```python
def _chunk_and_embed(self, text: str, is_query: bool = False, override_max_tokens: Optional[int] = None) -> EmbedResult:
    max_tok = override_max_tokens if override_max_tokens is not None else self.max_tokens
    # Use max_tok instead of self.max_tokens
    ...

# In the retry logic:
reduced = max(self.max_tokens // 2, 128)
return self._chunk_and_embed(text, is_query=is_query, override_max_tokens=reduced)
```

This approach is thread-safe because each thread uses its own local variable instead of modifying shared state.
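The isolation the reviewer describes can be demonstrated with a runnable sketch. `FakeEmbedder` and its stubbed return value are hypothetical stand-ins for illustration, not the repository's actual embedder:

```python
from concurrent.futures import ThreadPoolExecutor

class FakeEmbedder:
    """Hypothetical minimal embedder mirroring the PR's parameter-passing fix."""

    def __init__(self, max_tokens=8000):
        self._max_tokens = max_tokens

    @property
    def max_tokens(self):
        return self._max_tokens

    def _chunk_and_embed(self, text, override_max_tokens=None):
        # Each call resolves its effective limit into a local variable, so
        # concurrent calls never observe each other's retry state.
        max_tok = override_max_tokens if override_max_tokens is not None else self.max_tokens
        return max_tok  # stand-in for the real chunk-and-embed result

emb = FakeEmbedder()
with ThreadPoolExecutor(max_workers=4) as pool:
    # Simulate concurrent calls, each retrying with a different reduced limit.
    results = list(pool.map(
        lambda n: emb._chunk_and_embed("x", override_max_tokens=n),
        [4000, 2000, 1000, None],
    ))
print(results)  # [4000, 2000, 1000, 8000]
```

Because `pool.map` preserves input order, each call sees only its own override, and the instance-level default (8000) is untouched.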
Fixed. Added override_max_tokens parameter to _chunk_and_embed() and _chunk_text() in base.py. The retry now passes override_max_tokens=reduced instead of mutating self._max_tokens. Thread-safe.
```python
except RuntimeError as e:
    error_msg = str(e).lower()
    if (
        "too large" in error_msg
```
[Suggestion] (non-blocking)
Error Message Matching May Be Too Broad: The current check for "too large", "too long", or "maximum context length" might match unrelated errors:
- "Request body too large" (not a token length issue)
- "File too large" (not an embedding input issue)
Consider more precise matching:
```python
if (
    ("input" in error_msg and "too large" in error_msg)
    or ("token" in error_msg and ("too long" in error_msg or "too many" in error_msg))
    or "context length" in error_msg
):
```

Or check for specific API error codes if available.
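Pulled out as a standalone helper, the suggested matching can be sanity-checked directly. The function name here is illustrative, not from the PR:

```python
def looks_like_input_length_error(error_msg: str) -> bool:
    """Heuristic check for 'input too long' style embedding errors,
    following the reviewer's suggested patterns."""
    msg = error_msg.lower()
    return (
        ("input" in msg and "too large" in msg)
        or ("token" in msg and ("too long" in msg or "too many" in msg))
        or "context length" in msg
    )

# Matches genuine length errors...
assert looks_like_input_length_error("This model's maximum context length is 512 tokens")
assert looks_like_input_length_error("Input is too large for this model")
# ...but not unrelated size errors.
assert not looks_like_input_length_error("Request body too large")
assert not looks_like_input_length_error("File too large")
```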
Fixed. Tightened error matching to require 'input' + 'too large', 'token' + 'too long/many', or 'context length'. Won't match unrelated 'Request body too large' errors.
…ching
- Add override_max_tokens parameter to _chunk_and_embed() and
_chunk_text() instead of mutating self._max_tokens (thread-safe)
- Tighten error message matching to require specific patterns
("input" + "too large", "token" + "too long/many", "context length")
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- Retry with smaller chunks at half the current `max_tokens` instead of crashing
- Log a warning guiding users to set `embedding.dense.max_tokens` in `ov.conf`

Fixes #731
Root Cause
The default `max_tokens` is 8000 (designed for OpenAI models). Users with models that have smaller limits (e.g., `bce-embedding-base_v1` with 512 max tokens) hit errors because:

- Token estimation falls back to `len(text) // 3` for non-OpenAI models (no tiktoken), which underestimates CJK text
- `_chunk_and_embed()` creates chunks of `max_tokens` size, but each chunk still exceeds the model's actual limit

Fix
In `OpenAIDenseEmbedder.embed()`, catch "too large/too long/maximum context length" errors from `_embed_single()` and retry with `_chunk_and_embed()` at half the `max_tokens`. This handles both causes listed above.

Changes Made
- `openviking/models/embedder/openai_embedders.py`: Added error-driven retry in `embed()` with reduced chunk size and user-facing warning

Type of Change
Testing
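The retry path described above can be exercised end-to-end with a minimal self-contained sketch. All names below (`embed_with_retry`, the fake backends) are illustrative stand-ins, not the repository's actual classes:

```python
def embed_with_retry(embed_single, chunk_and_embed, text, max_tokens=8000):
    """Sketch of the PR's error-driven retry: try the direct path first,
    then fall back to chunking at half the limit on a length error."""
    try:
        return embed_single(text)
    except RuntimeError as e:
        msg = str(e).lower()
        if ("input" in msg and "too large" in msg) or "context length" in msg:
            reduced = max(max_tokens // 2, 128)  # halve, with a 128-token floor
            return chunk_and_embed(text, override_max_tokens=reduced)
        raise  # unrelated errors still propagate

# Fake backend that rejects anything estimated over 512 "tokens".
def fake_embed_single(text):
    if len(text) // 3 > 512:
        raise RuntimeError("input is too large for this model")
    return ["vector"]

def fake_chunk_and_embed(text, override_max_tokens):
    return [f"chunk@{override_max_tokens}"]

print(embed_with_retry(fake_embed_single, fake_chunk_and_embed, "x" * 10000))
# ['chunk@4000']
```

Long input trips the fake length check and lands on the chunked path with the halved limit (8000 // 2 = 4000), while short input returns through the direct path untouched.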