fix(embedder): add unified input truncation to prevent token limit errors#760
yangxinxin-7 wants to merge 4 commits into volcengine:main
Conversation
…rors

Add truncate_text_by_tokens() with tiktoken (cl100k_base) as the primary tokenizer, UTF-8 byte count as a fast path, and byte-based truncation as a fallback. Apply _truncate_input() across all embedder implementations (OpenAI, Jina, Voyage, Volcengine, VikingDB) with per-model token limits.

Fixes volcengine#616
Fixes volcengine#686

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove unused `import re` in base.py
- Add a warning log when tiktoken initialization fails instead of silently passing
- Fix double blank line (PEP 8)
- Use identity check (`is not`) in _truncate_input to avoid full string comparison on long texts
- Add double-checked locking for thread-safe tiktoken singleton initialization
- Cache `_get_max_input_tokens()` result in the VikingDB embedder's __init__ to avoid repeated string matching per batch item

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Use sentinel _TIKTOKEN_NOT_AVAILABLE to cache a failed tiktoken init, preventing repeated import attempts and log spam on every embed() call when tiktoken is unavailable
- Replace the `is not` identity check with `len(truncated) < len(text)` in _truncate_input to correctly detect truncation regardless of string object identity

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- The fast path (UTF-8 bytes <= max_tokens) is safe for ALL models including BGE, since token count <= UTF-8 byte count holds universally
- When the fast path misses, cl100k_base may undercount BGE SentencePiece tokens (~1.76 tokens/CJK char vs cl100k_base merging); mitigated by conservative per-model limits (bge-large-zh=500, bge-m3=8000)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
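The three-layer strategy described in the commits above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `truncate_text_by_tokens` and the `_TIKTOKEN_NOT_AVAILABLE` sentinel are named in the PR, while the `_get_encoder` helper and the exact decode handling are assumptions for this sketch.

```python
import logging
import threading

# Sentinel distinguishing "not yet tried" (None) from "tried and failed",
# so a missing tiktoken is not re-imported (and re-logged) on every call.
_TIKTOKEN_NOT_AVAILABLE = object()
_encoder = None
_encoder_lock = threading.Lock()


def _get_encoder():
    """Lazily build the cl100k_base encoder with double-checked locking."""
    global _encoder
    if _encoder is None:
        with _encoder_lock:
            if _encoder is None:
                try:
                    import tiktoken
                    _encoder = tiktoken.get_encoding("cl100k_base")
                except Exception as exc:
                    logging.warning(
                        "tiktoken unavailable, using byte truncation: %s", exc
                    )
                    _encoder = _TIKTOKEN_NOT_AVAILABLE
    return None if _encoder is _TIKTOKEN_NOT_AVAILABLE else _encoder


def truncate_text_by_tokens(text: str, max_tokens: int) -> str:
    # Layer 1 (fast path): token count <= UTF-8 byte count holds for
    # cl100k_base, so a short-enough byte length means the text fits.
    if len(text.encode("utf-8")) <= max_tokens:
        return text
    enc = _get_encoder()
    if enc is not None:
        # Layer 2: exact tokenization; disallowed_special=() keeps special
        # tokens like <|endoftext|> from raising.
        tokens = enc.encode(text, disallowed_special=())
        if len(tokens) <= max_tokens:
            return text
        # errors="ignore" drops a trailing partially-decoded character.
        return enc.decode(tokens[:max_tokens], errors="ignore")
    # Layer 3 (fallback): byte-based truncation; errors="ignore" keeps a
    # cut multi-byte UTF-8 sequence from raising.
    return text.encode("utf-8")[:max_tokens].decode("utf-8", errors="ignore")
```

Note the decode-with-`errors="ignore"` choice: slicing tokens (or bytes) can cut a multi-byte character in half, and ignoring the dangling bytes keeps the result a clean prefix of the input.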
qin-ptr
left a comment
Summary
This PR fixes embedding failures when input text exceeds model token limits (closes #616, #686). The implementation adds truncate_text_by_tokens() with a three-tier strategy (UTF-8 fast path → tiktoken → byte fallback) and applies _truncate_input() across all embedder implementations.
Overall assessment: The truncation implementation is solid and well-tested, but there are two blocking issues that must be addressed before merging.
Blocking Issues
1. CI lint failure
The lint / lint check is failing. This must be fixed before the PR can be merged.
2. Incomplete fix for issue #686
Location: openviking/models/embedder/base.py (_truncate_input method)
Issue #686 identifies two root causes:
- No input truncation guard (✅ fixed by this PR)
- Unhandled exception in embedding queue blocks uvicorn (❌ not addressed)
From issue #686's logs:
RuntimeError: OpenAI API error: Error code: 400 - ...
This exception propagates from openai_embedders.py:embed() through collection_schemas.py:on_dequeue and appears to block the async event loop, making the entire HTTP server unresponsive.
Why this matters: Even with truncation in place, other embedding errors (network failures, API rate limits, etc.) could still crash the server.
Recommendation: Either:
- Add exception handling in collection_schemas.py:on_dequeue to gracefully handle embedding failures without blocking the event loop
- Or add a TODO comment and create a follow-up issue to track this work
Non-blocking Suggestions
3. Design choice: truncation vs chunking
Location: openviking/models/embedder/base.py (truncate_text_by_tokens function)
Simple truncation causes information loss. Related issues (#530 "long memory should use chunked vectorization", #531 "truncation vs chunking responsibilities") discuss chunking as a better long-term solution.
Suggestion:
- Document in the PR description or code comments why truncation was chosen over chunking
- Or add a config option to let users choose behavior (truncate / error / future chunking support)
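A config-driven version of the second suggestion might look like this. This is purely illustrative: `OverflowPolicy` and `apply_overflow_policy` are hypothetical names, and the byte-count check is a simplification standing in for real tokenization (it is the same conservative upper bound the PR's fast path relies on).

```python
from enum import Enum


class OverflowPolicy(str, Enum):
    TRUNCATE = "truncate"  # current PR behavior
    ERROR = "error"        # fail loudly instead of losing information
    # CHUNK = "chunk"      # possible future option per issues #530/#531


def apply_overflow_policy(text: str, max_tokens: int,
                          policy: OverflowPolicy) -> str:
    # UTF-8 byte count is an upper bound on cl100k token count, so a
    # short-enough byte length means the text definitely fits.
    if len(text.encode("utf-8")) <= max_tokens:
        return text
    if policy is OverflowPolicy.ERROR:
        raise ValueError(f"input may exceed {max_tokens} tokens")
    # TRUNCATE: byte-based cut; errors="ignore" drops a split character.
    return text.encode("utf-8")[:max_tokens].decode("utf-8", errors="ignore")
```

The `error` policy is useful for pipelines where silent truncation would corrupt retrieval quality and the caller would rather chunk upstream.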
4. Commit organization
The 4 commits should be squashed into one. Commits 2-4 ("fix code review issues", "fix tiktoken sentinel", "docs clarify") are incremental fixes to the initial implementation and should be merged for a cleaner git history.
5. Warning log detail
Location: openviking/models/embedder/base.py:197
The truncation warning should include original length:
```python
logging.warning(
    f"[{self.__class__.__name__}] Input truncated from {len(text)} chars "
    f"to {self.max_input_tokens} tokens (result: {len(truncated)} chars)"
)
```

6. Backward compatibility impact
Location: PR description
Should document that truncation changes embedding results. For existing vector databases, re-embedding the same text will produce different vectors (if text exceeded limits), potentially affecting query results.
7. Configurable token limits
Location: openviking/models/embedder/vikingdb_embedders.py (VIKINGDB_MODEL_MAX_TOKENS)
Token limits are hardcoded. Consider allowing users to override these in config for different deployment scenarios.
Highlights
- Comprehensive test coverage: 15 test cases covering ASCII/CJK text, special tokens, tiktoken fallback, UTF-8 boundaries
- Thread safety: Double-checked locking for tiktoken singleton
- Performance optimization: UTF-8 fast path avoids unnecessary tokenization; VikingDB caches token limit lookup
- Graceful fallback: Byte-based truncation when tiktoken is unavailable
- Clear documentation: Explains cl100k_base limitations for non-OpenAI models
Recommendation
REQUEST_CHANGES due to:
- CI lint failure (must fix)
- Incomplete fix for issue #686's second root cause (exception handling)
Once these are addressed, the PR is ready to merge. The truncation approach, while not optimal long-term (chunking is better), is acceptable as a quick fix.
🤖 I am a bot owned by @qin-ctx.
Summary
Fixes #616 and #686: embedding requests fail when input text exceeds model token limits.
- Add truncate_text_by_tokens(text, max_tokens) to base.py with three-layer truncation
- Add a max_input_tokens property and _truncate_input() to EmbedderBase; all embedders call it at embed()/embed_batch() entry
- Per-model token limits
Test plan
tests/unit/test_truncate_input.py — 15 cases (all passing), including that special tokens (<|endoftext|>) do not raise

Closes #616
Closes #686