fix(embedder): add unified input truncation to prevent token limit errors#760
yangxinxin-7 wants to merge 4 commits into volcengine:main
Conversation
…rors

Add truncate_text_by_tokens() with tiktoken (cl100k_base) as the primary tokenizer, UTF-8 byte count as a fast path, and byte-based truncation as a fallback. Apply _truncate_input() across all embedder implementations (OpenAI, Jina, Voyage, Volcengine, VikingDB) with per-model token limits.

Fixes volcengine#616
Fixes volcengine#686

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove unused `import re` in base.py
- Add a warning log when tiktoken initialization fails instead of silently passing
- Fix double blank line (PEP 8)
- Use identity check (`is not`) in _truncate_input to avoid full string comparison on long texts
- Add double-checked locking for thread-safe tiktoken singleton initialization
- Cache `_get_max_input_tokens()` result in the VikingDB embedder's __init__ to avoid repeated string matching per batch item

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Use sentinel _TIKTOKEN_NOT_AVAILABLE to cache a failed tiktoken init, preventing repeated import attempts and log spam on every embed() call when tiktoken is unavailable
- Replace the `is not` identity check with `len(truncated) < len(text)` in _truncate_input to correctly detect truncation regardless of string object identity

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- The fast path (UTF-8 bytes <= max_tokens) is safe for ALL models including BGE, since token count <= UTF-8 byte count holds universally
- When the fast path misses, cl100k_base may undercount BGE SentencePiece tokens (~1.76 tokens/CJK char vs cl100k_base merging); mitigated by conservative per-model limits (bge-large-zh=500, bge-m3=8000)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
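The three-layer strategy described in the commits above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `truncate_text_by_tokens` and the `_TIKTOKEN_NOT_AVAILABLE` sentinel are named in the PR, while the `_get_encoder` helper and the exact decode handling are assumptions for this sketch.

```python
import logging
import threading

# Sentinel distinguishing "not yet tried" (None) from "tried and failed",
# so a missing tiktoken is not re-imported (and re-logged) on every call.
_TIKTOKEN_NOT_AVAILABLE = object()
_encoder = None
_encoder_lock = threading.Lock()


def _get_encoder():
    """Lazily build the cl100k_base encoder with double-checked locking."""
    global _encoder
    if _encoder is None:
        with _encoder_lock:
            if _encoder is None:
                try:
                    import tiktoken
                    _encoder = tiktoken.get_encoding("cl100k_base")
                except Exception as exc:
                    logging.warning(
                        "tiktoken unavailable, using byte truncation: %s", exc
                    )
                    _encoder = _TIKTOKEN_NOT_AVAILABLE
    return None if _encoder is _TIKTOKEN_NOT_AVAILABLE else _encoder


def truncate_text_by_tokens(text: str, max_tokens: int) -> str:
    # Layer 1 (fast path): token count <= UTF-8 byte count holds for
    # cl100k_base, so a short-enough byte length means the text fits.
    if len(text.encode("utf-8")) <= max_tokens:
        return text
    enc = _get_encoder()
    if enc is not None:
        # Layer 2: exact tokenization; disallowed_special=() keeps special
        # tokens like <|endoftext|> from raising.
        tokens = enc.encode(text, disallowed_special=())
        if len(tokens) <= max_tokens:
            return text
        # errors="ignore" drops a trailing partially-decoded character.
        return enc.decode(tokens[:max_tokens], errors="ignore")
    # Layer 3 (fallback): byte-based truncation; errors="ignore" keeps a
    # cut multi-byte UTF-8 sequence from raising.
    return text.encode("utf-8")[:max_tokens].decode("utf-8", errors="ignore")
```

Note the decode-with-`errors="ignore"` choice: slicing tokens (or bytes) can cut a multi-byte character in half, and ignoring the dangling bytes keeps the result a clean prefix of the input.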
qin-ptr
left a comment
Summary
This PR fixes embedding failures when input text exceeds model token limits (closes #616, #686). The implementation adds truncate_text_by_tokens() with a three-tier strategy (UTF-8 fast path → tiktoken → byte fallback) and applies _truncate_input() across all embedder implementations.
Overall assessment: The truncation implementation is solid and well-tested, but there are two blocking issues that must be addressed before merging.
Blocking Issues
1. CI lint failure
The lint / lint check is failing. This must be fixed before the PR can be merged.
2. Incomplete fix for issue #686
Location: openviking/models/embedder/base.py (_truncate_input method)
Issue #686 identifies two root causes:
- No input truncation guard (✅ fixed by this PR)
- Unhandled exception in embedding queue blocks uvicorn (❌ not addressed)
From issue #686's logs:
RuntimeError: OpenAI API error: Error code: 400 - ...
This exception propagates from openai_embedders.py:embed() through collection_schemas.py:on_dequeue and appears to block the async event loop, making the entire HTTP server unresponsive.
Why this matters: Even with truncation in place, other embedding errors (network failures, API rate limits, etc.) could still crash the server.
Recommendation: Either:
- Add exception handling in collection_schemas.py:on_dequeue to gracefully handle embedding failures without blocking the event loop
- Or add a TODO comment and create a follow-up issue to track this work
Non-blocking Suggestions
3. Design choice: truncation vs chunking
Location: openviking/models/embedder/base.py (truncate_text_by_tokens function)
Simple truncation causes information loss. Related issues (#530 "long memory should use chunked vectorization", #531 "truncation vs chunking responsibilities") discuss chunking as a better long-term solution.
Suggestion:
- Document in the PR description or code comments why truncation was chosen over chunking
- Or add a config option to let users choose behavior (truncate / error / future chunking support)
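A config-driven version of the second suggestion might look like this. This is purely illustrative: `OverflowPolicy` and `apply_overflow_policy` are hypothetical names, and the byte-count check is a simplification standing in for real tokenization (it is the same conservative upper bound the PR's fast path relies on).

```python
from enum import Enum


class OverflowPolicy(str, Enum):
    TRUNCATE = "truncate"  # current PR behavior
    ERROR = "error"        # fail loudly instead of losing information
    # CHUNK = "chunk"      # possible future option per issues #530/#531


def apply_overflow_policy(text: str, max_tokens: int,
                          policy: OverflowPolicy) -> str:
    # UTF-8 byte count is an upper bound on cl100k token count, so a
    # short-enough byte length means the text definitely fits.
    if len(text.encode("utf-8")) <= max_tokens:
        return text
    if policy is OverflowPolicy.ERROR:
        raise ValueError(f"input may exceed {max_tokens} tokens")
    # TRUNCATE: byte-based cut; errors="ignore" drops a split character.
    return text.encode("utf-8")[:max_tokens].decode("utf-8", errors="ignore")
```

The `error` policy is useful for pipelines where silent truncation would corrupt retrieval quality and the caller would rather chunk upstream.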
4. Commit organization
The 4 commits should be squashed into one. Commits 2-4 ("fix code review issues", "fix tiktoken sentinel", "docs clarify") are incremental fixes to the initial implementation and should be merged for a cleaner git history.
5. Warning log detail
Location: openviking/models/embedder/base.py:197
The truncation warning should include original length:
```python
logging.warning(
    f"[{self.__class__.__name__}] Input truncated from {len(text)} chars "
    f"to {self.max_input_tokens} tokens (result: {len(truncated)} chars)"
)
```

6. Backward compatibility impact
Location: PR description
Should document that truncation changes embedding results. For existing vector databases, re-embedding the same text will produce different vectors (if text exceeded limits), potentially affecting query results.
7. Configurable token limits
Location: openviking/models/embedder/vikingdb_embedders.py (VIKINGDB_MODEL_MAX_TOKENS)
Token limits are hardcoded. Consider allowing users to override these in config for different deployment scenarios.
Highlights
- Comprehensive test coverage: 15 test cases covering ASCII/CJK text, special tokens, tiktoken fallback, UTF-8 boundaries
- Thread safety: Double-checked locking for tiktoken singleton
- Performance optimization: UTF-8 fast path avoids unnecessary tokenization; VikingDB caches token limit lookup
- Graceful fallback: Byte-based truncation when tiktoken is unavailable
- Clear documentation: Explains cl100k_base limitations for non-OpenAI models
Recommendation
REQUEST_CHANGES due to:
- CI lint failure (must fix)
- Incomplete fix for issue #686's second root cause (exception handling)
Once these are addressed, the PR is ready to merge. The truncation approach, while not optimal long-term (chunking is better), is acceptable as a quick fix.
🤖 I am a bot owned by @qin-ctx.
Summary
Fixes #616 and #686: embedding requests fail when input text exceeds model token limits.
- Add truncate_text_by_tokens(text, max_tokens) to base.py with three-layer truncation
- Add a max_input_tokens property and _truncate_input() to EmbedderBase; all embedders call it at embed()/embed_batch() entry
- Per-model token limits
Test plan
tests/unit/test_truncate_input.py — 15 cases (all passing), including that special tokens (<|endoftext|>) do not raise

Closes #616
Closes #686